NAME
    sitescooper - download news from web sites and convert it
    automatically into one of several formats suitable for viewing
    on a Palm handheld.

SYNOPSIS
    sitescooper [options] [ [-site sitename] ...]

    sitescooper [options] [-sites sitename ...]

    sitescooper [options] [-name nm] [-levels n] [-storyurl regexp]
    [-set sitefileparam value] url [...]

    Options: [-debug] [-refresh] [-fullrefresh] [-config file]
    [-install dir] [-instapp app] [-dump] [-dumpprc] [-nowrite]
    [-nodates] [-quiet] [-admin cmd] [-nolinkrewrite]
    [-stdout-to file] [-badcache] [-keep-tmps] [-fromcache]
    [-noheaders] [-nofooters] [-outputtemplate file.tmpl] [-grep]
    [-profile file.nhp] [-profiles file.nhp file2.nhp ...]
    [-filename template] [-prctitle template] [-parallel] [-disc]
    [-limit numkbytes] [-maxlinks numlinks] [-maxstories numstories]

    [-text | -html | -mhtml | -doc | -plucker | -mplucker | -isilo |
    -misilo | -richreader | -pipe fmt command] [-bw | -color]
    [-cvtargs args_for_converter]

DESCRIPTION
    This script, in conjunction with its configuration file and its
    set of site files, will download news stories from several top
    news sites into text format and/or onto your Palm handheld (with
    the aid of the makedoc/MakeDocW or iSilo utilities).

    Alternatively URLs can be supplied on the command line, in which
    case those URLs will be downloaded and converted using a
    reasonable set of default settings.

    Both HTTP URLs and local files (using the `file:///' protocol)
    are supported.

    Multiple types of sites are supported:

        1-level sites, where the text to be converted is all present
        on one page (such as Slashdot, Linux Weekly News, BluesNews,
        NTKnow, Ars Technica);

        2-level sites, where the text to be converted is linked to
        from a Table of Contents page (such as Wired News, BBC News,
        and I, Cringely);

        3-level sites, where the text to be converted is linked to
        from a Table of Contents page, which in turn is linked to
        from a list of issues page (such as PalmPower).

    In addition, sites that post news as items on one big page,
    such as Slashdot, Ars Technica, and BluesNews, are supported
    using diff.

    Note that the URLs-on-the-command-line invocation format does
    not currently support 2- or 3-level sites.
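
    A single-page scoop from the command line, using the default
    settings and the NTKnow URL from the EXAMPLES section below,
    might look like:

            sitescooper.pl -name NTKnow http://www.ntk.net/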

    The script is portable to most UNIX variants that support perl,
    as well as the Win32 platform (tested with ActivePerl 5.00502
    build 509).

    sitescooper maintains a cache in its temporary directory; files
    are kept in this cache for a week at most. Ditto for the text
    output directory (set with TextSaveDir in the built-in
    configuration).

    If a password is required for the site, and the current
    sitescooper session is interactive, the user will be prompted
    for the username and password. This authentication token will be
    saved for later use. This way a site that requires login can be
    set up as a .site -- just log in once, and your password is
    saved for future non-interactive runs.

    Note however that the encryption used to hide the password in
    the sitescooper configuration is pretty transparent; rather
    than using your own username and password to log in to
    passworded sites, I recommend using a dedicated sitescooper
    account instead.

OPTIONS
    -refresh
        Refresh all links -- ignore the already_seen file, do not
        diff pages, and always fetch links. If a cached page is
        available, it will be used.

    -fullrefresh
        Refresh all links -- ignore the already_seen file, do not
        diff pages, and always fetch links, even if they are
        available in the cache.

    -config file
        Read the configuration from file instead of using the built-
        in one.

    -limit numkbytes
        Set the limit for output file size to numkbytes kilobytes,
        instead of the default 200K. A limit of 0 means unlimited
        output.

    -maxlinks numlinks
        Stop retrieving web pages after numlinks have been
        traversed. This is not used to specify how "deep" a site
        should be scooped -- it is the number of links followed in
        total.

    -maxstories numstories
        Stop retrieving web pages after numstories stories have been
        retrieved.
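
        For example, to cap a single run at 50 links and 10
        stories (using a site file named elsewhere in this
        manual):

                sitescooper.pl -maxlinks 50 -maxstories 10 -site ntk.site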

    -install dir
        The directory to save PDB files to once they've been
        converted, in order to have them installed to your Palm
        handheld.

    -instapp app
        The application to run to install PDB files onto your Palm,
        once they've been converted.

    -site sitename
        Limit the run to the site named in the sitename argument.
        Normally all available sites will be downloaded. To limit
        the run to 2 or more sites, provide multiple -site arguments
        like so:

                -site ntk.site -site tbtf.site

    -sites sitename [...]
        Limit the run to multiple sites; an easier way to specify
        multiple sites than using the -site argument for each file.
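
        For example:

                sitescooper.pl -sites ntk.site tbtf.site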

    -grep
        Use James Brown's NewsHound profile searching code. Any
        sites that do not contain IgnoreProfiles: 1 will then be
        searched for the active profiles. Active profiles are loaded
        from the ProfileDir specified in the sitescooper
        configuration file, or specified using the -profile or
        -profiles arguments.

    -profile file.nhp
        Load the NewsHound profile named in the file.nhp argument,
        for use with -grep. Normally all active profiles are
        loaded from the ProfileDir. To load 2 or more profiles,
        provide multiple -profile arguments like so:

                -profile foo.nhp -profile bar.nhp

    -profiles file.nhp [...]
        Load multiple profiles; an easier way to specify multiple
        profiles than using the -profile argument for each file.
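
        For example (the profile file names here are
        illustrative):

                sitescooper.pl -grep -profiles foo.nhp bar.nhp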

    -name name
        When specifying a URL on the command-line, this provides the
        name that should be used when installing the site to the
        Pilot. It acts exactly the same way as the Name: field in a
        site file.

    -levels n
        When specifying a URL on the command-line, this indicates
        how many levels a site has. Not needed when using .site
        files.

    -storyurl regexp
        When specifying a URL on the command-line, this indicates
        the regular expression which links to stories should conform
        to. Not needed when using .site files.
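
        For example, a sketch using a hypothetical site and
        pattern (subject to the note in DESCRIPTION about 2- and
        3-level sites on the command line):

                sitescooper.pl -name Example -levels 2 -storyurl '.*/stories/.*' http://news.example.com/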

    -doc
        Convert the page(s) downloaded into DOC format, with all the
        articles listed in full, one after the other.

    -text
        Convert the page(s) downloaded into plain text format, with
        all the articles listed in full, one after the other.

    -html
        Convert the page(s) downloaded into HTML format, on one big
        page, with a table of contents (taken from the site if
        possible), followed by all the articles one after another.

    -mhtml
        Convert the page(s) downloaded into HTML format, but retain
        the multiple-page format. This will create the output in a
        directory called site_name; in conjunction with the -dump
        argument, it will output the path of this directory on
        standard output before exiting.
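
        For example, the following will convert the site into a
        directory of HTML pages and print that directory's path:

                sitescooper.pl -mhtml -dump -site ntk.site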

    -plucker
        Convert the page(s) downloaded into Plucker format (see
        http://plucker.gnu-designs.com/ ), on one big page. The
        page(s) will be displayed with a table of contents (taken
        from the site if possible), followed by all the articles one
        after another.

    -isilo
        Convert the page(s) downloaded into iSilo format (see
        http://www.isilo.com/ ), on one big page. This is the
        default. The page(s) will be displayed with a table of
        contents (taken from the site if possible), followed by all
        the articles one after another.

    -misilo
        Convert the page(s) downloaded into iSilo format (see
        http://www.isilo.com/ ), with one iSilo document per site,
        with each story on a separate page. The iSilo document will
        have a table-of-contents page, taken from the site if
        possible, with each article on a separate page.

    -richreader
        Convert the page(s) downloaded into RichReader format using
        HTML2Doc.exe (see
        http://users.erols.com/arenakm/palm/RichReader.html ). The
        page(s) will be displayed with a table of contents (taken
        from the site if possible), followed by all the articles one
        after another.

    -pipe fmt command
        Convert the page(s) downloaded into an arbitrary format,
        using the command provided. Sitescooper will still rewrite
        the page(s) according to the fmt argument, which should be
        one of:

    text    Plain text format.

    html    HTML in one big page.

    mhtml   HTML in multiple pages.

        The command argument can contain `__SCOOPFILE__', which will
        be replaced with the filename of the file containing the
        rewritten pages in the above format, `__SYNCFILE__', which
        will be replaced with a suitable filename in the Palm
        synchronization folder, and `__TITLE__', which will be
        replaced by the title of the file (generally a string
        containing the date and site name).

        Note that for the -mhtml switch, `__SCOOPFILE__' will be
        replaced with the name of the file containing the table-of-
        contents page. It's up to the conversion utility to follow
        the href links to the other files in that directory.
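
        For example, a sketch using a hypothetical converter
        `myconv' which reads a scoop file and writes a Palm
        database:

                sitescooper.pl -pipe text "myconv __SCOOPFILE__ __SYNCFILE__"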

    -cvtargs args
        Arguments to pass to the conversion utility. For example,
        Plucker will display images better on some Palms using
        "-cvtargs --bpp=2" or "-cvtargs --bpp=4".

    -bw Indicate that the target can display black-and-white images
        only. This is generally the default for iSilo and Plucker.

    -color
        Indicate that the target can display colour images.

    -fixlinks
        Rewrite links to external sites or unscooped pages as
        underlined text, to differentiate them from links to scooped
        pages. This is the default behaviour for most formats apart
        from -plucker or -mplucker.

    -keeplinks
        Do not rewrite links to external sites or unscooped pages;
        leave them pointing outside the current scoop. However,
        links to other pages that are included in the current scoop
        are rewritten to point to the scooped pages instead of the
        source URL. This is the default for Plucker (the -plucker
        or -mplucker arguments).

    -nolinkrewrite
        Do not rewrite links on scooped documents -- leave them
        exactly as they are, even links to other scooped pages.
        See also -keeplinks.

    -dump
        Output the page(s) downloaded directly to stdout in text or
        HTML format, instead of writing them to files and converting
        each one. This option NO LONGER implies -text, as it used
        to; to dump text, use -dump -text.

    -dumpprc
        Output the page(s) downloaded directly to stdout, in
        converted format as a PDB file (note: not PRC format!),
        suitable for installation to a Palm handheld.
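
        For example, using UNIX shell redirection (the output
        filename is up to you):

                sitescooper.pl -dumpprc -site ntk.site > ntk.pdb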

    -nowrite
        Test mode -- do not write to the cache or already_seen file,
        instead write what would be written normally to a directory
        called new_cache and a new_already_seen file. This is very
        handy when writing a new site file.

    -badcache
        Send some HTTP headers to bypass web caching proxy servers.
        This is generally useful if a web caching proxy server
        somewhere between sitescooper and the target site is
        returning out-of-date files.

    -debug
        Enable debugging output. This output is in addition to the
        usual progress messages.

    -quiet
        Process sites quietly, without printing the usual progress
        messages to STDERR. Warnings about incorrect site files and
        system errors will still be output, however.

    -admin cmd
        Perform an administrative command. This is intended to ease
        the task of writing scripts which use sitescooper output.
        The following admin commands are available:

    dump-sites
            List the sites which would be scooped on a scooping run,
            and their URLs. Instead of scooping any sites,
            sitescooper will exit after performing this task. The
            format is one site per line, with the site file name
            first, a tab, the site's URL, a tab, the site name, a
            tab, and the output filename that would be generated
            without path or extension. For example:

            foobar.site  http://www.foobar.com/  Foo Bar  1999_01_01_Foo_Bar

    journal Write a journal with dumps of the documents as they pass
            through the formatting and stripping steps of the
            scooping process. This is written to a file called
            journal in the sitescooper temporary directory.

    import-cookies file
            Import a Netscape cookies file into sitescooper, so
            that sites which require cookies can use them (the
            economist_full.site site file, for example, requires
            this). Here's how to import cookies on a UNIX machine:

            sitescooper.pl -admin import-cookies ~/.netscape/cookies

            and on Windows:

            perl sitescooper.pl -admin import-cookies "C:\Program
            Files\Netscape\Users\Default\cookies.txt"

            Unfortunately, MS Internet Explorer cookies are
            currently unsupported. If you wish to write a patch to
            support them, that'd be great.

    -noheaders
        Do not attach the sitescooper header (URL, site name, and
        navigation links) to each page.

    -nofooters
        Do not attach the sitescooper footer ("copyright retained by
        original authors" blurb) to each page.

    -outputtemplate file.tmpl
        Read the output formatting template from the file file.tmpl.
        This overrides the settings of the -noheaders and -nofooters
        flags. See the OUTPUT TEMPLATES section below for details on
        this.

    -fromcache
        Do not perform any network access, retrieve everything from
        the cache or the shared cache.

    -filename template
        Change the format of output filenames. template contains the
        following keyword strings, which are substituted as follows:

    YYYY    The current year, in 4-digit format.

    MM      The current month number (from 01 to 12), in 2-digit format.

    Mon     The current month name (from Jan to Dec), in 3-letter
            format.

    DD      The current day of the month (from 01 to 31), in 2-digit
            format.

    Day     The current day of the week (from Sun to Sat), in 3-letter
            format.

    hh      The current hour (from 00 to 23), in 2-digit format.

    mm      The current minute (from 00 to 59), in 2-digit format.

    Site    The current site's name.

    Section The section of the current site (now obsolete).

        The default filename template is YYYY_MM_DD_Site.
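
        For example, a template of Mon_DD_Site would name the Foo
        Bar site's output something like Jan_01_Foo_Bar:

                sitescooper.pl -filename Mon_DD_Site -site foobar.site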

    -prctitle template
        Change the format of the titles of the resulting PDB files.
        template may contain the same keyword strings as -filename.

        The default PDB title template is YYYY-Mon-DD: Site.

    -nodates
        Do not put the date in the installable file's filename. This
        allows you to automatically overwrite old files with new
        ones when you HotSync. It's a compatibility shortcut for
        -filename Site -prctitle "Site".

    -preload preload_method
        Preload pages using the given preload method. Currently
        supported preload methods are:

    lwp     Use the Perl LWP module to load pages. This is the default,
            and is single-threaded; in other words, each page needs
            to load fully before the next page can be requested.

    fork[n] Use a number of subprocesses running LWP requests to load
            pages. This is multi-threaded, and several pages can be
            loaded at once; however, this costs more network
            bandwidth, CPU time, and memory. The optional n
            argument instructs sitescooper to use that number of
            processes; the default n is 4. This is only available on
            UNIX at the moment.
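
        For example, to preload pages with the default number of
        subprocesses (UNIX only):

                sitescooper.pl -preload fork -site ntk.site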

    -disc
        Disconnect a PPP connection once the scooping has finished.
        Currently this code is experimental, and will probably only
        work on Macintoshes. This is off by default.

    -stdout-to file
        Redirect the output of sitescooper into the named file. This
        is needed on Windows NT and 95, where certain combinations
        of perl and Windows do not seem to support the >
        operator.
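
        For example, on Windows (the output filename is
        illustrative):

                perl sitescooper.pl -dump -text -stdout-to out.txt http://www.ntk.net/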

    -keep-tmps
        Keep temporary files after conversion. Normally the .txt or
        .html rendition of a site is deleted after conversion; this
        option keeps it around.

OUTPUT TEMPLATES
    You can control exactly what HTML or text is written to the
    output file using the -outputtemplate argument. This argument
    takes the name of a file, which is read and parsed to provide
    replacement templates for sitescooper.

    The file is read as an HTML- or XML-style tagged format; for
    example, the template for the main page in HTML format is read
    from between the <htmlmainpage> and </htmlmainpage> tags. The
    templates that can be defined are as follows:

    htmlmainpage
        The main page, in HTML format; this is used when the -html
        output format, or one based on it (such as -plucker or
        -isilo), is used. It is also used for the -mhtml format's
        main (top-level) page.

    htmlsubpage
        Sub-page, in HTML format; this is used for the -mhtml output
        format's sub-pages, i.e. pages other than the top-level one.

    htmlstory
        The snippet of HTML encapsulating each story. This is
        included for each piece of snarfed text, in all HTML files.

    textmainpage
        The main page, in text format; this is used when the -text
        output format, or one based on it (such as -doc), is used.

    textsubpage
        Sub-page, in text format; this is currently unused.

    textstory
        The snippet of text encapsulating each story. This is
        included for each piece of snarfed text, in all text-format
        or DOC-format files.
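
    As an illustrative sketch, a template file simply wraps each
    template in the tags named above; the `...' placeholders below
    stand for whatever HTML you choose (see the sample file
    mentioned below for a complete example):

        <htmlmainpage>
        ... HTML for the top-level page ...
        </htmlmainpage>

        <htmlstory>
        ... HTML wrapped around each story ...
        </htmlstory>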

    A sample template file is provided in the file
    `default_templates.html'; this may have been installed in the
    sitescooper install directory, /usr/share/sitescooper, or
    /usr/local/share/sitescooper. Note that the actual templates
    used are not loaded from this file; instead they are
    incorporated inside the sitescooper script, so changing this
    file will have no effect.

INSTALLATION
    To install, edit the script and change the #! line. You may also
    need to (a) change the Pilot install dir if you plan to use the
    pilot installation functionality, and (b) edit the other
    parameters marked with CUSTOMISE in case they need to be
    customised for your site. They should be set to acceptable
    defaults (unless I forgot to comment out the proxy server lines
    I use ;).

EXAMPLES
            sitescooper.pl http://www.ntk.net/

    To snarf the ever-cutting NTKnow newsletter.

            sitescooper.pl -refresh -html http://www.ntk.net/

    To snarf NTKnow, ignoring any previously-read text, and
    producing HTML output.

            sitescooper.pl -refresh -html -site site_samples/tech/ntk.site

    To snarf NTKnow using the site file provided with the main
    distribution, producing HTML output.

ENVIRONMENT
    sitescooper makes use of the `$http_proxy' environment variable,
    if it is set.
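
    For example, in a UNIX shell (the proxy host and port are
    illustrative):

            http_proxy=http://proxy.example.com:8080/ sitescooper.pl -site ntk.site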

AUTHOR
    Justin Mason <jm /at/ jmason.org>

COPYRIGHT
    Copyright (C) 1999-2000 Justin Mason

    This program is free software; you can redistribute it and/or
    modify it under the terms of the GNU General Public License as
    published by the Free Software Foundation; either version 2 of
    the License, or (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
    General Public License for more details.

    You should have received a copy of the GNU General Public
    License along with this program; if not, write to the Free
    Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
    MA 02111-1307, USA, or read it on the web at
    http://www.gnu.org/copyleft/gpl.html .

SCRIPT CATEGORIES
    The CPAN script category for this script is `Web'. See
    http://www.cpan.org/scripts/ .

PREREQUISITES
    `File::Find' `File::Copy' `File::Path' `FindBin' `Carp' `Cwd'
    `URI::URL' `LWP::UserAgent' `HTTP::Request::Common' `HTTP::Date'
    `HTML::Entities'

    All these can be picked up from CPAN at http://www.cpan.org/ .
    Note that `HTML::Entities' is actually included in one of the
    previous packages, so you do not need to install it separately.

COREQUISITES
    `Win32::TieRegistry' will be used, if running on a Win32
    platform, to find the Pilot Desktop software's installation
    directory. `Algorithm::Diff' will be used to diff sites
    without running an external diff application (this is
    required on Mac systems).

README
    Sitescooper downloads news stories from the web and converts
    them to Palm handheld iSilo, DOC or text format for later
    reading on-the-move. Site files and full documentation can be
    found at http://sitescooper.org/ .

