SBWebCamCorder
(c) 1998-2006 Dr. Scott M Baker, smbaker@sb-software.com
Introduction/Purpose
SBWcc is designed to automatically capture information from
the World-Wide-Web. It handles the following jobs:
    - Capture any single file
- Capture a page with all of its embedded images
- 'Spider' a website and capture all of the pages on it
- Capture time-changing data such as WebCams
I'll describe each of these situations and when you might want
to use SBWcc in a separate section below.
SBWcc has the following special features which are intended to
make it easy to use and/or provide special functionality:
    - Fully functional shareware -- no crippled features, no
        expirations. 
- Built in full-screen and thumbnail JPEG image viewers to
        show you transfers as they occur in real-time. (Useful
        for picture or webcam sites)
- Rejection of data (non-html) files based on byte size --
        automatically skip files that are too big or too small.
- Ability to detect duplicates and/or rename files of
        different content. (Useful for capturing WebCams in
        which the same file changes over time)
- Manual queue viewer and editor to allow power users to
        manually delete unwanted pages before they're downloaded
- You may limit how deep links are followed
- You may limit link following to the forward direction
        only if desired
Quick Start
Basic WWW Information:
A 'page' which you see in your web browser is typically made
up of several different parts:
    - The text of the page itself, in
        HTML format. (i.e. the main file, specified by the URL
        you provide)
- Embedded images, each stored as
        a separate file 
- Links, which are URLs to other
        pages. 
When your browser loads a page, it first loads the HTML text
file. It parses this file and then requests each of the images
as a separate file. The key point here is that what you see
in your web browser's screen is made up of many files.
All web documents are specified by a Uniform Resource Locator
(URL). The URL can be thought of as the file's address or
"where it lives". To get a particular file, you need to
know its URL. For example, a sample URL is http://www.newsrobot.com/, the
address of my website. All WWW URLs begin with the prefix "http://".
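For the curious, the breakdown of one page into many files can be
pictured with a small Python sketch like the one below. It is only an
illustration -- the URL and markup are made up, and this is not SBWcc's
own code.

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class PageParts(HTMLParser):
        # collects the separate pieces a browser would fetch for one page
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.images = []   # embedded images (<img src=...>)
            self.links = []    # hyperlinks to other pages (<a href=...>)

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "img" and "src" in attrs:
                self.images.append(urljoin(self.base_url, attrs["src"]))
            elif tag == "a" and "href" in attrs:
                self.links.append(urljoin(self.base_url, attrs["href"]))

    parser = PageParts("http://www.example.com/")
    parser.feed('<img src="cam.jpg"><a href="page2.html">next page</a>')
    print(parser.images)   # ['http://www.example.com/cam.jpg']
    print(parser.links)    # ['http://www.example.com/page2.html']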
Default Setup:
SBWcc is preconfigured for the default option that most people
wish to perform: Spidering an entire site and downloading all
files to your hard drive. If you wish to configure SBWcc to
download just a single file, or a single page with images, then
you'll have to change a few configuration options as directed in
the following sections. A few common sticking points for novices:
    - By default, SBWcc will replicate the web server's
        directory structure on your hard drive. Therefore, if
        your download path is set to c:\program files\sbwcc, and
        you download the page http://www.myhost.com/dir1/page1,
        then the page will end up on your hard drive in
        c:\program files\sbwcc\dir1\page1 (see the sketch after
        this list). The point: if you're not too experienced
        with the Windows file system, then you might get a bit
        confused navigating to the files you want -- my
        suggestion is to use Windows Explorer and "explore"
        around if you get confused.
- You can also make SBWcc stuff all of the files into the
        same directory by unchecking an option in the
        Setup/Download page. This eliminates the above problem,
        but means that everything you download is lumped together
        (kind of a mess!)
- Spidering entire sites tends to take a lot of time and
        downloads a lot of files. Try to limit your http://
        specification to be as precise as possible to get the
        files you want. 
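To illustrate the download-path behavior from the first point above, here
is a rough Python sketch (not SBWcc's actual code) that maps a URL onto a
local file name; the folder names match the example given earlier.

    from pathlib import PureWindowsPath
    from urllib.parse import urlparse

    def local_path(download_root, url, append_url_to_path=True):
        name = urlparse(url).path.lstrip("/") or "index.html"
        if append_url_to_path:
            # replicate the web server's directory structure on disk
            return PureWindowsPath(download_root, *name.split("/"))
        # otherwise every file is lumped into the single download directory
        return PureWindowsPath(download_root, name.split("/")[-1])

    print(local_path(r"c:\program files\sbwcc",
                     "http://www.myhost.com/dir1/page1"))
    # c:\program files\sbwcc\dir1\page1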
Capturing a single WWW file:
There may be times when you wish to capture one and
only one file from the web. This is usually the case when you know
the exact URL of the file that you want. Here's how to do it:
    - Select the "Run" tab and type the URL into the
        URL box at the top.
- Select the "Setup/Spider" tabs and select the
        "Single" button in the Spider Method box.
- Select the "Run" tab and press <Start>
- The URL you entered will be downloaded and stored on your
        hard drive. 
Capturing a page with its images:
Use this situation if you want to capture a page, along with
all of the images stored in the page. This will result in several
files -- one file for the page's text, and a JPEG/GIF file for
each of the images. Here are the steps:
    - Select the "Run" tab and type the URL into the
        URL box.
- Select the "Setup/Spider" tabs and select the
        "Page" button in the Spider Method box
- Select the "Run" tab and press <Start>
- First the page text will be downloaded and saved as an
        HTML file. Then, each of the images will be downloaded and
        saved as either a GIF or JPG file. 
Spidering an entire web site:
Spidering refers to the process of requesting a page and then
following its links to get other pages. Spidering can return VERY
MANY files since it's hard to tell exactly what will be
included ahead of time. Spidering is useful for times when you
know that you want to download the entire site to your computer,
including all pages, images, and links. Here are the steps:
    - Select the "Run" tab and type the URL into the
        URL box.
- Select the "Setup/Spider" tabs and select the
        "Spider" button in the Spider Method box.
- Select the "Run" tab and press <Start>
- First the page whose URL you entered will be downloaded,
        followed by all of its images. Then each of the links
        will be followed, resulting in more pages and more images.
    
By default, the Spider option is configured to only capture
pages in the site which you entered. For example, if you entered "http://www.newsrobot.com/",
then only links at www.newsrobot.com
would be followed. This prevents the spider algorithm from
entering other potentially unintended sites and downloading their
content. You can override this behavior by checking the
"Allow off-site pages when spidering" option located on
the "Setup/Spider" tab. Checking this option may cause
SBWcc to run forever, requesting a tremendous number of pages,
and it should be used with caution.
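For readers who like to see the idea in code, the sketch below is a much
simplified Python stand-in for a spider; the on-site test corresponds to
leaving "Allow off-site pages when spidering" unchecked. It is not SBWcc's
actual algorithm.

    import re
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def spider(start_url, allow_off_site=False, max_depth=3):
        start_host = urlparse(start_url).hostname
        queue = [(start_url, 0)]          # pages waiting to be fetched
        seen = set()
        while queue:
            url, depth = queue.pop(0)
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            if not allow_off_site and urlparse(url).hostname != start_host:
                continue                  # stay on the site you started from
            html = urlopen(url).read().decode("utf-8", "replace")
            # ...the page and its images would be saved to disk here...
            for href in re.findall(r'href="([^"]+)"', html, re.I):
                queue.append((urljoin(url, href), depth + 1))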
Capturing WebCam output:
A WebCam is typically a camera that provides live images to
the Internet. There are many different WebCams out there,
including:
    - Scenery Cameras
- Office Cameras
- Strange/Bizarre Cameras (Pets, People's Houses, etc)
- Adult Cameras
These cameras are typically accessed via a WWW browser. By
browsing to a camera, you can see what is presently occurring at
that site, and the image will update automatically as the camera
feeds new data to the Internet.
However, sometimes you would like to 'capture' or 'record'
images from a WebCam to your personal computer for later viewing.
That way, you can look over the data at a later time and see if
anything interesting has happened while you were recording. Since
the images are all stored to your local hard drive, you can view
them much more rapidly offline.
The first step is determining the URL of the camera that you
wish to capture. This is important:
You should use the URL of the image itself rather
than the URL of the page the image is displayed on.
The way to determine the 'image url' is as follows:
    - Go to the website in your browser
- Find the image of the webcam that you wish to capture. Right-click
        in the middle of the image. The menu that
        you get will depend on whether you are using Microsoft
        Internet Explorer or Netscape Navigator.
- If you are using Internet Explorer, then you will see a
        menu which includes an item called
        "properties". Click on this. You will be given
        a dialog that has information for the image, including a
        line which says "Address (URL)". This is the
        url of the image
- If you are using Netscape Navigator, then you will see a
        menu which includes an item called "view
        image". Click on it. The image will expand to
        full-screen and the url displayed in Netscape Navigator's
        window will be that of the image. 
If you can't figure out the "Image URL", then don't
worry -- SBWcc will still work properly if you give it the page
URL; you will just download some extra information that isn't
really necessary. Here are the steps to begin downloading:
    - Select the "Run" tab and enter the URL into the
        URL box. Use the Image Url rather than the page URL if
        possible.
- Select the "Setup (Download)" tab. If you used
        the Image Url in step #1, then check the
        "single" checkbox, otherwise check the
        "page" checkbox. 
- Select the "Setup (General)" tab. You will need
        to enable the "Auto-Restart" checkbox and enter
        the approximate delay between pictures in the delay
        field. I'll talk about this more below.
- Select the "Run" tab and press <Start>
- SBWcc will download the URL that you have entered. 
- Once the download is complete, SBWcc will pause for an amount
        of time equal to what you entered in the delay field in
        step #3. Once the delay is complete, SBWcc will start all
        over again and re-download the image, saving it only if
        it has changed.
Most WebCams do not update their feeds instantly -- in most
cases the feeds are updated at a periodic rate, for example, once
every 30 seconds. Thus, it would make no sense to download at a
rate faster than once every 30 seconds, since anything faster would
yield duplicate images. 
This is what the delay setting is for. Set it for what you
believe the delay between picture updates is. Most WebCam sites
will include an estimate of the delay rate somewhere on their
page. 
If you attempt to download images faster than the site is
updating them, then duplicates will be received. SBWcc
automatically finds duplicates and does not save them to your
computer.
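Taken together, the auto-restart and duplicate behavior amounts to a loop
roughly like the Python sketch below; the image URL, delay, and file names
are examples only.

    import hashlib
    import time
    from urllib.request import urlopen

    IMAGE_URL = "http://www.example.com/webcam.jpg"  # the image URL, not the page URL
    DELAY_SECONDS = 30                               # roughly the camera's update rate

    def capture(frames=10):
        last_hash = None
        for n in range(frames):
            data = urlopen(IMAGE_URL).read()
            digest = hashlib.md5(data).hexdigest()
            if digest != last_hash:                  # only keep frames that changed
                with open("webcam_%04d.jpg" % n, "wb") as f:
                    f.write(data)
                last_hash = digest
            time.sleep(DELAY_SECONDS)                # the "delay" field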
Setup Options
The setup tab is divided into several sub-tabs representing
different aspects of configuration:
    - General: Miscellaneous Options
- Download: Download directory & related settings
- Duplicates: The duplicate checker
- Spider: Options for spidering entire web sites
Each of these will be discussed below:
General:
    - Auto-restart. Primarily intended
        for sites such as webcams where the content changes each
        time you request it. Auto-restart will cause SBWcc to
        re-download the page after the specified interval has
        elapsed.
- Proxy Server. Proxy servers are
        used in certain LAN settings where one computer acts as
        an intermediary to the web for all computers on the LAN.
Download:
    - Download Path. The download path
        is the location on your computer where all downloaded
        files will be placed. 
- Append URL to path. Checking this
        option will cause a directory structure to be created on
        your hard drive which matches the structure on the
        website. Unchecking this option will put all of the files
        you've downloaded into a single directory. 
Duplicates:
    - Duplicate Checker. This controls
        the mode of the duplicate checker. It may be set to
        "do not reject duplicates" in which case no
        duplicates are rejected, "reject all
        duplicates" in which case all duplicates are
        rejected, or "only reject not-modified". If the
        latter option is checked, then SBWcc will verify each
        duplicate with the webserver to see if it has changed
        (see the sketch after this list). Although this still
        requires some time, it is much quicker than
        re-downloading everything.
- URLs Remembered. This specifies
        the number of URLs that are currently in the duplicate
        checker database. You may use the clear button to purge
        any or all of them.
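The "only reject not-modified" mode boils down to a conditional request,
roughly as in the Python sketch below (the URL and date are placeholders,
not SBWcc internals).

    from urllib.error import HTTPError
    from urllib.request import Request, urlopen

    def changed_since(url, last_modified):
        req = Request(url, headers={"If-Modified-Since": last_modified})
        try:
            urlopen(req)
            return True            # server sent the file again: it changed
        except HTTPError as err:
            if err.code == 304:    # "304 Not Modified" -- no re-download needed
                return False
            raise

    # changed_since("http://www.example.com/pic.jpg",
    #               "Sat, 01 Jan 2005 00:00:00 GMT")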
Spider:
    - Spider Method. This controls how
        the spider algorithm functions:
            - Single: Only the page you specify will be
                downloaded.
- Page: The page you specified, plus all of its
                images will be downloaded, but none of the
                hyperlinks will be followed.
- Spider: SBWcc will download the page you
                specified, all of its images, and then follow
                each hyperlink. This will continue until all
                hyperlinks have been exhausted. 
 
- Spider Options
            - Off-site images. If the
                current page includes an image which is located
                on another server, that image will be downloaded.
            
- Off-site pages. If the
                current page includes a hyperlink to a page which
                is located on another server, that page will be
                downloaded.
- Forward links only. If
                checked, then only links in directories deeper
                than the current directory will be followed. For
                example, assume the current page is http://foo/bar/sam,
                then http://foo/bar/sam/joe
                would be followed, but http://foo/bar/tom
                would not (see the sketch after this list).
- Treat linked images as inlines.
                Many picture-oriented sites include links to
                JPEG/GIF images; when you click on the link you
                get a full-screen image. However, this image is
                still technically a hyperlink, and SBWcc would
                normally consider it like a normal hyperlink. If
                this option is checked, then JPEG/GIF images
                included as hyperlinks get a special status and
                are treated just like inline images. (Good for
                picture sites)
 
- Maximum Link Depth. SBWcc keeps
        track of how deep it gets as it follows each link. In
        spider mode, SBWcc will continue to follow links until it
        exhausts them all, which can take a very long (or
        infinite) time. You can specify a limit on how many links
        deep SBWcc will go -- this places a bound on how far out
        of hand things can get. 
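The "Forward links only" rule can be pictured as a simple prefix test, as
in the Python sketch below, using the example URLs given above.

    def is_forward_link(current_url, link_url):
        # a forward link lies deeper than the current page's own path
        prefix = current_url.rstrip("/") + "/"
        return link_url.startswith(prefix)

    print(is_forward_link("http://foo/bar/sam", "http://foo/bar/sam/joe"))  # True
    print(is_forward_link("http://foo/bar/sam", "http://foo/bar/tom"))      # False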
 
Dealing with Referrer Checking
Lots of websites use something called the 'referrer
header' to check the validity of downloads. This is done to make
sure you're not requesting a single image from the site without
viewing an entire page (which may contain advertisements and
other junk). 
For example, let's assume you are viewing a web page,
'www.website.com/page1.html' and it contains the image
'www.website.com/image1.gif' that you want SBWcc to pull down. 
On a normal website, you can enter the URL for
'www.website.com/image1.gif' into SBWcc and it'll download just
fine. However, on a referrer-checked website, if you enter
'www.website.com/image1.gif', it might pop up some kind of
'invalid request' or 'unauthorized request' image instead... 
The way around this is to use the 'initial referrer header'
setting in SBWcc's advanced settings. To get there, click the
"Setup" tab, then the "General" tab, and
finally the "Advanced" button. This will bring up the
advanced settings dialog box. The thing we're interested in here
is the 'initial referrer header'. 
In the initial referrer header box, you'll need to type in the
full url of the page that contains the image
you're trying to download. In our example, this might be
'www.website.com/page1.html'. It'll be different in your case. 
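In effect, the initial referrer header makes the image request look as if
it came from that page. A rough Python sketch of the same idea, using the
example URLs above:

    from urllib.request import Request, urlopen

    page_url = "http://www.website.com/page1.html"   # the page that displays the image
    image_url = "http://www.website.com/image1.gif"  # the file you actually want

    # note: the HTTP header itself is spelled "Referer"
    req = Request(image_url, headers={"Referer": page_url})
    data = urlopen(req).read()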
Support for Authentication/Password Sites
Many web sites require passwords for access. SBWcc has been
designed to support downloading from these sites. You will need
to enter the correct password information into SBWcc's
authentication settings. Authentication is located under the
authentication tab and includes settings for as many sites/urls
as you require. 
Press the <add> button to add a new authentication
setting. You will be prompted for the following:
    - Host Name/Url Substring. This is
        a text string which uniquely identifies the URLs to
        which the authentication will apply. For example, if the
        site you want to access is "www.mysite.com",
        then you could simply enter the hostname "www.mysite.com"
        into the box. Some sites may have multiple
        authentications for different sections. For example, you
        could have a separate password for "www.mysite.com/section1/"
        and "www.mysite.com/section2/".
        In these cases, just enter the longer url strings to
        differentiate between them. A good tip is to keep the
        Host Name/Url Substring as short as possible while still
        uniquely identifying the authenticated pages. If a simple hostname
        will work (i.e. "www.mysite.com"),
        then just use that. There's no reason to get fancy.
- Authentication 'user name'. This
        is the authentication user name, as assigned to you by
        the site administrator. 
- Authentication 'password'. This
        is the password, as assigned to you by the site
        administrator. 
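For reference, the Host Name/Url Substring idea works roughly like the
Python sketch below; the user names, passwords, and URLs are the
placeholders from the example, and standard HTTP "Basic" authentication is
assumed.

    import base64
    from urllib.request import Request, urlopen

    AUTH_SETTINGS = {
        "www.mysite.com/section1/": ("user1", "password1"),
        "www.mysite.com/section2/": ("user2", "password2"),
    }

    def fetch(url):
        headers = {}
        for substring, (user, password) in AUTH_SETTINGS.items():
            if substring in url:   # the substring match described above
                token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
                headers["Authorization"] = "Basic " + token
                break
        return urlopen(Request(url, headers=headers)).read()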
AdultCheck Sites:
Many people have asked me for information on how to access
sites that are protected by the AdultCheck Verification system,
which differs from traditional web authentication. The AdultCheck
verification system works something like the following:
    - The user is requested to enter his/her AdultCheck key
- When the submit button is pressed, the key is sent to the
        AdultCheck website for verification
- If the key is good, the user's browser is redirected to a
        web page on the originating system (the pay area)
The key observation is that the URL the browser is directed to
in step #3 is usually unprotected. This URL is what you
need to feed into SBWcc in order to download from the protected
site. Normally the URL will be displayed by the browser near the
top of the browser window. However, this is not always the case
-- sites with "frames" typically hide the correct URL
from you, and may require a bit of investigation on your part. 
It's not my purpose to give you specific instructions on
accessing sites, but a good hint is to use your browser to enter
the site as normal, determine the URL of the pay area, and then
enter that URL into SBWcc. 
Filter Options
Several filtering options are built into SBWcc. They are all
located under the 'Filter' tab. The first is a box of download
file types. These types represent Internet Mime Content Types and
loosely relate to file extensions (i.e. Text/Html = .html, .htm
and Image/Jpeg = .jpeg, .jpe, .jpg, etc). Unchecking a file type
will prevent download of those files to your hard drive. 
The second filtering method is by file size. You can set both
a minimum and a maximum file size. 
It should be noted that in Spider (or Page) mode, SBWcc must
always download HTML files. This is because the HTML files
contain the hyperlinks that are necessary for spidering. Thus,
the filtering options will not prevent HTML files from being
downloaded. (Unchecking the Text/Html file type will prevent HTML
files from being saved, but not transferred)
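The filter rules amount to a check along the lines of this Python sketch;
the allowed types and size limits are example values standing in for
whatever you tick on the Filter tab.

    ALLOWED_TYPES = {"text/html", "image/jpeg", "image/gif"}  # the checked file types
    MIN_SIZE, MAX_SIZE = 1000, 2000000                        # example byte limits

    def should_save(content_type, size, save_html=True):
        base_type = content_type.split(";")[0].strip().lower()
        if base_type == "text/html":
            return save_html       # HTML is always transferred; saving it is optional
        if base_type not in ALLOWED_TYPES:
            return False
        return MIN_SIZE <= size <= MAX_SIZE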
Diagnostics
The diagnostics screen is really designed for experienced
power users, but even novices can make some use of it. Every time
a failure is encountered, it is displayed in the diagnostic
screen. There is also a detailed message listing of what's being
sent to and from your computer. If something goes wrong, you can
try to figure out what happened from this screen.
Located on the diagnostics screen is the "Q-View"
button. A "queue" is really just a fancy term for a
list, and these lists contain the files that SBWcc is going to be
downloading. By displaying the Q-Viewer, you can get a sneak peek
at the upcoming files to be downloaded and delete some of them if
you don't want them. The Queues are always processed in a
specific priority: On-Site Images, Off-Site Images, On-Site
Links, and Off-Site Links.
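The fixed priority can be pictured with the small Python sketch below; the
queue names simply mirror the order listed above.

    QUEUE_ORDER = ["on-site images", "off-site images",
                   "on-site links", "off-site links"]
    queues = {name: [] for name in QUEUE_ORDER}

    def next_url():
        # always drain the highest-priority non-empty queue first
        for name in QUEUE_ORDER:
            if queues[name]:
                return queues[name].pop(0)
        return None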
Also located in the diagnostics screen is an advanced button
which will let you adjust some specific timeout values. I
recommend only messing with it if you have a good reason.
Registration
This program is distributed as Shareware, and Ad-Software
(hence the advertisements on the bottom of the page). For more
information on my registration policy, see http://www.sb-software.com/credit/
Contacting the Author:
You can reach me via email at smbaker@primenet.com
My website is http://www.sb-software.com
You can find the SBWcc page at http://www.sb-software.com/sbwcc
You can contact me via US mail at
Scott M Baker
2241 W Labriego
Tucson, Az 85741
Revision History
    - Version 1.0 
    
- Version 1.1 
            - Fixed problem with jpeg viewer not always
                displaying correct file 
- Support in thumbnail viewer for GIF files 
- Added duplicate checking 
 
- Version 1.2
            - Added 'delete' to viewer (use DEL key)
- Support for authentication
- More robust error handling 
 
- Version 1.3
            - Fixed problem with assert() on systems with large
                fonts 
- Added note about demographic information 
 
- Version 1.4 
            - Fixed delete in viewer not working 
- Min/Max file size settings now ignore html files 
- Fixed glitch in http error handling code 
 
- Version 1.5 
            - Added more mime content types to filter section 
- Included winsock error code on winsock errors 
- Added host name to error log and dialog 
- Robustized bmp viewer 
- Added advanced error settings dialog 
- Added error box timeout to advanced error
                settings dialog 
- Added support for relative redirects 
- Parser was erroneously removing trailing slashes 
- Added timeout mechanism (diagnostics, advanced
                button) 
 
- Version 1.6
            - Fixed links that had a ':' in them somewhere 
- Expand amp,quot,lt,gt escapes in links 
- Added minimize button 
- Fixed problem with wrong dialog box size on
                systems with large fonts 
- Added browser id setting 
- Added advanced general options dialog 
- Reduced adsoftware flicker problem 
 
- Version 1.7 
            - Better explanation text for download path options
            
- Added support for BASE tags 
- Added Q-View 
- Better handling of redirect 
- Better handling of stopped transfers 
 
- Version 1.8 
            - Fixed restart delay wasn't being saved between
                sessions 
- Viewer now remembers previous mode and slideshow
                delay 
- Viewer remembers previous position 
- Added support for fake virtual addressing 
- Added maximum link depth 
- Added links-as-images option 
- Added forward-only option 
- Defaulted spider mode to spider 
- Defaulted download mode to include URL 
- Partial support for imagemaps 
 
- Version 1.9 
            - Stopped viewer from thinking it's sortpics 
- Fixed links-as-images flag 
- Viewer now opens last filename automatically 
 
- Version 2.0
            - Added tools tab 
- Added use bookmark command 
- Added open image viewer command 
- Viewer: added rename (F2) 
- Fixed viewer resize problems with large fonts 
- Fixed buffer overrun problem with status msg 
 
- Version 2.1
            - Transparent gif support in viewer
- fixed numerous viewer memory leaks
- New version of html parser, url, and buffer
                routines 
- Compiled with smalloc 
- Eliminate printing of corrupted url in debug
                window 
 
- Version 2.2
            - Upgrade to adsoftware V3 
- Upgrade to BCB 5 
- Added referer tracking; Added initial referrer to
                advanced options
- Switch from malloc to MALLOC -- fixes bizarre
                thumbnail bug 
- Designed custom icon for SBWcc in place of
                generic delphi icon
- Many updates to built in viewer (see http://www.sortpics.com)
            
 
- Version 2.3
            - Added try-catch block to setfilename in viewer 
- Removed adsoftware; wcc is now shareware 
 
- Version 2.4
            - Changed registration price 
- Added receive log 
- Better handling of invalid jpeg files in jpeg
                viewer 
 
- Version 2.5
            - Fixed crash when receiving very long URLs
            
- Recognize & report replies that don't have status lines
            
- Report error #501 as Not Supported error
            
- Add error codes to "Unknown Error" status messages
            
- Fixed crash on URLs/headers that have %s in them
            
- Fixed crash on requests with really long headers
            
- Added advanced page to setup page
        
 
- Version 2.6
            - Added spot to enter reg code in wait screen
            
- Switch to TSBSocket code
            
- support for https websites
            
- Added checkbox to control following of https from http
            
- Fix bogus error messages when connection failed and mime-type filtering enabled
            
- Change default download dir to program_dir\download
            
- Added button to clear received files list
            
- Added accept: */* header