SBWebCamCorder
(c) 1998-2006 Dr. Scott M Baker, smbaker@sb-software.com
Introduction/Purpose
SBWcc automatically captures information from the
World-Wide-Web. It is designed to do the following jobs:
- Capture any single file
- Capture a page with all of its embedded images
- 'Spider' a website and capture all of the pages on it
- Capture time-changing data such as WebCams
I'll describe each of these situations and when you might want
to use SBWcc in a separate section below.
SBWcc has the following special features which are intended to
make it easy to use and/or provide special functionality:
- Fully functional shareware -- no crippled features, no
expirations.
- Built-in full-screen and thumbnail JPEG image viewers to
show you transfers as they occur in real time. (Useful
for picture or webcam sites)
- Rejection of data (non-html) files based on byte size --
automatically skip files that are too big or too small.
- Ability to detect duplicates and/or rename files of
different content. (Useful for capturing WebCams in
which the same file changes over time)
- Manual queue viewer and editor to allow power users to
manually delete unwanted pages before they're downloaded
- You may limit how deep links are followed
- You may limit link following to the forward direction
only if desired
Quick Start
Basic WWW Information:
A 'page' which you see in your web browser is typically made
up of several different parts:
- The text of the page itself, in
HTML format. (i.e. the main file, specified by the URL
you provide)
- Embedded images, each stored as
a separate file
- Links, which are URLs to other
pages.
When your browser loads a page, it first loads the HTML text
file. It parses this file and then requests each of the images
as a separate file. The key point here is that what you see
on your web browser's screen is made up of many files.
All web documents are specified by a Uniform Resource Locator
(URL). The URL can be thought of as the file's address or
"where it lives". To get a particular file, you need to
know its URL. For example, a sample URL is http://www.newsrobot.com/, the
address of my website. All WWW URLs begin with the prefix "http://".
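As a rough illustration (this is not SBWcc's own code), the short
Python sketch below fetches one page and lists the separate image
and link files that make it up; the URL is just an example:

    # Sketch: a "page" is really many files -- the HTML text plus
    # one additional file for every embedded image.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class PageParts(HTMLParser):
        def __init__(self):
            super().__init__()
            self.images = []   # <img src=...> : embedded images
            self.links = []    # <a href=...>  : links to other pages

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "img" and "src" in attrs:
                self.images.append(attrs["src"])
            elif tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])

    url = "http://www.newsrobot.com/"     # the main file's URL
    html = urlopen(url).read().decode("latin-1", "replace")
    parts = PageParts()
    parts.feed(html)

    for src in parts.images:
        print("image file :", urljoin(url, src))
    for href in parts.links:
        print("linked page:", urljoin(url, href))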
Default Setup:
SBWcc is preconfigured for the default option that most people
wish to perform: Spidering an entire site and downloading all
files to your hard drive. If you wish to configure SBWcc to
download just a single file, or a single page with images, then
you'll have to change a few configuration options as directed in
the following sections. A few common sticking points for novices:
- By default, SBWcc will replicate the web server's
directory structure on your hard drive. Therefore, if
your download path is set to c:\program files\sbwcc, and
you download the page http://www.myhost.com/dir1/page1,
then the page will end up on your hard drive in
c:\program files\sbwcc\dir1\page1. (A sketch of this
mapping appears just after this list.) The point: if you're
not too experienced with the Windows file system, then
you might get a bit confused navigating to the files you
want -- my suggestion is to use Windows Explorer and
"explore" around if you get confused.
- You can also make SBWcc stuff all of the files into the
same directory by unchecking an option in the
Setup/Download page. This eliminates the above problem,
but means that everything you download is lumped together
(kind of a mess!)
- Spidering entire sites tends to take a lot of time and
downloads a lot of files. Try to limit your http://
specification to be as precise as possible to get the
files you want.
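For the directory-structure point above, here is a small Python
sketch (my own illustration, not SBWcc's actual code) of how a
URL's path can be mapped onto a download directory. The download
path and URL are the example values from the list:

    import os
    from urllib.parse import urlparse

    def local_path(url, download_dir=r"c:\program files\sbwcc",
                   append_url=True):
        parsed = urlparse(url)
        if not append_url:
            # "Stuff everything into one directory": keep only the file name.
            name = os.path.basename(parsed.path) or "index.html"
            return os.path.join(download_dir, name)
        # Replicate the server's directory structure under the download path.
        relative = parsed.path.lstrip("/").replace("/", os.sep) or "index.html"
        return os.path.join(download_dir, relative)

    print(local_path("http://www.myhost.com/dir1/page1"))
    # -> c:\program files\sbwcc\dir1\page1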
Capturing a single WWW file:
There may be times when you wish to capture one and
only one file from the web. This is usually the case when you know
the exact URL of the file that you want. Here's how to do it:
- Select the "Run" tab and type the URL into the
URL box at the top.
- Select the "Setup/Spider" tabs and select the
"Single" button in the Spider Method box.
- Select the "Run" tab and press <Start>
- The URL you entered will be downloaded and stored on your
hard drive.
Capturing a page with its images:
Use this option when you want to capture a page, along with
all of the images embedded in the page. This will result in several
files -- one file for the page's text, and a JPEG/GIF file for
each of the images. Here are the steps:
- Select the "Run" tab and type the URL into the
URL box.
- Select the "Setup/Spider" tabs and select the
"Page" button in the Spider Method box
- Select the "Run" tab and press <Start>
- First the page text will be downloaded and saved as an
HTML file. Then, each of the images will be downloaded and
saved as either a GIF or JPG file.
Spidering an entire web site:
Spidering refers to the process of requesting a page and then
following its links to get other pages. Spidering can return a very
large number of files, since it's hard to tell ahead of time exactly
what will be included. Spidering is useful for times when you
know that you want to download the entire site to your computer,
including all pages, images, and links. Here are the steps:
- Select the "Run" tab and type the URL into the
URL box.
- Select the "Setup/Spider" tabs and select the
"Spider" button in the Spider Method box.
- Select the "Run" tab and press <Start>
- First the page whose URL you entered will be downloaded,
followed by all of its images. Then each of the links
will be followed resulting in more pages and more images.
By default, the Spider option is configured to only capture
pages in the site which you entered. For example, if you entered "http://www.newsrobot.com/",
then only links at www.newsrobot.com
would be followed. This prevents the spider algorithm from
entering other potentially unintended sites and downloading their
content. You can override this behavior by checking the
"Allow off-site pages when spidering" option located on
the "Setup/Spider" tab. Checking this option may cause
SBWcc to run forever, requesting a tremendous number of pages,
and it should be used with caution.
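To make the spidering process concrete, here is a simplified
Python sketch of a same-site spider. The fetch_links() helper is
assumed (it would parse a page and return its hyperlinks); this
illustrates the idea rather than the algorithm SBWcc actually
uses:

    from collections import deque
    from urllib.parse import urlparse

    def spider(start_url, fetch_links, allow_offsite=False, max_pages=500):
        start_host = urlparse(start_url).netloc
        queue, seen = deque([start_url]), {start_url}
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            print("downloading", url)        # save the page and its images here
            for link in fetch_links(url):
                if link in seen:
                    continue                 # never request the same URL twice
                if urlparse(link).netloc != start_host and not allow_offsite:
                    continue                 # stay on the site you entered
                seen.add(link)
                queue.append(link)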
Capturing WebCam output:
A WebCam is typically a camera that provides live images to
the Internet. There are many different WebCams out there,
including:
- Scenery Cameras
- Office Cameras
- Strange/Bizarre Cameras (Pets, People's Houses, etc)
- Adult Cameras
These cameras are typically accessed via a WWW browser. By
browsing to a camera, you can see what is presently occurring at
that site, and the image will update automatically as the camera
feeds new data to the Internet.
However, sometimes you would like to 'capture' or 'record'
images from a WebCam to your personal computer for later viewing.
That way, you can look over the data at a later time and see if
anything interesting has happened while you were recording. Since
the images are all stored to your local hard drive, you can view
them much more rapidly offline.
The first step is determining the URL of the camera that you
wish to capture. This is important:
You should use the URL of the image itself rather
than the URL of the page the image is displayed on.
The way to determine the 'image URL' is as follows:
- Go to the website in your browser
- Find the image of the webcam that you wish to capture. Right-click
in the middle of the image. The menu that
you get will depend on whether you are using Microsoft
Internet Explorer or Netscape Navigator.
- If you are using Internet Explorer, then you will see a
menu which includes an item called
"properties". Click on this. You will be given
a dialog that has information for the image, including a
line which says "Address (URL)". This is the
URL of the image.
- If you are using Netscape Navigator, then you will see a
menu which includes an item called "view
image". Click on it. The image will expand to
full-screen and the URL displayed in Netscape Navigator's
window will be that of the image.
If you can't figure out the "Image URL", then don't
worry -- SBWcc will still work properly if you give it the page
URL; you will just download some extra information that isn't
really necessary. Here are the steps to begin downloading:
- Select the "Run" tab and enter the URL into the
URL box. Use the Image URL rather than the page URL if
possible.
- Select the "Setup (Download)" tab. If you used
the Image URL in step #1, then check the
"single" checkbox, otherwise check the
"page" checkbox.
- Select the "Setup (General)" tab. You will need
to enable the "Auto-Restart" checkbox and enter
the approximate delay between pictures in the delay
field. I'll talk about this more below.
- Select the "Run" tab and press <Start>
- SBWcc will download the URL that you have entered.
- Once download is complete, SBWcc will pause for an amount
of time equal to what you entered in the delay field in
step #3. Once the delay is complete, SBWcc will start all
over again, and re-download the images, saving the image
if it has changed.
Most WebCams do not update their feeds instantly -- in most
cases the feeds are updated at a periodic rate, for example, once
every 30 seconds. Thus, it would make no sense to download at a
rate faster than once every 30 seconds, since anything faster would
yield duplicate images.
This is what the delay setting is for. Set it for what you
believe the delay between picture updates is. Most WebCam sites
will include an estimate of the delay rate somewhere on their
page.
If you attempt to download images faster than the site is
updating them, then duplicates will be received. SBWcc
automatically finds duplicates and does not save them to your
computer.
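Putting the delay and the duplicate handling together, a capture
loop might look roughly like the Python sketch below. The file
names and the MD5-based duplicate test are my own choices for
illustration, not SBWcc's internals:

    import hashlib, time
    from urllib.request import urlopen

    def record_webcam(image_url, delay_seconds=30, shots=10):
        last_hash = None
        for n in range(shots):
            data = urlopen(image_url).read()
            digest = hashlib.md5(data).hexdigest()
            if digest != last_hash:          # only save images that changed
                with open(f"webcam_{n:04d}.jpg", "wb") as f:
                    f.write(data)
                last_hash = digest
            time.sleep(delay_seconds)        # wait about one update interval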
Setup Options
The setup tab is divided into several sub-tabs representing
different aspects of configuration:
- General: Miscellaneous Options
- Download: Download directory & related settings
- Duplicates: The duplicate checker
- Spider: Options for spidering entire web sites
Each of these will be discussed below:
General:
- Auto-restart. Primarily intended
for sites such as webcams where the content changes each
time you request it. Auto-restart will cause SBWcc to
re-download the page after the specified interval has
elapsed.
- Proxy Server. Proxy servers are
used in certain LAN settings where one computer acts as
an intermediary to the web for all computers on the LAN.
Download:
- Download Path. The download path
is the location on your computer where all downloaded
files will be placed.
- Append URL to path. Checking this
option will cause a directory structure to be created on
your hard drive which matches the structure on the
website. Unchecking this option will put all of the files
you've downloaded into a single directory.
Duplicates:
- Duplicate Checker. This controls
the mode of the duplicate checker. It may be set to
"do not reject duplicates" in which case no
duplicates are rejected, "reject all
duplicates" in which case all duplicates are
rejected, or "only reject not-modified". If the
latter option is checked, then SBWcc will verify each
duplicate with the webserver to see if it has changed
(a sketch of such a check appears after this list).
Although this still requires a significant amount of
time, it is still much quicker than re-downloading
everything.
- URL's Remembered. This specifies
the number of URLs that are currently in the duplicate
checker database. You may use the clear button to purge
any or all of them.
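The "only reject not-modified" mode corresponds to a conditional
HTTP request. Here is a rough Python sketch of such a check; the
header names are standard HTTP, but the surrounding logic is only
an illustration:

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    def changed_since(url, last_modified):
        req = Request(url, headers={"If-Modified-Since": last_modified})
        try:
            reply = urlopen(req)             # 200 OK: the content has changed
            return True, reply.headers.get("Last-Modified", last_modified)
        except HTTPError as e:
            if e.code == 304:                # 304 Not Modified: skip the re-download
                return False, last_modified
            raise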
Spider:
- Spider Method. This controls how
the spider algorithm functions:
- Single: Only the page you specify will be
downloaded.
- Page: The page you specified, plus all of its
images will be downloaded, but none of the
hyperlinks will be followed.
- Spider: SBWcc will download the page you
specified, all of its images, and then follow
each hyperlink. This will continue until all
hyperlinks have been exhausted.
- Spider Options
- Off-site images. If the
current page includes an image which is located
on another server, that image will be downloaded.
- Off-site pages. If the
current page includes a hyperlink to a page which
is located on another server, that page will be
downloaded.
- Forward links only. If
checked, then only links in directories deeper
than the current directory will be followed. For
example, if the current page is http://foo/bar/sam,
then http://foo/bar/sam/joe
would be followed, but http://foo/bar/tom
would not. (A sketch of this test appears after
this list.)
- Treat linked images as inlines.
Many picture-oriented sites include links to
JPEG/GIF images; when you click on the link you
get a full-screen image. However, this image is
still technically a hyperlink, and SBWcc would
normally consider it like a normal hyperlink. If
this option is checked, then JPEG/GIF images
included as hyperlinks get a special status and
are treated just like inline images. (Good for
picture sites)
- Maximum Link Depth. SBWcc keeps
track of how deep it gets as it follows each link. In
spider mode, SBWcc will continue to follow links until it
exhausts them all, which can take a very long (or
infinite) time. You can specify a limit on how many links
deep SBWcc will go -- this places a bound on how far out
of hand things can get.
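As a rough sketch of the forward-only and link-depth limits (my
own reading of the rules, not SBWcc's exact code), a link filter
might look like this in Python:

    from urllib.parse import urlparse

    def follow_link(current_url, link_url, depth, max_depth=5,
                    forward_only=True):
        if depth > max_depth:
            return False                     # bound how deep the spider goes
        if forward_only:
            base = urlparse(current_url).path.rstrip("/") + "/"
            # Only follow links whose path lies under the current directory.
            return urlparse(link_url).path.startswith(base)
        return True

    print(follow_link("http://foo/bar/sam", "http://foo/bar/sam/joe", 1))  # True
    print(follow_link("http://foo/bar/sam", "http://foo/bar/tom", 1))      # False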
Dealing with Referrer Checking
Lots of websites use something called the 'referrer
header' to check the validity of downloads. This is done to make
sure you're not requesting a single image from the site without
viewing an entire page (which may contain advertisements and
other junk).
For example, let's assume you are viewing a web page,
'www.website.com/page1.html' and it contains the image
'www.website.com/image1.gif' that you want SBWcc to pull down.
On a normal website, you can enter the URL for
'www.website.com/image1.gif' into SBWcc and it'll download just
fine. However, on a referrer-checked website, if you enter
'www.website.com/image1.gif', it might pop up some kind of
'invalid request' or 'unauthorized request' image instead...
The way around this is to use the 'initial referrer header'
setting in SBWcc's advanced settings. To get there, click the
"Setup" tab, then the "General" tab, and
finally the "Advanced" button. This will bring up the
advanced settings dialog box. The thing we're interested in here
is the 'initial referrer header'.
In the initial referrer header box, you'll need to type in the
full URL of the page that contains the image
you're trying to download. In our example, this might be
'www.website.com/page1.html'. It'll be different in your case.
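In code terms, supplying a referrer simply means sending a
Referer header along with the image request. The Python sketch
below uses the hypothetical URLs from the example above:

    from urllib.request import Request, urlopen

    image_url = "http://www.website.com/image1.gif"
    page_url = "http://www.website.com/page1.html"   # page that contains the image

    req = Request(image_url, headers={"Referer": page_url})
    data = urlopen(req).read()
    with open("image1.gif", "wb") as f:
        f.write(data)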
Support for Authentication/Password Sites
Many web sites require passwords for access. SBWcc has been
designed to support downloading from these sites. You will need
to enter the correct password information into SBWcc's
authentication settings. Authentication is located under the
Authentication tab and includes settings for as many sites/URLs
as you require.
Press the <add> button to add a new authentication
setting. You will be prompted for the following:
- Host Name/Url Substring. This is
a text string which uniquely identifies the URLs to
which the authentication will apply. For example, if the
site you want to access is "www.mysite.com",
then you could simply enter the hostname "www.mysite.com"
into the box. Some sites may have multiple
authentications for different sections. For example, you
could have a separate password for "www.mysite.com/section1/"
and "www.mysite.com/section2/".
In such cases, just enter the longer URL strings to
differentiate between them. A good tip is to keep the
Host Name/Url Substring as short as possible to uniquely
identify the authenticated pages. If a simple hostname
will work (i.e. "www.mysite.com"),
then just use that. There's no reason to get fancy. (A
sketch of this substring matching appears after this
list.)
- Authentication 'user name'. This
is the authentication user name, as assigned to you by
the site administrator.
- Authentication 'password'. This
is the password, as assigned to you by the site
administrator.
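Conceptually, the substring matching works something like the
Python sketch below, which picks the credentials whose substring
matches the URL and sends standard HTTP Basic authentication. The
hosts, user names, and passwords are placeholders, and this is
only an illustration:

    import base64
    from urllib.request import Request, urlopen

    credentials = {
        "www.mysite.com/section1/": ("user1", "password1"),
        "www.mysite.com/section2/": ("user2", "password2"),
    }

    def authorized_request(url):
        headers = {}
        # Check the longest (most specific) substrings first.
        for substring, (user, password) in sorted(
                credentials.items(), key=lambda kv: -len(kv[0])):
            if substring in url:
                token = base64.b64encode(f"{user}:{password}".encode()).decode()
                headers["Authorization"] = "Basic " + token
                break
        return urlopen(Request(url, headers=headers))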
AdultCheck Sites:
Many people have asked me for information on how to access
sites that are protected by the AdultCheck Verification system,
which differs from traditional web authentication. The AdultCheck
verification system works something like the following:
- The user is requested to enter his/her AdultCheck key
- When the submit button is pressed, the key is sent to the
AdultCheck website for verification
- If the key is good, the user's browser is redirected to a
web page on the originating system (the pay area)
The key observation is that the URL to which the browser is
directed in step #3 is usually unprotected. This URL is what you
need to feed into SBWcc in order to download from the protected
site. Normally the URL will be displayed by the browser near the
top of the browser window. However, this is not always the case
-- sites with "frames" typically hide the correct URL
from you, and may require a bit of investigation on your part.
It's not my purpose to give you specific instructions on
accessing sites, but a good hint is to use your browser to enter
the site as normal, determine the URL of the pay area, and then
enter that URL into SBWcc.
Filter Options
Several filtering options are built into SBWcc. They are all
located under the 'Filter' tab. The first is a box of download
file types. These types represent Internet Mime Content Types and
loosely relate to file extensions (i.e. Text/Html = .html, .htm
and Image/Jpeg = .jpeg, .jpe, .jpg, etc). Unchecking a file type
will prevent download of those files to your hard drive.
The second filtering method is by file size. You can set both
a minimum and a maximum file size.
It should be noted that in Spider (or Page) mode, SBWcc must
always download HTML files. This is because the HTML files
contain the hyperlinks that are necessary for spidering. Thus,
the filtering options will not prevent HTML files from being
downloaded. (Unchecking the Text/Html file type will prevent HTML
files from being saved, but not transferred.)
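As an illustration of these filters (the content types and size
limits below are example values, not SBWcc's defaults), a save
decision might look like this in Python:

    allowed_types = {"text/html", "image/jpeg", "image/gif"}

    def should_save(content_type, size_bytes, min_size=1_000,
                    max_size=2_000_000):
        base_type = content_type.split(";")[0].strip().lower()
        if base_type not in allowed_types:
            return False                     # unchecked content types are not saved
        if base_type == "text/html":
            return True                      # size limits are not applied to HTML
        return min_size <= size_bytes <= max_size

    print(should_save("image/jpeg", 50_000))   # True
    print(should_save("image/bmp", 50_000))    # False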
Diagnostics
The diagnostics screen is really designed for experienced
power users, but even novices can make some use of it. Every time
a failure is encountered, it is displayed in the diagnostic
screen. There is also a detailed message listing of what's being
sent to and from your computer. If something goes wrong, you can
try to figure out what happened from this screen.
Located on the diagnostics screen is the "Q-View"
button. A "queue" is really just a fancy term for a
list, and these lists contain the files that SBWcc is going to be
downloading. By displaying the Q-Viewer, you can get a sneak peek
at the upcoming files to be downloaded and delete some of them if
you don't want them. The queues are always processed in a
specific priority: On-Site Images, Off-Site Images, On-Site
Links, and Off-Site Links.
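The queue priority can be pictured roughly as in the Python
sketch below; the queue names mirror the order given above, not
SBWcc's internal data structures:

    from collections import deque

    queues = {
        "on_site_images":  deque(),
        "off_site_images": deque(),
        "on_site_links":   deque(),
        "off_site_links":  deque(),
    }

    def next_download():
        # Always drain higher-priority queues before lower-priority ones.
        for name in ("on_site_images", "off_site_images",
                     "on_site_links", "off_site_links"):
            if queues[name]:
                return queues[name].popleft()
        return None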
Also located in the diagnostics screen is an advanced button
which will let you adjust some specific timeout values. I
recommend only messing with it if you have a good reason.
Registration
This program is distributed as Shareware, and Ad-Software
(hence the advertisements on the bottom of the page). For more
information on my registration policy, see http://www.sb-software.com/credit/
Contacting the Author:
You can reach me via email at smbaker@primenet.com
My website is http://www.sb-software.com
You can find the SBWcc page at http://www.sb-software.com/sbwcc
You can contact me via US mail at
Scott M Baker
2241 W Labriego
Tucson, Az 85741
Revision History
- Version 1.0
- Version 1.1
- Fixed problem with jpeg viewer not always
displaying correct file
- Support in thumbnail viewer for GIF files
- Added duplicate checking
- Version 1.2
- Added 'delete' to viewer (use DEL key)
- Support for authentication
- More robust error handling
- Version 1.3
- Fixed problem with assert() on systems with large
fonts
- Added note about demographic information
- Version 1.4
- Fixed delete in viewer not working
- Min/Max file size settings now ignore html files
- Fixed glitch in http error handling code
- Version 1.5
- Added more mime content types to filter section
- Included winsock error code on winsock errors
- Added host name to error log and dialog
- Robustized bmp viewer
- Added advanced error settings dialog
- Added error box timeout to advanced error
settings dialog
- Added support for relative redirects
- Parser was erroneously removing trailing slashes
- Added timeout mechanism (diagnostics, advanced
button)
- Version 1.6
- Fixed links that had a ':' in them somewhere
- Expand amp,quot,lt,gt escapes in links
- Added minimize button
- Fixed problem with wrong dialog box size on
systems with large fonts
- Added browser id setting
- Added advanced general options dialog
- Reduced adsoftware flicker problem
- Version 1.7
- Better explanation text for download path options
- Added support for BASE tags
- Added Q-View
- Better handling of redirect
- Better handling of stopped transfers
- Version 1.8
- Fixed restart delay not being saved between
sessions
- Viewer now remembers previous mode and slideshow
delay
- Viewer remembers previous position
- Added support for fake virtual addressing
- Added maximum link depth
- Added links-as-images option
- Added forward-only option
- Defaulted spider mode to spider
- Defaulted download mode to include URL
- Partial support for imagemaps
- Version 1.9
- Stopped viewer from thinking it's sortpics
- Fixed links-as-images flag
- Viewer now opens last filename automatically
- Version 2.0
- Added tools tab
- Added use bookmark command
- Added open image viewer command
- Viewer: added rename (F2)
- Fixed viewer resize problems with large fonts
- Fixed buffer overrun problem with status msg
- Version 2.1
- Transparent gif support in viewer
- fixed numerous viewer memory leaks
- New version of html parser, url, and buffer
routines
- Compiled with smalloc
- Eliminate printing of corrupted url in debug
window
- Version 2.2
- Upgrade to adsoftware V3
- Upgrade to BCB 5
- Added referer tracking; Added initial referrer to
advanced options
- Switch from malloc to MALLOC -- fixes bizarre
thumbnail bug
- Designed custom icon for SBWcc in place of
generic delphi icon
- Many updates to built in viewer (see http://www.sortpics.com)
- Version 2.3
- Added try-catch block to setfilename in viewer
- Removed adsoftware; wcc is now shareware
- Version 2.4
- Changed registration price
- Added receive log
- Better handling of invalid jpeg files in jpeg
viewer
- Version 2.5
- Fixed crash when receiving very long URLs
- Recognize & report replies that don't have status lines
- Report error #501 as Not Supported error
- Add error codes to "Unknown Error" status messages
- Fixed crash on URLs/headers that have %s in them
- Fixed crash on requests with really long headers
- Added advanced page to setup page
- Version 2.6
- Added spot to enter reg code in wait screen
- Switch to TSBSocket code
- support for https websites
- Added checkbox to control following of https from http
- Fix bogus error messages when connection failed and mime-type filtering enabled
- Change default download dir to program_dir\download
- Added button to clear received files list
- Added accept: */* header