SBWebCamCorder

(c) 1998-2006 Dr. Scott M Baker, smbaker@sb-software.com


Introduction/Purpose

SBWcc is designed to automatically capture information from the World-Wide-Web. It is designed to do the following jobs:

I'll describe each of these situations and when you might want to use SBWcc in a separate section below.

SBWcc has the following special features which are intended to make it easy to use and/or provide special functionality:


Quick Start

Basic WWW Information:

A 'page' which you see in your web browser is typically made up of several different parts:

When your browser loads a page, it first loads the HTML text file. It parses this file and then requests each of the images, each as a separate file. The key point here is that what you see on your web browser's screen is made up of many files.
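
To make the "many files" idea concrete, here is a minimal Python sketch that lists the image files a page refers to. It is only an illustration and not SBWcc's own code; the sample page and the names ImageLister and list_images are made up for this example.

```python
from html.parser import HTMLParser


class ImageLister(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Each <img> tag names a separate file the browser must request.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.images.append(value)


def list_images(html_text):
    """Return the image references found in html_text, in page order."""
    parser = ImageLister()
    parser.feed(html_text)
    parser.close()
    return parser.images


# Hypothetical page: one HTML file that refers to two image files.
sample_page = """
<html><body>
<h1>My page</h1>
<img src="logo.gif">
<p>Some text</p>
<img src="photos/cat.jpg" alt="cat">
</body></html>
"""

print(list_images(sample_page))
```

Viewing this sample page would take three requests: one for the HTML text and one for each of the two images.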

All web documents are specified by a Uniform Resource Locator (URL). The URL can be thought of as the file's address or "where it lives". To get a particular file, you need to know its URL. For example, a sample URL is http://www.newsrobot.com/, the address of my website. WWW URLs begin with the prefix "http://" (or "https://" for secure sites).
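
As a small illustration, Python's standard urllib.parse module can split a URL into its parts. This is just a sketch to show what a URL is made of; SBWcc does not require you to do this yourself.

```python
from urllib.parse import urlsplit

# Split the sample URL into its prefix (scheme), host name and path.
parts = urlsplit("http://www.newsrobot.com/")

print(parts.scheme)  # the prefix before "://"
print(parts.netloc)  # the host the file lives on
print(parts.path)    # the location of the file on that host
```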

Default Setup:

SBWcc is preconfigured for the default option that most people wish to perform: Spidering an entire site and downloading all files to your hard drive. If you wish to configure SBWcc to download just a single file, or a single page with images, then you'll have to change a few configuration options as directed in the following sections. A few common sticking points for novices:

Capturing a single WWW file:

There may be times when you wish to capture one and only one file from the web. This is usually if you know the exact URL of the file that you want. Here's how to do it:

  1. Select the "Run" tab and type the URL into the URL box at the top.
  2. Select the "Setup/Spider" tabs and select the "Single" button in the Spider Method box.
  3. Select the "Run" tab and press <Start>
  4. The URL you entered will be downloaded and stored on your hard drive.

Capturing a page with its images:

Use this option if you want to capture a page, along with all of the images stored in the page. This will result in several files -- one file for the page's text, and a JPEG/GIF file for each of the images. Here are the steps:

  1. Select the "Run" tab and type the URL into the URL box.
  2. Select the "Setup/Spider" tabs and select the "Page" button in the Spider Method box.
  3. Select the "Run" tab and press <Start>
  4. First the page text will be downloaded and saved as an HTML file. Then, each of the images will be downloaded and saved as either a GIF or JPG file.
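
Image references inside a page are often relative to the page's own URL, so they have to be turned into full URLs before they can be requested. The sketch below shows one way to do that in Python. The example.com addresses, the function names, and the rule for choosing a local file name are all assumptions made for illustration; SBWcc may name its saved files differently.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_image_urls(page_url, image_refs):
    """Resolve each image reference against the URL of the page it came from."""
    return [urljoin(page_url, ref) for ref in image_refs]


def local_name(url):
    """Pick a file name from the last part of the URL path (assumed rule)."""
    name = posixpath.basename(urlsplit(url).path)
    return name or "index.html"


page_url = "http://www.example.com/news/today.html"
refs = ["logo.gif", "/img/cam.jpg", "http://other.example.com/a.gif"]

for url in absolute_image_urls(page_url, refs):
    print(url, "->", local_name(url))
```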

Spidering an entire web site:

Spidering refers to the process of requesting a page and then following its links to get other pages. Spidering can return a VERY large number of files, since it's hard to tell ahead of time exactly what will be included. Spidering is useful for times when you know that you want to download the entire site to your computer, including all pages, images, and links. Here are the steps:

  1. Select the "Run" tab and type the URL into the URL box.
  2. Select the "Setup/Spider" tabs and select the "Spider" button in the Spider Method box.
  3. Select the "Run" tab and press <Start>
  4. First the page whose URL you entered will be downloaded, followed by all of its images. Then each of the links will be followed, resulting in more pages and more images.

By default, the Spider option is configured to only capture pages in the site which you entered. For example, if you entered "http://www.newsrobot.com/", then only links at www.newsrobot.com would be followed. This prevents the spider algorithm from entering other potentially unintended sites and downloading their content. You can override this behavior by checking the "Allow off-site pages when spidering" option located on the "Setup/Spider" tab. Checking this option may cause SBWcc to run forever, requesting a tremendous number of pages, and it should be used with caution.
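
The on-site rule can be sketched in a few lines of Python. This is a simplified illustration and not SBWcc's actual algorithm: it walks a small made-up "web" held in a dictionary instead of the real Internet, and the names FAKE_WEB, spider and fake_links are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit

# A tiny pretend web: each URL maps to the links found on that page.
FAKE_WEB = {
    "http://www.newsrobot.com/": [
        "http://www.newsrobot.com/a.html",
        "http://www.other-site.example/",
    ],
    "http://www.newsrobot.com/a.html": [
        "http://www.newsrobot.com/",
        "http://www.newsrobot.com/b.html",
    ],
    "http://www.newsrobot.com/b.html": [],
    "http://www.other-site.example/": ["http://www.other-site.example/more.html"],
    "http://www.other-site.example/more.html": [],
}


def spider(start_url, get_links, allow_off_site=False):
    """Visit start_url, then follow links, staying on-site unless told otherwise."""
    start_host = urlsplit(start_url).netloc.lower()
    seen = {start_url}          # never request the same URL twice
    queue = deque([start_url])  # pages still waiting to be requested
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link in seen:
                continue
            on_site = urlsplit(link).netloc.lower() == start_host
            if on_site or allow_off_site:
                seen.add(link)
                queue.append(link)
    return visited


def fake_links(url):
    return FAKE_WEB.get(url, [])


on_site_only = spider("http://www.newsrobot.com/", fake_links)
everything = spider("http://www.newsrobot.com/", fake_links, allow_off_site=True)
print(len(on_site_only), "pages on-site;", len(everything), "pages with off-site allowed")
```

On the real web the second call has no natural stopping point, which is why the off-site option should be used with caution.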

Capturing WebCam output:

A WebCam is typically a camera that provides live images to the Internet. There are many different WebCams out there, including:

These cameras are typically accessed via a WWW browser. By browsing to a camera, you can see what is presently occurring at that site, and the image will update automatically as the camera feeds new data to the Internet.

However, sometimes you would like to 'capture' or 'record' images from a WebCam to your personal computer for later viewing. That way, you can look over the data at a later time and see if anything interesting has happened while you were recording. Since the images are all stored to your local hard drive, you can view them much more rapidly offline.

The first step is determining the URL of the camera that you wish to capture. This is important:

You should use the URL of the image itself rather than the URL of the page the image is displayed on.

The way to determine the 'image url' is as follows:

  1. Go to the website in your browser.
  2. Find the image of the webcam that you wish to capture. Right-click in the middle of the image. The menu that you get will depend on whether you are using Microsoft Internet Explorer or Netscape Navigator.
  3. If you are using Internet Explorer, then you will see a menu which includes an item called "properties". Click on this. You will be given a dialog that has information for the image, including a line which says "Address (URL)". This is the URL of the image.
  4. If you are using Netscape Navigator, then you will see a menu which includes an item called "view image". Click on it. The image will expand to full-screen and the URL displayed in Netscape Navigator's window will be that of the image.

If you can't figure out the "Image URL", then don't worry -- SBWcc will still work properly if you give it the page URL; you will just download some extra information that isn't really necessary. Here are the steps to begin downloading:

  1. Select the "Run" tab and enter the URL into the URL box. Use the Image Url rather than the page URL if possible.
  2. Select the "Setup (Download)" tab. If you used the Image Url in step #1, then check the "single" checkbox, otherwise check the "page" checkbox.
  3. Select the "Setup (General)" tab. You will need to enable the "Auto-Restart" checkbox and enter the approximate delay between pictures in the delay field. I'll talk about this more below.
  4. Select the "Run" tab and press <Start>
  5. SBWcc will download the URL that you have entered.
  6. Once download is complete, SBWcc will pause for an amount of time equal to what you entered in the delay field in step #3. Once the delay is complete, SBWcc will start all over again, and re-download the images, saving the image if it has changed.

Most WebCams do not update their feeds instantly -- in most cases the feeds are updated at a periodic rate, for example, once every 30 seconds. Thus, it would make no sense to download more often than once every 30 seconds, since anything faster would yield duplicate images.

This is what the delay setting is for. Set it for what you believe the delay between picture updates is. Most WebCam sites will include an estimate of the delay rate somewhere on their page.

If you attempt to download images faster than the site is updating them, then duplicates will be received. SBWcc automatically finds duplicates and does not save them to your computer.
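
The download, pause, repeat cycle with duplicate checking can be sketched as follows. This is only an illustration under stated assumptions: how SBWcc actually detects duplicates is not documented here, so the sketch simply compares a SHA-256 hash of each image with the hash of the last image saved. The function capture_loop and the fake frames are hypothetical, and the fetch step is passed in so the example runs without a network connection.

```python
import hashlib
import time


def capture_loop(fetch_image, save_image, delay_seconds, rounds, sleep=time.sleep):
    """Fetch an image 'rounds' times, saving it only when it has changed."""
    last_digest = None
    saved = 0
    for i in range(rounds):
        data = fetch_image()
        digest = hashlib.sha256(data).hexdigest()
        if digest != last_digest:
            # New picture: keep it and remember what it looked like.
            save_image(data)
            last_digest = digest
            saved += 1
        if i < rounds - 1:
            sleep(delay_seconds)  # wait for the camera to update
    return saved


# Pretend camera that updates more slowly than we are polling it.
frames = iter([b"frame-1", b"frame-1", b"frame-2", b"frame-2", b"frame-3"])
saved_frames = []
count = capture_loop(
    fetch_image=lambda: next(frames),
    save_image=saved_frames.append,
    delay_seconds=30,
    rounds=5,
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(count, "of 5 downloads were new images")
```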


Setup Options

The setup tab is divided into several sub-tabs representing different aspects of configuration:

Each of these will be discussed below:

General:

Download:

Duplicate:

Spider:

 


Dealing with Referrer Checking

Lots of websites use something called the 'referrer header' to check the validity of downloads. This is done to make sure you're not requesting a single image from the site without viewing an entire page (which may contain advertisements and other junk).

For example, let's assume you are viewing a web page, 'www.website.com/page1.html' and it contains the image 'www.website.com/image1.gif' that you want SBWcc to pull down.

On a normal website, you can enter the URL for 'www.website.com/image1.gif' into SBWcc and it'll download just fine. However, on a referrer-checked website, if you enter 'www.website.com/image1.gif', it might pop up some kind of 'invalid request' or 'unauthorized request' image instead...

The way around this is to use the 'initial referrer header' setting in SBWcc's advanced settings. To get there, click the "Setup" tab, then the "General" tab, and finally the "Advanced" button. This will bring up the advanced settings dialog box. The thing we're interested in here is the 'initial referrer header'.

In the initial referrer header box, you'll need to type in the full url of the page that contains the image you're trying to download. In our example, this might be 'www.website.com/page1.html'. It'll be different in your case.
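
For the curious, this is roughly what such a request looks like when built by hand with Python's standard urllib.request module. It is a sketch for illustration and not SBWcc's code; the helper name build_image_request is made up. Note that the HTTP header itself is spelled "Referer" (a historical misspelling that stuck).

```python
from urllib.request import Request


def build_image_request(image_url, referring_page_url):
    """Build a request for an image that claims to come from the given page."""
    request = Request(image_url)
    request.add_header("Referer", referring_page_url)
    return request


request = build_image_request(
    "http://www.website.com/image1.gif",
    "http://www.website.com/page1.html",
)
print(request.full_url)
print(request.get_header("Referer"))
```

The request is only built here, not sent; passing it to urllib.request.urlopen would perform the actual download.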


Support for Authentication/Password Sites

Many web sites require passwords for access. SBWcc has been designed to support downloading from these sites. You will need to enter the correct password information into SBWcc's authentication settings. Authentication is located under the authentication tab and includes settings for as many sites/URLs as you require.

Press the <add> button to add a new authentication setting. You will be prompted for the following:

AdultCheck Sites:

Many people have asked me for information on how to access sites that are protected by the AdultCheck Verification system, which differs from traditional web authentication. The AdultCheck verification system works something like the following:

  1. The user is requested to enter his/her AdultCheck key
  2. When the submit button is pressed, the key is sent to the AdultCheck website for verification
  3. If the key is good, the user's browser is redirected to a web page on the originating system (the pay area)

The key observation is that the URL which the browser in step #3 is directed to is usually unprotected. This URL is what you need to feed into SBWcc in order to download from the protected site. Normally the URL will be displayed by the browser near the top of the browser window. However, this is not always the case -- sites with "frames" typically hide the correct URL from you, and may require a bit of investigation on your part.

It's not my purpose to give you specific instructions on accessing sites, but a good hint is to use your browser to enter the site as normal, determine the URL of the pay area, and then enter that URL into SBWcc.


Filter Options

Several filtering options are built into SBWcc. They are all located under the 'Filter' tab. The first is a box of download file types. These types represent Internet MIME Content Types and loosely relate to file extensions (e.g. Text/Html = .html, .htm and Image/Jpeg = .jpeg, .jpe, .jpg, etc.). Unchecking a file type will prevent download of those files to your hard drive.

The second filtering method is by file size. You can set both a minimum and a maximum file size.

It should be noted that in Spider (or Page) mode, SBWcc must always download HTML files. This is because the HTML files contain the hyperlinks that are necessary for spidering. Thus, the filtering options will not prevent HTML files from being downloaded. (Unchecking the Text/Html file type will prevent HTML files from being saved, but not transferred.)
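
The difference between "transferred" and "saved" can be sketched as a small decision function. This is a guess at the logic for illustration only, not SBWcc's actual code: the function name decide, its parameters, and the assumption that the size limits apply to saving HTML files are all hypothetical.

```python
HTML_TYPE = "text/html"


def decide(content_type, size_bytes, allowed_types, min_size, max_size,
           spider_or_page_mode):
    """Return (transfer, save) for one file, given the filter settings."""
    # Strip any parameters such as "; charset=..." from the content type.
    mime = content_type.split(";")[0].strip().lower()
    type_ok = mime in allowed_types
    size_ok = min_size <= size_bytes <= max_size
    save = type_ok and size_ok
    # HTML must still be transferred when spidering, to find its links.
    transfer = save or (spider_or_page_mode and mime == HTML_TYPE)
    return transfer, save


only_jpeg = {"image/jpeg"}
print(decide("text/html; charset=iso-8859-1", 2000, only_jpeg, 0, 10**6, True))
print(decide("image/jpeg", 50000, only_jpeg, 1000, 100000, True))
print(decide("image/gif", 500, only_jpeg, 1000, 100000, True))
```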


Diagnostics

The diagnostics screen is really designed for experienced power users, but even novices can make some use of it. Every time a failure is encountered, it is displayed in the diagnostic screen. There is also a detailed message listing of what's being sent to and from your computer. If something goes wrong, you can try to figure out what happened from this screen.

Located on the diagnostics screen is the "Q-View" button. A "queue" is really just a fancy term for a list, and these lists contain the files that SBWcc is going to be downloading. By displaying the Q-Viewer, you can get a sneak peek at the upcoming files to be downloaded and delete some of them if you don't want them. The queues are always processed in a specific priority: On-Site Images, Off-Site Images, On-Site Links, and Off-Site Links.
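
The priority order can be illustrated with a short Python sketch. SBWcc keeps several queues internally; for simplicity this example uses a single priority queue instead, which is an assumption made for illustration, and the class DownloadQueue and the example URLs are made up.

```python
import heapq
from itertools import count

# Lower number = processed first, matching the order described above.
PRIORITY = {
    "on-site image": 0,
    "off-site image": 1,
    "on-site link": 2,
    "off-site link": 3,
}


class DownloadQueue:
    def __init__(self):
        self._heap = []
        self._order = count()  # keeps first-in, first-out order within a kind

    def add(self, kind, url):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._order), url))

    def remove(self, url):
        """Delete an upcoming download, as you can from the Q-Viewer."""
        self._heap = [entry for entry in self._heap if entry[2] != url]
        heapq.heapify(self._heap)

    def peek_all(self):
        """List the upcoming downloads in the order they will be processed."""
        return [url for _, _, url in sorted(self._heap)]

    def pop(self):
        return heapq.heappop(self._heap)[2]


q = DownloadQueue()
q.add("on-site link", "http://a.example/page2.html")
q.add("off-site image", "http://b.example/x.gif")
q.add("on-site image", "http://a.example/1.jpg")
q.add("off-site link", "http://b.example/")
q.add("on-site image", "http://a.example/2.jpg")

upcoming = q.peek_all()
q.remove("http://b.example/x.gif")
first = q.pop()
print(upcoming)
print(first)
```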

Also located in the diagnostics screen is an advanced button which will let you adjust some specific timeout values. I recommend only messing with it if you have a good reason.


Registration

This program is distributed as Shareware, and Ad-Software (hence the advertisements on the bottom of the page). For more information on my registration policy, see http://www.sb-software.com/credit/

Contacting the Author:

You can reach me via email at smbaker@primenet.com

My website is http://www.sb-software.com

You can find the SBWcc page at http://www.sb-software.com/sbwcc

You can contact me via US mail at

Scott M Baker
2241 W Labriego
Tucson, Az 85741


Revision History