SBWebCamCorder
(c) 1998-2006 Dr. Scott M Baker, smbaker@sb-software.com
Introduction/Purpose
SBWcc automatically captures information from the
World-Wide-Web. It is designed to do the following jobs:
- Capture any single file
- Capture a page with all of its embedded images
- 'Spider' a website and capture all of the pages on it
- Capture time-changing data such as WebCams
I'll describe each of these situations and when you might want
to use SBWcc in a separate section below.
SBWcc has the following special features which are intended to
make it easy to use and/or provide special functionality:
- Fully functional shareware -- no crippled features, no
expirations.
- Built-in full-screen and thumbnail JPEG image viewers to
show you transfers as they occur in real time. (Useful
for picture or webcam sites)
- Rejection of data (non-html) files based on byte size --
automatically skip files that are too big or too small.
- Ability to detect duplicates and/or rename files of
different content. (Useful for capturing WebCams in
which the same file changes over time)
- Manual queue viewer and editor to allow power users to
manually delete unwanted pages before they're downloaded
- You may limit how deep links are followed
- You may limit link following to the forward direction
only if desired
Quick Start
Basic WWW Information:
A 'page' which you see in your web browser is typically made
up of several different parts:
- The text of the page itself, in
HTML format. (i.e. the main file, specified by the URL
you provide)
- Embedded images, each stored as
a separate file
- Links, which are URLs to other
pages.
When your browser loads a page, it first loads the HTML text
file. It parses this file and then requests each of the images
as a separate file. The key point here is that what you see
on your web browser's screen is made up of many files.
All web documents are specified by a Uniform Resource Locator
(URL). The URL can be thought of as the file's address or
"where it lives". To get a particular file, you need to
know its URL. For example, a sample URL is http://www.newsrobot.com/, the
address of my website. All WWW URLs begin with the prefix "http://".
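As a rough illustration (this is not SBWcc's own code), the short
Python sketch below fetches one page and lists the separate image
and link files that make it up; the URL is just an example:

    # Sketch: a "page" is really many files -- the HTML text plus
    # one additional file for every embedded image.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class PageParts(HTMLParser):
        def __init__(self):
            super().__init__()
            self.images = []   # <img src=...> : embedded images
            self.links = []    # <a href=...>  : links to other pages

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "img" and "src" in attrs:
                self.images.append(attrs["src"])
            elif tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])

    url = "http://www.newsrobot.com/"     # the main file's URL
    html = urlopen(url).read().decode("latin-1", "replace")
    parts = PageParts()
    parts.feed(html)

    for src in parts.images:
        print("image file :", urljoin(url, src))
    for href in parts.links:
        print("linked page:", urljoin(url, href))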
Default Setup:
SBWcc is preconfigured for the default option that most people
wish to perform: Spidering an entire site and downloading all
files to your hard drive. If you wish to configure SBWcc to
download just a single file, or a single page with images, then
you'll have to change a few configuration options as directed in
the following sections. A few common sticking points for novices:
- By default, SBWcc will replicate the web server's
directory structure on your hard drive. Therefore, if
your download path is set to c:\program files\sbwcc, and
you download the page http://www.myhost.com/dir1/page1,
then the page will end up on your hard drive in
c:\program files\sbwcc\dir1\page1. (A sketch of this
mapping appears just after this list.) The point: if you're
not too experienced with the Windows file system, then
you might get a bit confused navigating to the files you
want -- my suggestion is to use Windows Explorer and
"explore" around if you get confused.
- You can also make SBWcc stuff all of the files into the
same directory by unchecking an option in the
Setup/Download page. This eliminates the above problem,
but means that everything you download is lumped together
(kind of a mess!)
- Spidering entire sites tends to take a lot of time and
downloads a lot of files. Try to limit your http://
specification to be as precise as possible to get the
files you want.
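For the directory-structure point above, here is a small Python
sketch (my own illustration, not SBWcc's actual code) of how a
URL's path can be mapped onto a download directory. The download
path and URL are the example values from the list:

    import os
    from urllib.parse import urlparse

    def local_path(url, download_dir=r"c:\program files\sbwcc",
                   append_url=True):
        parsed = urlparse(url)
        if not append_url:
            # "Stuff everything into one directory": keep only the file name.
            name = os.path.basename(parsed.path) or "index.html"
            return os.path.join(download_dir, name)
        # Replicate the server's directory structure under the download path.
        relative = parsed.path.lstrip("/").replace("/", os.sep) or "index.html"
        return os.path.join(download_dir, relative)

    print(local_path("http://www.myhost.com/dir1/page1"))
    # -> c:\program files\sbwcc\dir1\page1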
Capturing a single WWW file:
There may be times when you wish to capture one and
only one file from the web. This is usually the case when you know
the exact URL of the file that you want. Here's how to do it:
- Select the "Run" tab and type the URL into the
URL box at the top.
- Select the "Setup/Spider" tabs and select the
"Single" button in the Spider Method box.
- Select the "Run" tab and press <Start>
- The URL you entered will be downloaded and stored on your
hard drive.
Capturing a page with its images:
Use this option when you want to capture a page, along with
all of the images embedded in the page. This will result in several
files -- one file for the page's text, and a JPEG/GIF file for
each of the images. Here are the steps:
- Select the "Run" tab and type the URL into the
URL box.
- Select the "Setup/Spider" tabs and select the
"Page" button in the Spider Method box
- Select the "Run" tab and press <Start>
- First the page text will be downloaded and saved as an
HTML file. Then, each of the images will be downloaded and
saved as either a GIF or JPG file.
Spidering an entire web site:
Spidering refers to the process of requesting a page and then
following its links to get other pages. Spidering can return a very
large number of files, since it's hard to tell ahead of time exactly
what will be included. Spidering is useful for times when you
know that you want to download the entire site to your computer,
including all pages, images, and links. Here are the steps:
- Select the "Run" tab and type the URL into the
URL box.
- Select the "Setup/Spider" tabs and select the
"Spider" button in the Spider Method box.
- Select the "Run" tab and press <Start>
- First the page whose URL you entered will be downloaded,
followed by all of its images. Then each of the links
will be followed resulting in more pages and more images.
By default, the Spider option is configured to only capture
pages in the site which you entered. For example, if you entered "http://www.newsrobot.com/",
then only links at www.newsrobot.com
would be followed. This prevents the spider algorithm from
entering other potentially unintended sites and downloading their
content. You can override this behavior by checking the
"Allow off-site pages when spidering" option located on
the "Setup/Spider" tab. Checking this option may cause
SBWcc to run forever, requesting a tremendous number of pages,
and it should be used with caution.
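To make the spidering process concrete, here is a simplified
Python sketch of a same-site spider. The fetch_links() helper is
assumed (it would parse a page and return its hyperlinks); this
illustrates the idea rather than the algorithm SBWcc actually
uses:

    from collections import deque
    from urllib.parse import urlparse

    def spider(start_url, fetch_links, allow_offsite=False, max_pages=500):
        start_host = urlparse(start_url).netloc
        queue, seen = deque([start_url]), {start_url}
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            print("downloading", url)        # save the page and its images here
            for link in fetch_links(url):
                if link in seen:
                    continue                 # never request the same URL twice
                if urlparse(link).netloc != start_host and not allow_offsite:
                    continue                 # stay on the site you entered
                seen.add(link)
                queue.append(link)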
Capturing WebCam output:
A WebCam is typically a camera that provides live images to
the Internet. There are many different WebCams out there,
including:
- Scenery Cameras
- Office Cameras
- Strange/Bizarre Cameras (Pets, People's Houses, etc)
- Adult Cameras
These cameras are typically accessed via a WWW browser. By
browsing to a camera, you can see what is presently occurring at
that site, and the image will update automatically as the camera
feeds new data to the Internet.
However, sometimes you would like to 'capture' or 'record'
images from a WebCam to your personal computer for later viewing.
That way, you can look over the data at a later time and see if
anything interesting has happened while you were recording. Since
the images are all stored to your local hard drive, you can view
them much more rapidly offline.
The first step is determining the URL of the camera that you
wish to capture. This is important:
You should use the URL of the image itself rather
than the URL of the page the image is displayed on.
The way to determine the 'image URL' is as follows:
- Go to the website in your browser
- Find the image of the webcam that you wish to capture. Right-click
in the middle of the image. The menu that
you get will depend on whether you are using Microsoft
Internet Explorer or Netscape Navigator.
- If you are using Internet Explorer, then you will see a
menu which includes an item called
"properties". Click on this. You will be given
a dialog that has information for the image, including a
line which says "Address (URL)". This is the
URL of the image.
- If you are using Netscape Navigator, then you will see a
menu which includes an item called "view
image". Click on it. The image will expand to
full-screen and the URL displayed in Netscape Navigator's
window will be that of the image.
If you can't figure out the "Image URL", then don't
worry -- SBWcc will still work properly if you give it the page
URL; you will just download some extra information that isn't
really necessary. Here are the steps to begin downloading:
- Select the "Run" tab and enter the URL into the
URL box. Use the Image URL rather than the page URL if
possible.
- Select the "Setup (Download)" tab. If you used
the Image URL in step #1, then check the
"single" checkbox, otherwise check the
"page" checkbox.
- Select the "Setup (General)" tab. You will need
to enable the "Auto-Restart" checkbox and enter
the approximate delay between pictures in the delay
field. I'll talk about this more below.
- Select the "Run" tab and press <Start>
- SBWcc will download the URL that you have entered.
- Once download is complete, SBWcc will pause for an amount
of time equal to what you entered in the delay field in
step #3. Once the delay is complete, SBWcc will start all
over again, and re-download the images, saving the image
if it has changed.
Most WebCams do not update their feeds instantly -- in most
cases the feeds are updated at a periodic rate, for example, once
every 30 seconds. Thus, it would make no sense to download at a
rate faster than once every 30 seconds, since anything faster would
yield duplicate images.
This is what the delay setting is for. Set it for what you
believe the delay between picture updates is. Most WebCam sites
will include an estimate of the delay rate somewhere on their
page.
If you attempt to download images faster than the site is
updating them, then duplicates will be received. SBWcc
automatically finds duplicates and does not save them to your
computer.
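Putting the delay and the duplicate handling together, a capture
loop might look roughly like the Python sketch below. The file
names and the MD5-based duplicate test are my own choices for
illustration, not SBWcc's internals:

    import hashlib, time
    from urllib.request import urlopen

    def record_webcam(image_url, delay_seconds=30, shots=10):
        last_hash = None
        for n in range(shots):
            data = urlopen(image_url).read()
            digest = hashlib.md5(data).hexdigest()
            if digest != last_hash:          # only save images that changed
                with open(f"webcam_{n:04d}.jpg", "wb") as f:
                    f.write(data)
                last_hash = digest
            time.sleep(delay_seconds)        # wait about one update interval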
Setup Options
The setup tab is divided into several sub-tabs representing
different aspects of configuration:
- General: Miscellaneous Options
- Download: Download directory & related settings
- Duplicates: The duplicate checker
- Spider: Options for spidering entire web sites
Each of these will be discussed below:
General:
- Auto-restart. Primarily intended
for sites such as webcams where the content changes each
time you request it. Auto-restart will cause SBWcc to
re-download the page after the specified interval has
elapsed.
- Proxy Server. Proxy servers are
used in certain LAN settings where one computer acts as
an intermediary to the web for all computers on the LAN.
Download:
- Download Path. The download path
is the location on your computer where all downloaded
files will be placed.
- Append URL to path. Checking this
option will cause a directory structure to be created on
your hard drive which matches the structure on the
website. Unchecking this option will put all of the files
you've downloaded into a single directory.
Duplicates:
- Duplicate Checker. This controls
the mode of the duplicate checker. It may be set to
"do not reject duplicates" in which case no
duplicates are rejected, "reject all
duplicates" in which case all duplicates are
rejected, or "only reject not-modified". If the
latter option is checked, then SBWcc will verify each
duplicate with the webserver to see if it has changed
(a sketch of such a check appears after this list).
Although this still requires a significant amount of
time, it is still much quicker than re-downloading
everything.
- URL's Remembered. This specifies
the number of URLs that are currently in the duplicate
checker database. You may use the clear button to purge
any or all of them.
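The "only reject not-modified" mode corresponds to a conditional
HTTP request. Here is a rough Python sketch of such a check; the
header names are standard HTTP, but the surrounding logic is only
an illustration:

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    def changed_since(url, last_modified):
        req = Request(url, headers={"If-Modified-Since": last_modified})
        try:
            reply = urlopen(req)             # 200 OK: the content has changed
            return True, reply.headers.get("Last-Modified", last_modified)
        except HTTPError as e:
            if e.code == 304:                # 304 Not Modified: skip the re-download
                return False, last_modified
            raise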
Spider:
- Spider Method. This controls how
the spider algorithm functions:
- Single: Only the page you specify will be
downloaded.
- Page: The page you specified, plus all of its
images will be downloaded, but none of the
hyperlinks will be followed.
- Spider: SBWcc will download the page you
specified, all of its images, and then follow
each hyperlink. This will continue until all
hyperlinks have been exhausted.
- Spider Options
- Off-site images. If the
current page includes an image which is located
on another server, that image will be downloaded.
- Off-site pages. If the
current page includes a hyperlink to a page which
is located on another server, that page will be
downloaded.
- Forward links only. If
checked, then only links in directories deeper
than the current directory will be followed. For
example, if the current page is http://foo/bar/sam,
then http://foo/bar/sam/joe
would be followed, but http://foo/bar/tom
would not. (A sketch of this test appears after
this list.)
- Treat linked images as inlines.
Many picture-oriented sites include links to
JPEG/GIF images; when you click on the link you
get a full-screen image. However, this image is
still technically a hyperlink, and SBWcc would
normally consider it like a normal hyperlink. If
this option is checked, then JPEG/GIF images
included as hyperlinks get a special status and
are treated just like inline images. (Good for
picture sites)
- Maximum Link Depth. SBWcc keeps
track of how deep it gets as it follows each link. In
spider mode, SBWcc will continue to follow links until it
exhausts them all, which can take a very long (or
infinite) time. You can specify a limit on how many links
deep SBWcc will go -- this places a bound on how far out
of hand things can get.
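As a rough sketch of the forward-only and link-depth limits (my
own reading of the rules, not SBWcc's exact code), a link filter
might look like this in Python:

    from urllib.parse import urlparse

    def follow_link(current_url, link_url, depth, max_depth=5,
                    forward_only=True):
        if depth > max_depth:
            return False                     # bound how deep the spider goes
        if forward_only:
            base = urlparse(current_url).path.rstrip("/") + "/"
            # Only follow links whose path lies under the current directory.
            return urlparse(link_url).path.startswith(base)
        return True

    print(follow_link("http://foo/bar/sam", "http://foo/bar/sam/joe", 1))  # True
    print(follow_link("http://foo/bar/sam", "http://foo/bar/tom", 1))      # False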
Dealing with Referrer Checking
Lots of websites use something called the 'referrer
header' to check the validity of downloads. This is done to make
sure you're not requesting a single image from the site without
viewing an entire page (which may contain advertisements and
other junk).
For example, let's assume you are viewing a web page,
'www.website.com/page1.html' and it contains the image
'www.website.com/image1.gif' that you want SBWcc to pull down.
On a normal website, you can enter the URL for
'www.website.com/image1.gif' into SBWcc and it'll download just
fine. However, on a referrer-checked website, if you enter
'www.website.com/image1.gif', it might pop up some kind of
'invalid request' or 'unauthorized request' image instead...
The way around this is to use the 'initial referrer header'
setting in SBWcc's advanced settings. To get there, click the
"Setup" tab, then the "General" tab, and
finally the "Advanced" button. This will bring up the
advanced settings dialog box. The thing we're interested in here
is the 'initial referrer header'.
In the initial referrer header box, you'll need to type in the
full URL of the page that contains the image
you're trying to download. In our example, this might be
'www.website.com/page1.html'. It'll be different in your case.
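In code terms, supplying a referrer simply means sending a
Referer header along with the image request. The Python sketch
below uses the hypothetical URLs from the example above:

    from urllib.request import Request, urlopen

    image_url = "http://www.website.com/image1.gif"
    page_url = "http://www.website.com/page1.html"   # page that contains the image

    req = Request(image_url, headers={"Referer": page_url})
    data = urlopen(req).read()
    with open("image1.gif", "wb") as f:
        f.write(data)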
Support for Authentication/Password Sites
Many web sites require passwords for access. SBWcc has been
designed to support downloading from these sites. You will need
to enter the correct password information into SBWcc's
authentication settings. Authentication is located under the
Authentication tab and includes settings for as many sites/URLs
as you require.
Press the <add> button to add a new authentication
setting. You will be prompted for the following:
- Host Name/Url Substring. This is
a text string which uniquely identifies the URLs to
which the authentication will apply. For example, if the
site you want to access is "www.mysite.com",
then you could simply enter the hostname "www.mysite.com"
into the box. Some sites may have multiple
authentications for different sections. For example, you
could have a separate password for "www.mysite.com/section1/"
and "www.mysite.com/section2/".
In such cases, just enter the longer URL strings to
differentiate between them. A good tip is to keep the
Host Name/Url Substring as short as possible to uniquely
identify the authenticated pages. If a simple hostname
will work (i.e. "www.mysite.com"),
then just use that. There's no reason to get fancy. (A
sketch of this substring matching appears after this
list.)
- Authentication 'user name'. This
is the authentication user name, as assigned to you by
the site administrator.
- Authentication 'password'. This
is the password, as assigned to you by the site
administrator.
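Conceptually, the substring matching works something like the
Python sketch below, which picks the credentials whose substring
matches the URL and sends standard HTTP Basic authentication. The
hosts, user names, and passwords are placeholders, and this is
only an illustration:

    import base64
    from urllib.request import Request, urlopen

    credentials = {
        "www.mysite.com/section1/": ("user1", "password1"),
        "www.mysite.com/section2/": ("user2", "password2"),
    }

    def authorized_request(url):
        headers = {}
        # Check the longest (most specific) substrings first.
        for substring, (user, password) in sorted(
                credentials.items(), key=lambda kv: -len(kv[0])):
            if substring in url:
                token = base64.b64encode(f"{user}:{password}".encode()).decode()
                headers["Authorization"] = "Basic " + token
                break
        return urlopen(Request(url, headers=headers))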
AdultCheck Sites:
Many people have asked me for information on how to access
sites that are protected by the AdultCheck Verification system,
which differs from traditional web authentication. The AdultCheck
verification system works something like the following:
- The user is requested to enter his/her AdultCheck key
- When the submit button is pressed, the key is sent to the
AdultCheck website for verification
- If the key is good, the user's browser is redirected to a
web page on the originating system (the pay area)
The key observation is that the URL to which the browser is
directed in step #3 is usually unprotected. This URL is what you
need to feed into SBWcc in order to download from the protected
site. Normally the URL will be displayed by the browser near the
top of the browser window. However, this is not always the case
-- sites with "frames" typically hide the correct URL
from you, and may require a bit of investigation on your part.
It's not my purpose to give you specific instructions on
accessing sites, but a good hint is to use your browser to enter
the site as normal, determine the URL of the pay area, and then
enter that URL into SBWcc.
Filter Options
Several filtering options are built into SBWcc. They are all
located under the 'Filter' tab. The first is a box of download
file types. These types represent Internet Mime Content Types and
loosely relate to file extensions (i.e. Text/Html = .html, .htm
and Image/Jpeg = .jpeg, .jpe, .jpg, etc). Unchecking a file type
will prevent download of those files to your hard drive.
The second filtering method is by file size. You can set both
a minimum and a maximum file size.
It should be noted that in Spider (or Page) mode, SBWcc must
always download HTML files. This is because the HTML files
contain the hyperlinks that are necessary for spidering. Thus,
the filtering options will not prevent HTML files from being
downloaded. (Unchecking the Text/Html file type will prevent HTML
files from being saved, but not transferred.)
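As an illustration of these filters (the content types and size
limits below are example values, not SBWcc's defaults), a save
decision might look like this in Python:

    allowed_types = {"text/html", "image/jpeg", "image/gif"}

    def should_save(content_type, size_bytes, min_size=1_000,
                    max_size=2_000_000):
        base_type = content_type.split(";")[0].strip().lower()
        if base_type not in allowed_types:
            return False                     # unchecked content types are not saved
        if base_type == "text/html":
            return True                      # size limits are not applied to HTML
        return min_size <= size_bytes <= max_size

    print(should_save("image/jpeg", 50_000))   # True
    print(should_save("image/bmp", 50_000))    # False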
Diagnostics
The diagnostics screen is really designed for experienced
power users, but even novices can make some use of it. Every time
a failure is encountered, it is displayed in the diagnostic
screen. There is also a detailed message listing of what's being
sent to and from your computer. If something goes wrong, you can
try to figure out what happened from this screen.
Located on the diagnostics screen is the "Q-View"
button. A "queue" is really just a fancy term for a
list, and these lists contain the files that SBWcc is going to be
downloading. By displaying the Q-Viewer, you can get a sneak peek
at the upcoming files to be downloaded and delete some of them if
you don't want them. The queues are always processed in a
specific priority: On-Site Images, Off-Site Images, On-Site
Links, and Off-Site Links.
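The queue priority can be pictured roughly as in the Python
sketch below; the queue names mirror the order given above, not
SBWcc's internal data structures:

    from collections import deque

    queues = {
        "on_site_images":  deque(),
        "off_site_images": deque(),
        "on_site_links":   deque(),
        "off_site_links":  deque(),
    }

    def next_download():
        # Always drain higher-priority queues before lower-priority ones.
        for name in ("on_site_images", "off_site_images",
                     "on_site_links", "off_site_links"):
            if queues[name]:
                return queues[name].popleft()
        return None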
Also located in the diagnostics screen is an advanced button
which will let you adjust some specific timeout values. I
recommend only messing with it if you have a good reason.
Registration
This program is distributed as Shareware, and Ad-Software
(hence the advertisements on the bottom of the page). For more
information on my registration policy, see http://www.sb-software.com/credit/
Contacting the Author:
You can reach me via email at smbaker@primenet.com
My website is http://www.sb-software.com
You can find the SBWcc page at http://www.sb-software.com/sbwcc
You can contact me via US mail at
Scott M Baker
2241 W Labriego
Tucson, Az 85741
Revision History
- Version 1.0
- Version 1.1
- Fixed problem with jpeg viewer not always
displaying correct file
- Support in thumbnail viewer for GIF files
- Added duplicate checking
- Version 1.2
- Added 'delete' to viewer (use DEL key)
- Support for authentication
- More robust error handling
- Version 1.3
- Fixed problem with assert() on systems with large
fonts
- Added note about demographic information
- Version 1.4
- Fixed delete in viewer not working
- Min/Max file size settings now ignore html files
- Fixed glitch in http error handling code
- Version 1.5
- Added more mime content types to filter section
- Included winsock error code on winsock errors
- Added host name to error log and dialog
- Robustized bmp viewer
- Added advanced error settings dialog
- Added error box timeout to advanced error
settings dialog
- Added support for relative redirects
- Parser was erroneously removing trailing slashes
- Added timeout mechanism (diagnostics, advanced
button)
- Version 1.6
- Fixed links that had a ':' in them somewhere
- Expand amp,quot,lt,gt escapes in links
- Added minimize button
- Fixed problem with wrong dialog box size on
systems with large fonts
- Added browser id setting
- Added advanced general options dialog
- Reduced adsoftware flicker problem
- Version 1.7
- Better explanation text for download path options
- Added support for BASE tags
- Added Q-View
- Better handling of redirect
- Better handling of stopped transfers
- Version 1.8
- Fixed restart delay not being saved between
sessions
- Viewer now remembers previous mode and slideshow
delay
- Viewer remembers previous position
- Added support for fake virtual addressing
- Added maximum link depth
- Added links-as-images option
- Added forward-only option
- Defaulted spider mode to spider
- Defaulted download mode to include URL
- Partial support for imagemaps
- Version 1.9
- Stopped viewer from thinking it's sortpics
- Fixed links-as-images flag
- Viewer now opens last filename automatically
- Version 2.0
- Added tools tab
- Added use bookmark command
- Added open image viewer command
- Viewer: added rename (F2)
- Fixed viewer resize problems with large fonts
- Fixed buffer overrun problem with status msg
- Version 2.1
- Transparent gif support in viewer
- fixed numerous viewer memory leaks
- New version of html parser, url, and buffer
routines
- Compiled with smalloc
- Eliminate printing of corrupted url in debug
window
- Version 2.2
- Upgrade to adsoftware V3
- Upgrade to BCB 5
- Added referer tracking; Added initial referrer to
advanced options
- Switch from malloc to MALLOC -- fixes bizarre
thumbnail bug
- Designed custom icon for SBWcc in place of
generic delphi icon
- Many updates to built in viewer (see http://www.sortpics.com)
- Version 2.3
- Added try-catch block to setfilename in viewer
- Removed adsoftware; wcc is now shareware
- Version 2.4
- Changed registration price
- Added receive log
- Better handling of invalid jpeg files in jpeg
viewer
- Version 2.5
- Fixed crash when receiving very long URLs
- Recognize & report replies that don't have status lines
- Report error #501 as Not Supported error
- Add error codes to "Unknown Error" status messages
- Fixed crash on URLs/headers that have %s in them
- Fixed crash on requests with really long headers
- Added advanced page to setup page
- Version 2.6
- Added spot to enter reg code in wait screen
- Switch to TSBSocket code
- support for https websites
- Added checkbox to control following of https from http
- Fix bogus error messages when connection failed and mime-type filtering enabled
- Change default download dir to program_dir\download
- Added button to clear received files list
- Added accept: */* header