The last week has seen some real challenges in setting up the daily crawls. Most of the difficulty comes from the following requirements:
- Capture New Content
- We are only interested in capturing new content as it hits the web. Older pages will be picked up in separate Deep Crawls.
- The snapshots need to capture new content daily (or every other day)
Wget: Not an ideal tool
The problem is that wget takes a long time to mirror a site, partly because it proceeds in a breadth-first fashion. Since the articles we're looking for are typically deep in the directory structure of the sites we're crawling, wget needs a long time to find them. This also means it scoops up a large number of pages we don't want, not to mention re-downloading pages we already have.
Additionally, many websites have auto-generated content, which causes wget to tie itself in knots (e.g. downloading each page of a calendar from now until eternity).
To combat these woes, I spent a considerable amount of time last week reading forums, advice boards and wget's own formidable man page to better understand the tool.
I discovered that wget was designed with time-stamping functionality that, in theory, allows it to download only pages that are new or have been updated since the local copy was created. Unfortunately, this only works if the server reports a modification time in the response headers (the Last-Modified header). More than 98% of the pages I tested lacked this information, so wget falls back to the time of download, which is not very useful.
On the bright side, I learned about a lot of useful options, such as the --spider flag which, when combined with the -r (recursive download) flag, has wget walk a site and log every URL it visits without actually saving any content. That log can then be turned into a file listing all the URLs on the site. This got me thinking about a possible solution to our problem.
(As a side note, I did look into tools other than wget for downloads. I found that as poorly suited as wget is to the task at hand, every other tool would be even worse.)
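As a rough illustration of turning a spider run into a URL list, here is a minimal Python sketch. It assumes the spider's log was saved to a file (for example with wget's -o option) and that each request in the log starts with a line of the form "--DATE TIME--  URL". That format is an assumption about the wget version in use, and the function name is made up for this example.

```python
import re

# Assumed log format: wget announces each request with a line such as
#   --2012-03-05 10:00:00--  http://www.example.com/news/story.html
REQUEST_LINE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")


def urls_from_spider_log(log_text):
    """Return the unique URLs requested in a wget log, in first-seen order."""
    seen = set()
    urls = []
    for line in log_text.splitlines():
        match = REQUEST_LINE.match(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            urls.append(match.group(1))
    return urls
```

If the log format differs, only the regular expression should need to change.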
Not Total Kludge...
The current plan is to:
- Run wget --spider -r on each website in our list
- Create a list of all links that point to files (excluding 'index.html')
- Compare this list to the list of files downloaded in the last crawl
- Run a non-recursive wget on each link that does not appear in the previous list
- Save output to a time-stamped directory system which preserves location information (city, state, etc.)
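The steps above could be sketched in Python roughly as follows. This is not the actual script: the function names, the "has a dot in the file name" test for what counts as a file, and the state/city/date directory layout are all assumptions made for illustration. The wget options used (-x to recreate the site's directory structure, -P to set the output directory) are standard ones.

```python
import os
from datetime import date
from urllib.parse import urlparse


def is_file_link(url):
    """True if the URL path ends in a file name other than index.html.

    Assumption: a final path segment containing a dot is treated as a file.
    """
    name = os.path.basename(urlparse(url).path)
    return bool(name) and "." in name and name != "index.html"


def new_links(spidered_urls, previous_urls):
    """File links found by the spider that were not in the last crawl."""
    previous = set(previous_urls)
    return [u for u in spidered_urls if is_file_link(u) and u not in previous]


def snapshot_dir(root, state, city, day):
    """Build a dated output directory, e.g. root/OR/Portland/2012-03-05."""
    return os.path.join(root, state, city, day.isoformat())


def wget_command(url, out_dir):
    """Non-recursive fetch of a single URL into out_dir."""
    return ["wget", "-x", "-P", out_dir, url]
```

Each command returned by wget_command could then be handed to the job scheduler, one per new link.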
I have been testing what I've written as I go along, and at this point I think I'm fairly close to a working script. There is a problem, however.
Wget Strikes Again!
In order for the approach that I have outlined to work, we will need one initial Deep Crawl from which to generate the first Blacklist of URLs. As a test, I started mirroring www.oregonlive.com on Friday. It hasn't stopped yet.
Possible solutions involve using the -R (reject) flag, which tells wget to skip files whose names match a list of suffixes or wildcard patterns.
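The same idea can be tried out on a spidered URL list before committing to a crawl. The sketch below is only a stand-in: it applies shell-style wildcard patterns to whole URLs using Python's fnmatch module, which is not necessarily how wget itself matches its reject list. The function name and the example pattern are hypothetical.

```python
from fnmatch import fnmatchcase


def drop_rejected(urls, reject_patterns):
    """Remove URLs that match any shell-style wildcard pattern."""
    return [
        url
        for url in urls
        if not any(fnmatchcase(url, pattern) for pattern in reject_patterns)
    ]
```

For instance, a pattern such as "*calendar*" would drop every URL containing the word "calendar", which is one way to see how many auto-generated pages a given pattern would exclude.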
In moments of frustration I have turned to other tasks, which I will briefly discuss now:
I made a major change to the way that jobs are scheduled in the crawling scripts I've been using. Previously, the script used qstat -j to check the job status of every wget job that had been submitted. The purpose of this was to limit the number of jobs running at any one time, but this method was causing delays of up to a minute as the list of jobs previously submitted increased to over 2000. Now, if a job has been completed, the script removes it from the list of job numbers to check, which has eliminated the delay.
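A minimal sketch of that pruning idea, in Python, might look like the following. It is not the actual crawling script. It assumes that "qstat -j <id>" exits with a non-zero status once a job no longer exists, and the function names and the max_jobs and poll_seconds parameters are invented for the example.

```python
import subprocess
import time


def job_is_running(job_id):
    """Assumption: `qstat -j <id>` exits non-zero once the job is gone."""
    result = subprocess.run(
        ["qstat", "-j", str(job_id)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def prune_finished(active_jobs, is_running=job_is_running):
    """Keep only jobs still running, so finished ones are never polled again."""
    return [job for job in active_jobs if is_running(job)]


def wait_for_slot(active_jobs, max_jobs, is_running=job_is_running, poll_seconds=5):
    """Block until fewer than max_jobs are running; return the pruned list."""
    active_jobs = prune_finished(active_jobs, is_running)
    while len(active_jobs) >= max_jobs:
        time.sleep(poll_seconds)
        active_jobs = prune_finished(active_jobs, is_running)
    return active_jobs
```

Because the returned list replaces the old one, the cost of each check depends on how many jobs are still running, not on how many have ever been submitted.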
I have started migrating the language crawls I did for Ann earlier this year away from the old scripts that have been managing them to the new, object-oriented scripts. The new scripts aren't working just yet, but they're getting there. The purpose of this is to make it possible for someone down the road to use my scripts as a black-box crawling utility. Currently, they would have to go in and edit numerous lines of code in the main script.
In the vain hope that they will be useful, I have been maintaining the daily snapshots that are currently running. I wonder, though, if I should begin removing files we probably aren't going to use.