Good News and Bad News
The crawls have been running for about a week now, and it's time to look at some of the data we've been gathering.
The Good News
The crawls seem to be running smoothly. The scripts are working, and all the output is getting to the right places. A few of the qsub output files showed segfault errors, but this is consistent with past runs, and the errors appear to occur at random.
Additionally, judging from the sites I've examined, storing one file per article looks reasonable for most of them. However...
The Not So Good News
After poking through a bunch of crawled papers, it seems that the terminal branches of most directory trees contain only an index.html file. Suspecting that wget had run out of time, I re-ran the crawl on one site without a time limit, and indeed many more pages showed up. I will increase the crawl time a little, but we may have to switch to crawling every other day.
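As a quick sanity check, truncated branches like these can be counted by looking for directories whose only entry is an index.html. This is just a sketch: the crawl/ directory and the site layout below are stand-ins fabricated for demonstration, not the real crawl output.

```shell
# Build a tiny fake crawl tree: one truncated branch, one with real content.
mkdir -p crawl/site/a crawl/site/b
touch crawl/site/a/index.html                          # truncated branch
touch crawl/site/b/index.html crawl/site/b/story1.html # branch with an article

# For every directory, test whether its only entry is index.html;
# count the directories that pass. Arithmetic expansion trims wc's padding.
truncated=$(( $(find crawl -type d \
    -exec sh -c '[ "$(ls -A "$1")" = "index.html" ]' _ {} \; -print | wc -l) ))
echo "truncated branches: $truncated"                  # prints "truncated branches: 1"
```

Running this over the real crawl output would give a rough measure of how many branches were cut off before the articles were reached.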
One problem is that, for most newspapers, the actual articles sit at the furthest ends of the directory tree. That means nearly all the other pages, superfluous to us, must be downloaded before we get any of the content we're after.
Another problem is that I'm still not 100% certain how all the wget flags work. When we started, I copied the command from a source that Ann gave me. A better understanding of what the --level flag does should help, and using -R to reject pages we don't need may speed things up.
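For reference, --level (-l) caps how many links deep the recursion goes (wget's default is 5), and -R/--reject skips files whose names match the given suffixes or patterns. A sketch of what the command might look like: the URL, reject list, and depth here are made-up examples, not our actual crawl settings.

```shell
# Hypothetical crawl command -- URL, reject list, and depth are placeholders.
# --recursive : follow links from the start page
# --level=5   : recurse at most 5 links deep (also wget's default)
# --reject    : skip files with these suffixes (media we don't need)
# --no-parent : never ascend above the starting directory
# --wait=1    : pause one second between requests, to be polite
wget --recursive --level=5 --reject 'gif,jpg,jpeg,png,pdf' \
     --no-parent --wait=1 http://www.example-paper.com/news/
```

One caveat worth knowing: HTML pages matching the reject list may still be downloaded so their links can be followed, and are deleted afterward, so -R mostly saves time on non-HTML files.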
I'm going to be out of town next week, and then the following week I'll be back in Baltimore. When I get back, I aim to focus my attention on the following goals:
- Play with wget flags so that I really understand what they do
- Refine crawl times so that wget can run long enough to get the pages that we want
- Start adapting the cleaning script I've already written to clean up some of the articles we've crawled