Sunday, June 23, 2013

Looking at the Data So Far

Good News and Bad News

The crawls have been running for about a week now, and it's time to look at some of the data we've been gathering.

The Good News

The crawls seem to be running smoothly. The scripts are working, and all the output is getting to the right places. Some of the qsub output files showed segfault errors, but this is consistent with what has happened in the past, and they occur seemingly at random.

Additionally, from the sites I've looked at, it seems that having one file per article would be reasonable for most sites. However...

The Not So Good News

After poking through a bunch of crawled papers, it seems that the deepest directory in most crawls contains only an index.html file. Suspecting that wget might have run out of time, I re-ran the crawl on one site without a time limit, and indeed many more pages showed up. I will increase the crawl time a little, but we may have to switch to crawling every other day.
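If the cutoff really is the batch scheduler's wall-clock limit rather than anything in wget itself, bumping it should be a one-line change in the qsub submission script. This is only a guess at what that looks like; our actual scripts, queue, and limits may differ:

```shell
# Hypothetical PBS directive in the crawl's qsub script: raise the
# wall-clock limit (here, to four hours) so wget can reach the deeper
# article pages before the job is killed.
#PBS -l walltime=04:00:00
```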

One problem is that, for most newspapers, the actual articles are located at the furthest ends of the directory tree. That means that nearly all the other pages, superfluous to us, must be downloaded before we get any of the content we're after.

Another problem is that I'm still not 100% certain how all the wget flags work. When we started, I copied the command from a source that Ann gave me. I think a better understanding of what the --level flag does will help, and using -R to reject some page types may speed things up.
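As a starting point for that experimenting, here's a sketch of how those two flags fit into a recursive crawl. The depth, reject patterns, and URL are all illustrative guesses, not the settings we actually run with:

```shell
# Illustrative only: depth, reject list, and URL are placeholders.
LEVEL=8                          # with -r, wget follows links at most this
                                 # many hops from the start page; the default
                                 # is 5, which could explain the missing
                                 # deep article pages
REJECT='*.jpg,*.gif,*.css,*.js'  # skip assets; note wget still fetches
                                 # rejected HTML pages to harvest their
                                 # links, then deletes them afterward
CMD="wget -r --level=$LEVEL -R '$REJECT' http://www.example.com/"
echo "$CMD"
```

One caveat worth knowing before timing anything: because rejected HTML is still downloaded and parsed for links, -R saves disk space and some bandwidth, but it won't skip the traversal itself.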

Moving Forward

I'm going to be out of town next week, and then the following week I'll be back in Baltimore. When I get back, I aim to focus my attention on the following goals:

  1. Play with wget flags so that I really understand what they do
  2. Refine crawl times so that wget can run long enough to get the pages that we want
  3. Start adapting the cleaning script I've already written to clean up some of the articles we've crawled

1 comment:

  1. The bad doesn't sound too bad. Hopefully you can get the crawling sorted out fairly quickly when you return to campus so that we will have some data to play with sooner rather than later.

    See you in Baltimore next week! I'll be in the office all day next Tuesday, so come anytime and we'll get you set up with a desk.