Tuesday, July 23, 2013

Update: Counting

Good News and Bad News

So far the page counting has been mostly successful. Mostly.

The Good News

As mentioned in my previous post, the script that counts new pages removes duplicate files. The good news is that when I looked through the “new” files from each crawl, about 80% of them looked like articles. That means that with minimal screening we should be able to start extracting data.

The Bad News

The bad news is that in testing this script, I accidentally removed the contents of all of the snapshots from before July 11th, and also the one from the 13th. To be fair, this amounted to only five crawls, and they were done with the previous version of the crawling script. That means they hadn’t run for as long, and so had far fewer pages than the newer crawls. I felt stupid, but I don’t think it’ll hurt us too much.

On the Bright Side

The database of all previously downloaded pages is a flat TSV file that contains full relative paths to each of the pages it indexes. Each line has the following format:

[Base website URL]    [Page path]    [Website Path (relative)]

So, for example, one line might look like this:

www.baltimoresun.com    www.baltimoresun.com/classified/realestate/whosonmyblock/index.html    7_17_2013_npsnap/MD/Baltimore/

A simple query to this database would yield full paths for every file downloaded (combining the second and third fields, using the first field as a key). A script run from the directory containing all the crawls could easily retrieve all or selected pages for processing.
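As a rough illustration, here is a minimal Python sketch of that kind of query. It is not the actual script; it assumes the fields are separated by real tab characters and that a page's full relative path is the third field followed by the second. The function name load_index is made up for this example.

```python
import csv
import io
import posixpath
from collections import defaultdict


def load_index(tsv_file):
    """Map each base site (field 1) to the full relative paths of its pages.

    Assumption: full path = third field (snapshot directory) joined with
    the second field (page path). The index uses forward slashes, so
    posixpath is used for joining regardless of platform.
    """
    paths_by_site = defaultdict(list)
    # QUOTE_NONE so quote characters in a path are treated literally.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        site, page_path, snapshot_dir = (field.strip() for field in row)
        paths_by_site[site].append(posixpath.join(snapshot_dir, page_path))
    return paths_by_site


# Usage with the example line from above, held in memory instead of a file.
sample = (
    "www.baltimoresun.com\t"
    "www.baltimoresun.com/classified/realestate/whosonmyblock/index.html\t"
    "7_17_2013_npsnap/MD/Baltimore/\n"
)
index = load_index(io.StringIO(sample))
for path in index["www.baltimoresun.com"]:
    print(path)
```

With a real index, the io.StringIO object would be replaced by an open file handle, and the script would be run from the directory that holds the crawl snapshots so the printed paths resolve.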

Once I’ve finished indexing the crawls, I’m going to start extracting pages. In the meantime, I’m learning about machine learning and Alchemy.

1 comment:

  1. I have had similar "whoops" moments before. Luckily it should be reasonably easy to re-create the data. You'll miss the surge in stories about Trayvon Martin and the royal birth, but there will be other news events to check out.