Good News and Bad News
So far the page counting has been mostly successful. Mostly.
The Good News
As mentioned in my previous post, the script that counts new pages removes duplicate files. The good news is that when I looked through the “new” files from each crawl, about 80% of them looked like articles. That means that with minimal screening we should be able to start extracting data.
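The previous post has the details of the counting script; for readers who haven’t seen it, a minimal sketch of content-based duplicate removal might look like the following. The function name and the use of SHA-256 hashing are my own choices for illustration, not necessarily what the actual script does.

```python
import hashlib
from pathlib import Path

def new_unique_files(crawl_dir, seen_hashes):
    """Yield files in crawl_dir whose contents haven't been seen before.

    seen_hashes is a set of content digests carried over from earlier
    crawls, so duplicates are detected across crawls, not just within one.
    """
    for path in sorted(Path(crawl_dir).rglob("*")):
        if not path.is_file():
            continue
        # Hash the file contents; identical pages collide regardless of name.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            yield path
```

Carrying `seen_hashes` forward from crawl to crawl is what makes a file “new”: it only shows up the first time its contents appear anywhere.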
The Bad News
The bad news is that while testing this script, I accidentally deleted the contents of all the snapshots from before July 11th, along with the one from the 13th. Fortunately, that amounted to only five crawls, all made with the previous version of the crawling script; they hadn’t run as long, and so contained far fewer pages than the newer crawls. I felt stupid, but I don’t think it’ll hurt us too much.
On the Bright Side
The database of all previously downloaded pages is a flat TSV file that records the full relative path of every page it indexes, with each line split into tab-separated fields.
A simple query to this database would yield full paths for every file downloaded (combining the second and third fields using the first field as a key). A script run from the directory containing all the crawls could easily retrieve all or selected pages for processing.
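That query could be sketched roughly as follows. I’m guessing at the schema here: I assume each line is (crawl id, subdirectory, filename), with the crawl id keying into a mapping from crawl ids to crawl root directories — the index’s actual field layout may differ.

```python
import csv
from pathlib import Path

def full_paths(index_tsv, crawl_dirs):
    """Yield the full path of every page listed in the index.

    index_tsv is the flat TSV database; crawl_dirs maps each crawl id
    (assumed to be the first field) to that crawl's root directory.
    The second and third fields are joined onto that root.
    """
    with open(index_tsv, newline="") as f:
        for crawl_id, subdir, filename in csv.reader(f, delimiter="\t"):
            yield Path(crawl_dirs[crawl_id]) / subdir / filename
```

Filtering to selected pages is then just a matter of adding a condition inside the loop (say, on the crawl id or the filename) before yielding.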
Once I’ve finished indexing the crawls, I’m going to start extracting pages. In the meantime, I’m learning about machine learning and Alchemy.