I spent a few hours this weekend going through Python tutorials on Codecademy. While most of the stuff was old hat to me, it was useful to get a sense of what other people consider good style. I also learned a few neat tricks, such as list slicing and lambda functions. Anyway, on with the rest of the post.
I spent much of last week developing a script that counts the number of new pages in each crawl compared to previous crawls of the same website. This approach is our answer to the mirroring problem discussed previously. Since it seems impractical to extract new content by mirroring each site, we will continue crawling every other day, and from those crawls we will create a list of files that appear in the newest crawl but not in previous ones. We assume that, given crawls of sufficient depth, this method will yield mostly new articles.
So far, the results are encouraging: we seem to be gathering several hundred new URLs per site between crawls. Additionally, the script that counts new pages also removes redundant files to save disk space. A TSV file keeps a record of every URL the counting script has seen, and new crawls are compared against this database.
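The comparison against the TSV record can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the two-column layout (URL, crawl label), the function names `load_seen` and `record_new`, and the crawl labels are all hypothetical, and the removal of redundant files is left out.

```python
import csv
from pathlib import Path


def load_seen(db_path):
    """Return the set of URLs recorded in the TSV file (empty if it does not exist)."""
    path = Path(db_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        # Assumed layout: the URL is the first column of each row.
        return {row[0] for row in csv.reader(f, delimiter="\t") if row}


def record_new(crawl_urls, db_path, crawl_id):
    """Append URLs not seen before to the TSV file and return them in crawl order."""
    seen = load_seen(db_path)
    new_urls = []
    for url in crawl_urls:
        if url not in seen:
            seen.add(url)  # also skips duplicates within the same crawl
            new_urls.append(url)
    with open(db_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for url in new_urls:
            # Hypothetical second column: which crawl first saw the URL.
            writer.writerow([url, crawl_id])
    return new_urls
```

Calling `record_new(urls, "seen.tsv", "crawl-01")` once per crawl would return only the URLs that no earlier crawl had recorded, and the length of that list is the per-crawl count of new pages.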
Now that it seems that our crawls are yielding usable data, we can start extracting the text. To do this, I'll be learning how to make calls to the Alchemy API using Python.
I will also begin teaching myself about machine learning. In the next few days, I plan to buy Hilary Mason's 'An Introduction to Machine Learning with Web Data'. That should keep me busy for a while.