Last week I got a little frustrated with the classifying script, so I spent some time updating some old crawling programs. Having finished that, I got back to work on classification today, with some encouraging results. Progress has been slowed by some hardware problems which arose as a result of a recent grid upgrade.
Updating old Scripts
For a while now, I had been meaning to update the scripts that I have been using to run the language crawls which were my first project when I started working with Ann. The scripts are not object-oriented, and there were programs submitting multiple jobs to the grid where in turn submitted other jobs... it was a bit of a mess.
Happily, the new crawling script uses object-oriented techniques to make it easy to customize. A python driver script imports the crawling script, Crawl.py, and creates a crawl object with certain parameters (output directory name, number of wget jobs to run at a time, the total amount of time the crawl should run for, etc.). A setup() method creates a file system for the crawl data and crawl() begins the process of submitting wget jobs to download the pages. Only a certain maximum number of wget jobs run at any given time, and a logfile keeps track of what URLs are being crawled and any errors that occur with timestamped entries.
The main advantage to the new script is that someone down the line who might want to run their crawls will have a much easier time implementing what I've written. A number of options can be specified when the crawl instance is created in the driver script. Functions and variables in Crawl.py itself are easy to find (e.g. in __init__() or labeled appropriately).
Today I started looking the classification script again. Last week, I became frustrated when the classifier seemed to be biased towards classifying everything as an example of gun violence. With Ann's help, I now have a better idea of what my training data should look like, and I'll soon be looking at refining my features list.
Preparing the Data
After some confusion over what exactly constitutes a "training instance", I began preparing my training data. First, I crawled a bunch of pages that I knew were about gun violence (see my last post). I then used an old cleaning script to strip away the html tags. Finally, I eliminated lines that contained more than 22% non-word characters.
"Non-word characters" included digits [0-9] and other non-letter characters (e.g. "|","_","-", [tab],[newline], etc.). It turns out most of the articles I wanted to keep had ratio of between 18:100 and 22:100 of these characters compared to total number of characters - a number I determined through trial and error given a fairly large set of sample data (more than 600,000 words).
I ended up with pretty clean data: long strings of text on a lines, with some shorter strings on their own lines, but very little of the ubiquitous web boilerplate (banners, nav-panel text, etc.) Since the Naive-Bayes classifier I'm using counts each newline as an instance, this data was perfect for the "guns" category of training data.
I wanted to gather some non-gun-related data using the same method, but due to some hardware problems (the login nodes weren't automatically mounting the /export drives), I couldn't perform any crawls today. I did, however compile a list of pages that contained no gun-related text - mostly articles from Wikipedia and the Stanford Encyclopedia of Philosophy. I'll crawl these later.
Instead, I took the "arts" and "sports" data that Hilary Mason used for her binary classification and concatenated them into one file of about 100 lines (instances). This then became the "notguns" training data.
Even though the "guns" category had over 20000 training instances, while the "notguns" category had only 198, the classifier did a petty good job.
Using a similar technique of text extraction as I used with the the "guns" training data, I pulled 19 random articles from one of the more recent newspaper snapshots. After manually determining that none of these pertained to gun violence, I removed one article from the "guns" training data and added it to the testing instances.
After training from the "guns" and "notguns" data, I ran the classifier on the testing data and did a simple comparision of the magnitude of the category probabilities. I had the script write the articles to a "guns" file and a "notguns" file depending on the classification. Out of the 20 articles, the classifier successfully identified the single gun-related article, and returned one false positive, classifying the other 18 as "notguns". Here's what the "guns" file looked like (NB: the category probability at the bottom of each paragraph):
01/08/2013 04:08:42 PM MSTLogan County Commissioners Gene Meisner, left, and Rocky Samber were sworn into office by Judge Michael Singer, at the Justice Center on Tuesday. (Callie Jones/Journal-Advocate) STERLING — The new Board of Logan County Commissioners held its first regular meeting Tuesday with newly elected commissioners Gene Meisner and Rocky Samber. Prior to the meeting, both took part in a swearing-in ceremony with other officials, conducted by Chief District Judge Michael Singer, at the Justice Center.
Notice that that the difference between the probabilities for the false positive (second) instance is roughly 2, whereas for for the true positive it is around 10^9. A slightly more sophisticated comparison (and more data) will hopefully yield a more accurate result.
Interestingly, the false-positive does have a lot of police-related language. I think it will be challenging to discriminate between articles that are gun-related and merely police-related, since most articles that are gun-related are also police-related, but not vice versa.
In the remaining two weeks before I head off to become an RA, I'm going to try to improve the classification algorithm I'm currently using (Naive Bayes), and also explore some other possible classification schemes. I will also try to set up a system whereby articles that are downloaded each day are automatically classified - a somewhat ambitious goal, but I think I can manage it if I don't get bogged down too much with other things.