Preliminary Results Look Good
Using an adapted version of Hilary Mason's script, I've started experimenting with classification.
I ran a crawl last week of pages in Slate.com's database of articles about people who have been killed by guns since Newtown. Today, I used my old text-extraction script to scrape the text from those pages for use as training data. This yielded 1,548,856 words of gun-related text.
I picked a fairly tricky sentence from one of the gun articles to classify: "Williams said that although she didn’t know Shuford well, he was friends with her son. Detectives do not know of a motive in the crimes".
At first, I copied the text of a few gun-related articles into a file called "guns" that was about the same size as the training data for my other two categories, "arts" and "sports" (provided by Hilary Mason). The classifier gave "guns" the highest probability, but only within an order of magnitude of the other categories.
I then dumped all of the training data that I had collected into "guns" and ran the classifier again. This time, the test sentence was categorized into "guns" with a probability almost three orders of magnitude greater than the next closest probability (see below). Note: this classifier uses Porter stemming.