Since my last post I have tested the Naive Bayes classifier over a number of thresholds and plotted the results. I also wrote a Perceptron classifier which I have yet to test on a large sample.
Progress with Naive Bayes
In my previous post, I described how I set up the NB classifier and gathered data to test it. I used a simple less than / greater than comparison of the probabilities to categorize instances as either "guns" or "notguns".
Since then, I have switched to an order-of-magnitude comparison. I noticed that actual gun-related instances had a much higher p(guns) than the false positives, so I set a threshold that the ratio p(guns) / p(notguns) must exceed for an instance to be classified as "guns". The chart below shows how precision and recall vary as the classification threshold changes from 10^.5 to 10^9.5 in increments of 10^5.
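Here is a minimal sketch of the ratio test and the precision/recall bookkeeping, assuming the NB classifier hands back p(guns) and p(notguns) for each instance. The function names, the sample probabilities, and the three threshold exponents are made up for illustration; they are not my real data. The comparison is done on log10 values, which is equivalent to comparing the ratio against a power of ten and avoids dividing very small numbers.

```python
import math

def classify(p_guns, p_notguns, threshold_exp):
    """Return "guns" if p_guns / p_notguns exceeds 10**threshold_exp."""
    # log10(p_guns / p_notguns) > threshold_exp, computed without the division
    log_ratio = math.log10(p_guns) - math.log10(p_notguns)
    return "guns" if log_ratio > threshold_exp else "notguns"

def precision_recall(predicted, actual, positive="guns"):
    """Compute (precision, recall) for the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical (p_guns, p_notguns, true label) triples, illustration only
samples = [
    (1e-10, 1e-20, "guns"),
    (1e-12, 1e-14, "guns"),
    (1e-15, 1e-16, "notguns"),
    (1e-18, 1e-15, "notguns"),
]
actual = [label for _, _, label in samples]

# Sweep a few threshold exponents and watch precision trade off against recall
for threshold_exp in (0.5, 2.5, 4.5):
    predicted = [classify(pg, pn, threshold_exp) for pg, pn, _ in samples]
    print(threshold_exp, precision_recall(predicted, actual))
```

On this toy data, raising the threshold drops the false positive first and then starts dropping true positives, which is the same trade-off the chart shows.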
The optimal threshold depends, of course, on whether we care more about recall or precision, or a balance thereof. If we're just using the classifier as a filter for Mechanical Turk submissions, then having high recall might be a priority.
Ann suggested that before I move on to some black-box ML packages, I get my feet wet with some other linear classifiers besides Naive Bayes. Out of her suggestions, I decided to write a perceptron.
I borrowed the basic code of the perceptron from the Wikipedia page on perceptrons. Since I'm classifying text, however, the simple version wouldn't work, as it was designed to deal with numeric vectors.
I had already converted text into numeric data in the NB classifier by creating a hash table of the features and their frequencies in the training data. I employed a similar technique for the perceptron by creating a hash table of all the features in the feature space and their weights. By iterating through all of the training data, I set the weights for each feature in the feature space.
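The idea can be sketched roughly as below. This is not my actual code, just a minimal illustration under some assumptions: features are lowercase whitespace-separated tokens, labels are +1 for "guns" and -1 for "notguns", and the update is the standard perceptron rule. The names featurize, score, train, and predict, and the four training sentences, are hypothetical.

```python
from collections import defaultdict

def featurize(text):
    """Map text to a token -> count table (an assumed feature scheme)."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return counts

def score(weights, bias, features):
    """Weighted sum over the features present in this instance."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in features.items())

def predict(weights, bias, text):
    return 1 if score(weights, bias, featurize(text)) > 0 else -1

def train(examples, epochs=10, learning_rate=1.0):
    """examples: list of (text, label) pairs with label +1 or -1."""
    weights = {}  # hash table: feature -> weight
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = featurize(text)
            guess = 1 if score(weights, bias, features) > 0 else -1
            if guess != label:
                # Standard perceptron update: nudge weights toward the label
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + learning_rate * label * v
                bias += learning_rate * label
    return weights, bias

# Made-up training instances, illustration only
examples = [
    ("shots fired near the school", 1),
    ("man shot with a handgun", 1),
    ("city council passes budget", -1),
    ("school budget vote delayed", -1),
]
weights, bias = train(examples)
print([predict(weights, bias, text) for text, _ in examples])
```

Only the features that actually occur in a misclassified instance get touched on each update, so the table stays as sparse as the training text.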
One thing to note is that the perceptron can't say anything about features of test data that aren't in the feature space of the training data. NB has the same limitation, but in the perceptron this behavior has to be explicitly coded in, or else an error is raised. I just thought it was interesting.
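To make the difference concrete, here is a small sketch assuming the weights live in a plain Python dict. The weight values and feature names are invented for the example. Indexing the table directly fails on an unseen feature, while giving unseen features a weight of 0 lets the known features decide.

```python
# Hypothetical trained weights, illustration only
weights = {"gun": 2.0, "shot": 1.5, "budget": -1.0}

def score_strict(weights, features):
    # Direct indexing raises KeyError for any feature absent from training
    return sum(weights[f] * count for f, count in features.items())

def score_lenient(weights, features):
    # Unseen features contribute nothing: their weight is treated as 0
    return sum(weights.get(f, 0.0) * count for f, count in features.items())

test_features = {"gun": 1, "rifle": 1}  # "rifle" never appeared in training

try:
    score_strict(weights, test_features)
except KeyError as missing:
    print("unseen feature:", missing)

print(score_lenient(weights, test_features))
```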
Testing the perceptron is more difficult than testing the NB classifier, since the training data needs to be annotated. I haven't had a chance to do this in an automated fashion yet, so I only have around 130 training instances. Nevertheless, running against 'wild' data, this nascent perceptron has very high recall and moderate precision. I will post graphs once I play with the threshold and learning rate a little more.
You can find a version of my perceptron code here