Monday, August 12, 2013

Data

It's not just about getting the data...

I spent today manipulating data. In order to create separate testing and evaluation sets, I split the data I have into chunks, to be mixed and matched to create datasets without any overlapping instances.
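To make the chunking concrete, here is a minimal sketch of the sort of split I have in mind (the file name and chunk counts are made up for illustration; my actual script differs in the details):

```python
import random

def split_into_chunks(instances, n_chunks, seed=0):
    """Shuffle the instances and divide them into n_chunks disjoint chunks."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_chunks] for i in range(n_chunks)]

# Illustrative usage: one instance per line; the file name is made up.
with open("guns_instances.txt") as f:
    # Deduplicate so the same text can't end up in more than one chunk.
    instances = list(dict.fromkeys(line.strip() for line in f if line.strip()))

chunks = split_into_chunks(instances, n_chunks=5)
train = chunks[0] + chunks[1] + chunks[2]      # mix and match chunks...
evaluate = chunks[3] + chunks[4]               # ...into disjoint datasets
assert not set(train) & set(evaluate)          # no overlapping instances
```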

Strangely, I have far more purely "guns" articles for training and evaluation than purely "notguns" articles. Even though gun-related articles are far less common than non-gun-related ones, any given webcrawl may contain some gun-related articles, which means that every crawled article would have to be manually categorized before I could treat it as "notguns" data. The only alternative is to collect a list of pages I can be sure do not contain gun-related content. Early on, I found a page that indexed a large number of gun-related articles, which allowed me to collect a good deal of categorized "guns" data without having to annotate it by hand.

I have begun to compile such a list of pages, but there have been some problems with the grid lately, so I've had trouble running crawls and processing data in general. My hope is that once these technical difficulties subside, I'll have a chance to collect more data and proceed with feature analysis.

Thursday, August 8, 2013

Classification pt3.

Since my last post I have tested the Naive Bayes classifier over a number of thresholds and plotted the results. I also wrote a Perceptron classifier which I have yet to test on a large sample.

Progress with Naive Bayes

In my previous post, I described how I set up the NB classifier and gathered data to test it. I used a simple less than / greater than comparison of the probabilities to categorize instances as either "guns" or "notguns".

Since then, I have used an order-of-magnitude comparison to categorize instances. I noticed that genuinely gun-related instances had a much higher p(guns) than the false positives, so I introduced a threshold: p(guns) / p(notguns) must exceed a given order of magnitude for an instance to be classified as "guns". The chart below shows how precision and recall vary as the classification threshold changes from 10^0.5 to 10^9.5 in increments of 10^5.

The point labels indicate the order of magnitude of the threshold. A threshold of 10^5.5 seems to yield optimal results for this sample size. Some of the labels overlap, because R hates me, but the gist is here.

The optimal threshold depends, of course, on whether we care more about recall or precision, or a balance thereof. If we're just using the classifier as a filter for Mechanical Turk submissions, then having high recall might be a priority.
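To make the thresholding concrete, here is a minimal sketch of the comparison done in log space, plus the precision/recall bookkeeping behind a sweep like the one charted above. The names are mine, not the script's; p_guns and p_notguns stand in for the per-instance category probabilities the classifier reports.

```python
import math

def classify_with_threshold(p_guns, p_notguns, threshold_exponent):
    """
    Label an instance "guns" only if p(guns) exceeds p(notguns) by at
    least `threshold_exponent` orders of magnitude. Working in log space
    avoids underflow with the tiny probabilities NB produces.
    """
    log_ratio = math.log10(p_guns) - math.log10(p_notguns)
    return "guns" if log_ratio > threshold_exponent else "notguns"

def precision_recall(predictions, gold_labels, positive="guns"):
    """Precision and recall for the positive ("guns") class."""
    tp = sum(1 for p, g in zip(predictions, gold_labels) if p == g == positive)
    fp = sum(1 for p, g in zip(predictions, gold_labels) if p == positive != g)
    fn = sum(1 for p, g in zip(predictions, gold_labels) if g == positive != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```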

Perceptron

Ann suggested that before I move on to some black-box ML packages, I get my feet wet with some other linear classifiers besides Naive Bayes. Out of her suggestions, I decided to write a perceptron.

I borrowed the basic code of the perceptron from the Wikipedia page on perceptrons. Since I'm classifying text, however, the simple version wouldn't work, as it was designed to deal with numeric vectors.

I had already converted text into numeric data in the NB classifier by creating a hash table of the features and their frequencies in the training data. I employed a similar technique for the perceptron, creating a hash table of all the features in the feature space and their weights. By iterating through all of the training data, I set the weights for each feature in the feature space.

One thing to note is that the perceptron can't say anything about features of test data that aren't in the feature space of the training data. NB has the same limitation, but in the perceptron this behavior has to be explicitly coded in, or else an error is raised. I just thought it was interesting.
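For the curious, here is a minimal sketch of a perceptron over a hash table of feature weights, roughly in the spirit of what I described. The bag-of-words featurizer and the +1/-1 label encoding are my simplifications, not necessarily what my actual code does; note how unseen features are simply skipped with .get() instead of raising an error.

```python
def featurize(text):
    """Bag-of-words features: token -> count (illustrative only)."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def train_perceptron(labeled_texts, epochs=10, learning_rate=1.0):
    """labeled_texts: list of (text, label) pairs, label +1 for "guns",
    -1 for "notguns". Returns a dict of feature weights and a bias."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in labeled_texts:
            features = featurize(text)
            activation = bias + sum(weights.get(f, 0.0) * v
                                    for f, v in features.items())
            predicted = 1 if activation > 0 else -1
            if predicted != label:
                # Standard perceptron update on misclassified instances.
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + learning_rate * label * v
                bias += learning_rate * label
    return weights, bias

def predict(weights, bias, text):
    # Features never seen in training contribute nothing, via .get().
    features = featurize(text)
    score = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
    return "guns" if score > 0 else "notguns"
```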

Preliminary Results

Testing the perceptron is more difficult than testing the NB classifier, since the training data needs to be annotated. I haven't had a chance to do this in an automated fashion yet, so I only have around 130 training instances. Nevertheless, running against 'wild' data, this nascent perceptron has very high recall and moderate precision. I will post graphs once I play with the threshold and learning rate a little more.

You can find a version of my perceptron code here.

Tuesday, August 6, 2013

Classification pt. 2

Last week I got a little frustrated with the classifying script, so I spent some time updating some old crawling programs. Having finished that, I got back to work on classification today, with some encouraging results. Progress has been slowed by some hardware problems which arose as a result of a recent grid upgrade.

Updating old Scripts

For a while now, I had been meaning to update the scripts I have been using to run the language crawls, which were my first project when I started working with Ann. The old scripts were not object-oriented, and there were programs submitting multiple jobs to the grid which in turn submitted other jobs... it was a bit of a mess.

Happily, the new crawling script uses object-oriented techniques to make it easy to customize. A python driver script imports the crawling script, Crawl.py, and creates a crawl object with certain parameters (output directory name, number of wget jobs to run at a time, the total amount of time the crawl should run for, etc.). A setup() method creates a file system for the crawl data and crawl() begins the process of submitting wget jobs to download the pages. Only a certain maximum number of wget jobs run at any given time, and a logfile keeps track of what URLs are being crawled and any errors that occur with timestamped entries.
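A driver script ends up looking something like the sketch below. Only setup() and crawl() are the real method names; the class name and the parameter names are illustrative stand-ins for the kinds of options the crawl object takes.

```python
# driver.py -- illustrative sketch only; parameter names are stand-ins,
# not the actual signature of the crawl object in Crawl.py.
from Crawl import Crawl  # assumes the crawling class is named Crawl

crawl = Crawl(
    output_dir="crawl_2013-08-06",  # name of the output directory
    max_wget_jobs=10,               # number of wget jobs to run at a time
    total_hours=12,                 # total amount of time the crawl should run
)

crawl.setup()  # create the file system for the crawl data
crawl.crawl()  # start submitting wget jobs; URLs and errors go to the timestamped logfile
```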

The main advantage of the new script is that someone down the line who wants to run their own crawls will have a much easier time building on what I've written. A number of options can be specified when the crawl instance is created in the driver script, and the functions and variables in Crawl.py itself are easy to find (e.g. in __init__() or labeled appropriately).

Classification

Today I started looking at the classification script again. Last week, I became frustrated when the classifier seemed to be biased towards classifying everything as an example of gun violence. With Ann's help, I now have a better idea of what my training data should look like, and I'll soon be looking at refining my feature list.

Preparing the Data

After some confusion over what exactly constitutes a "training instance", I began preparing my training data. First, I crawled a bunch of pages that I knew were about gun violence (see my last post). I then used an old cleaning script to strip away the html tags. Finally, I eliminated lines that contained more than 22% non-word characters.

"Non-word characters" included digits [0-9] and other non-letter characters (e.g. "|","_","-", [tab],[newline], etc.). It turns out most of the articles I wanted to keep had ratio of between 18:100 and 22:100 of these characters compared to total number of characters - a number I determined through trial and error given a fairly large set of sample data (more than 600,000 words).

I ended up with pretty clean data: long strings of text, each on its own line, along with some shorter strings on lines of their own, but very little of the ubiquitous web boilerplate (banners, nav-panel text, etc.). Since the Naive Bayes classifier I'm using treats each line as an instance, this data was perfect for the "guns" category of training data.

I wanted to gather some non-gun-related data using the same method, but due to some hardware problems (the login nodes weren't automatically mounting the /export drives), I couldn't perform any crawls today. I did, however, compile a list of pages that contain no gun-related text - mostly articles from Wikipedia and the Stanford Encyclopedia of Philosophy. I'll crawl these later.

Instead, I took the "arts" and "sports" data that Hilary Mason used for her binary classification and concatenated them into one file of about 100 lines (instances). This then became the "notguns" training data.

Results

Even though the "guns" category had over 20,000 training instances and the "notguns" category had only 198, the classifier did a pretty good job.

Using a text-extraction technique similar to the one I used for the "guns" training data, I pulled 19 random articles from one of the more recent newspaper snapshots. After manually determining that none of these pertained to gun violence, I removed one article from the "guns" training data and added it to the testing instances.

After training on the "guns" and "notguns" data, I ran the classifier on the testing data and did a simple comparison of the magnitude of the category probabilities. I had the script write the articles to a "guns" file and a "notguns" file depending on the classification. Out of the 20 articles, the classifier correctly identified the single gun-related article and returned one false positive, classifying the other 18 as "notguns". Here's what the "guns" file looked like (NB: the category probabilities at the bottom of each paragraph):

(01/03/13) - A Flint teenager has turned himself in, saying he accidentally shot and killed his best friend on New Year's Day. The victim's mother identified him as 15-year-old Gianni Herron. He was found shot in the basement of a home in the 1700 block of North Chevrolet, on the city's northwest side. We are not identifying the alleged shooter, because he is 16 and not charged. He confessed during a news conference, called by Flint pastors, Thursday afternoon. His family members and police were there too.
*pguns: 3.5784633995e-71
*pnotguns: 4.40449325276e-80

01/08/2013 04:08:42 PM MSTLogan County Commissioners Gene Meisner, left, and Rocky Samber were sworn into office by Judge Michael Singer, at the Justice Center on Tuesday. (Callie Jones/Journal-Advocate) STERLING — The new Board of Logan County Commissioners held its first regular meeting Tuesday with newly elected commissioners Gene Meisner and Rocky Samber. Prior to the meeting, both took part in a swearing-in ceremony with other officials, conducted by Chief District Judge Michael Singer, at the Justice Center.
*pguns: 4.47536425751e-62
*pnotguns: 2.47668276083e-62

Notice that the ratio of the probabilities for the false positive (second) instance is roughly 2, whereas for the true positive it is around 10^9. A slightly more sophisticated comparison (and more data) will hopefully yield a more accurate result.
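Working that out from the numbers printed above:

```python
false_positive_ratio = 4.47536425751e-62 / 2.47668276083e-62  # ~1.8, i.e. roughly 2
true_positive_ratio  = 3.5784633995e-71 / 4.40449325276e-80   # ~8.1e8, i.e. close to 10^9
```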

Interestingly, the false positive does have a lot of police-related language. I think it will be challenging to discriminate between articles that are gun-related and those that are merely police-related, since most gun-related articles are also police-related, but not vice versa.

Looking Ahead

In the remaining two weeks before I head off to become an RA, I'm going to try to improve the classification algorithm I'm currently using (Naive Bayes), and also explore some other possible classification schemes. I will also try to set up a system whereby articles that are downloaded each day are automatically classified - a somewhat ambitious goal, but I think I can manage it if I don't get bogged down too much with other things.

Monday, July 29, 2013

Classification

Preliminary Results Look Good

Using an adapted version of Hilary Mason's script, I've started experimenting with classification.

Training Data

I ran a crawl last week of pages in Slate.com's database of articles about people who have been killed by guns since Newtown. Today, I used my old text-extraction script to scrape the text from those pages and use it as training data. This yielded 1,548,856 words of gun-related text.

Classification

I picked a fairly tricky sentence from one of the gun articles to classify: "Williams said that although she didn’t know Shuford well, he was friends with her son. Detectives do not know of a motive in the crimes".

At first, I copied the text of a few gun-related articles into a file called "guns" which was about the same size as the training data for my other two categories: "arts" and "sports" (provided by Hilary Mason). The classifier gave "guns" a higher probability, but within an order of magnitude of the other categories.

I then dumped all of the training data that I had collected into "guns" and ran the classifier again. This time, the test sentence was categorized as "guns" with a probability almost 3 orders of magnitude greater than the next closest probability (see below). Note: this classifier uses Porter stemming.
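For context, the core of a classifier along these lines looks roughly like the sketch below. This is not Hilary Mason's script or my adaptation of it: the class layout, the add-one smoothing, and the log-space arithmetic are my own simplifications, and it assumes NLTK's PorterStemmer is available.

```python
import math
from collections import defaultdict
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def tokenize(text):
    # Lowercase, whitespace-split, Porter-stem each token.
    return [stemmer.stem(tok) for tok in text.lower().split()]

class NaiveBayesText:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # category -> stem -> count
        self.totals = defaultdict(int)                       # category -> total stems
        self.vocab = set()

    def train(self, category, text):
        for tok in tokenize(text):
            self.counts[category][tok] += 1
            self.totals[category] += 1
            self.vocab.add(tok)

    def log_prob(self, category, text):
        # Add-one smoothing so stems unseen in a category still score.
        v = len(self.vocab)
        return sum(math.log10((self.counts[category][tok] + 1) /
                              (self.totals[category] + v))
                   for tok in tokenize(text))

    def classify(self, text):
        return max(self.counts, key=lambda cat: self.log_prob(cat, text))
```

Usage would amount to calling train() once per category with the "guns", "arts", and "sports" text, then classify() on the test sentence.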

Update: Better Graphs

Better Graphs

These graphs show in a little more detail what our crawls have been yielding. While the frequency of new URLs per site does seem to decrease hyperbolically, the pattern is consistent, i.e. the total number of new URLs is about the same for each crawl, and the distribution looks about the same as well.

The graphs

For these graphs, I have omitted frequencies <= 5. On average, around 950 sites per crawl have <= 5 new pages.


Why So Many Zeroes?

After digging through the crawls, I uncovered why so many sites have zero new pages. Any time there's a broken or moved link, wget either doesn't download anything, or else downloads an "index.html" file for the moved page and then stops. Thus, no pages are downloaded for that site, which means no new pages when the crawl is checked for redundancy.

As for the sites with very few new URLs, I can't find anything wrong with the crawls. It just seems that those sites don't update their content as regularly.

Wednesday, July 24, 2013

Graphs: Update

UPDATE:

The following frequency histogram shows the frequency of new URLs across crawled websites for the crawl that started on 7/23. Critically, this histogram ignores 0 values for frequency. I will write more about this later. Also, apologies for using a different scale than before. I'm still learning how to do this.

Graphs

The Data So Far

Below are three graphs which show the frequency of new URLs for the sites that we've crawled. The total number of URLs is 2478. A URL is considered "new" if it has not appeared in a previous crawl. Each subsequent crawl is compared to all previous crawls. Surprisingly, the number of sites with 0 new URLs is quite high for each graph. I will have to see whether this is an actual signal or due to some error in my code.

This graph shows that, ignoring the frequency of 0 new URLs, the distribution is fairly mound-shaped with a mean around 375. This is expected, since this was the first crawl, so all the URLs would have been new.

This graph shows that the frequency of sites with many new URLs declines rapidly, which is what we would expect.

This final graph shows an even steeper decline in the frequency of new URLs, which is what we would expect, since the database is growing.

These results are just preliminary, and I intend to spend some time now checking whether I'm counting everything correctly. In particular, I am going to investigate the high frequency of sites with 0 new URLs.
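For reference, the "new URL" counting I'll be double-checking amounts to something like this sketch (illustrative only; it assumes each crawl's results have already been grouped into a per-site mapping of URLs, which is not necessarily how my script stores them):

```python
from collections import defaultdict

def new_urls_per_site(crawls):
    """crawls: chronologically ordered list of dicts mapping
    site -> set of URLs found for that site in that crawl.
    Returns, for each crawl, a dict site -> count of URLs never
    seen for that site in any earlier crawl."""
    seen = defaultdict(set)
    counts_per_crawl = []
    for crawl in crawls:
        counts = {}
        for site, urls in crawl.items():
            counts[site] = len(urls - seen[site])  # URLs not in any prior crawl
            seen[site] |= urls
        counts_per_crawl.append(counts)
    return counts_per_crawl
```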