Monday, August 12, 2013


It's not just about getting the data...

I spent today manipulating data. To create separate training and evaluation sets, I split the data I have into chunks that can be mixed and matched into datasets with no overlapping instances.
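As a rough illustration of the idea, here is a minimal Python sketch. It is not the actual script I used: the function names, the chunk count, and the article IDs are all made up for the example. The data is shuffled once and dealt into disjoint chunks, and datasets are then assembled from different chunks so that no instance can appear in two of them.

```python
import random


def split_into_chunks(items, n_chunks, seed=0):
    """Shuffle items and deal them into n_chunks disjoint chunks."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Round-robin dealing: chunk sizes differ by at most one.
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def combine(chunks, indices):
    """Concatenate the chunks at the given indices into one dataset."""
    return [item for i in indices for item in chunks[i]]


# Hypothetical article IDs standing in for the real data.
articles = [f"article_{i}" for i in range(10)]
chunks = split_into_chunks(articles, n_chunks=5)

# Different chunk indices guarantee the two sets share no instances.
train = combine(chunks, [0, 1, 2])
evaluation = combine(chunks, [3, 4])
```

Because each article lands in exactly one chunk, any two datasets built from different chunk indices are guaranteed not to overlap.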

Strangely, I have far more purely "guns" articles for training and evaluation than purely "notguns" articles. Early on, I found a page that indexed a large number of gun-related articles, which let me collect a good deal of categorized data without annotating it by hand. Non-gun articles are harder to come by: although they are far more common overall, any given webcrawl may still pick up some gun-related articles, so every article has to be categorized manually. The only alternative is to collect a list of pages whose articles I can be sure contain no gun-related content.

I have begun to create such a list of non-gun pages, but there have been some problems with the grid lately, so I've had trouble running crawls and processing data in general. My hope is that once these technical difficulties subside, I'll have a chance to collect more data and proceed with feature analysis.

