Monday, August 12, 2013


It's not just about getting the data...

I spent today manipulating data. In order to create separate testing and evaluation sets, I split the data I have into chunks, to be mixed and matched to create datasets without any overlapping instances.

Strangely, I have far more purely "guns" articles for training and evaluation than purely "notguns" articles. Even though gun-related articles are far less common than non-gun-related ones, any given webcrawl has a chance of containing some gun-related articles, which means every article would have to be manually categorized to guarantee a clean "notguns" set. The only alternative is to collect a list of pages I can be sure do not contain gun-related content. Early on, I found a page that indexed a large number of gun-related articles, which allowed me to collect a good deal of categorized "guns" data without having to manually annotate it.

I have begun to compile such a list, but there have been some problems with the grid lately, so I've had trouble running crawls and processing data in general. My hope is that once these technical difficulties subside, I'll have a chance to collect more data and proceed with feature analysis.

Thursday, August 8, 2013

Classification pt3.

Since my last post I have tested the Naive Bayes classifier over a number of thresholds and plotted the results. I also wrote a Perceptron classifier which I have yet to test on a large sample.

Progress with Naive Bayes

In my previous post, I described how I set up the NB classifier and gathered data to test it. I used a simple less than / greater than comparison of the probabilities to categorize instances as either "guns" or "notguns".

Since then, I have used an order-of-magnitude comparison to categorize instances. I noticed that true gun-related instances had a much higher p(guns) than the false positives, so I introduced a threshold: the ratio p(guns) / p(notguns) has to be greater than a given order of magnitude for an instance to be classified as "guns". The chart below shows how precision and recall vary as the classification threshold changes from 10^0.5 to 10^9.5 in increments of 10^0.5.

The point labels indicate the order of magnitude of the threshold. A threshold of 10^5.5 seems to yield optimal results for this sample size. Some of the labels overlap, because R hates me, but the gist is here.

The optimal threshold depends, of course, on whether we care more about recall or precision, or a balance thereof. If we're just using the classifier as a filter for Mechanical Turk submissions, then having high recall might be a priority.
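The thresholding step can be sketched in a few lines (the probabilities below are from my August 6 post; the function name and log-space trick are my own):

```python
import math

def classify(p_guns, p_notguns, threshold_exp=5.5):
    # Classify as "guns" only when p(guns)/p(notguns) exceeds
    # 10**threshold_exp; comparing exponents in log space avoids
    # underflow with probabilities this small.
    ratio_exp = math.log10(p_guns) - math.log10(p_notguns)
    return "guns" if ratio_exp > threshold_exp else "notguns"
```

With the two instances from the earlier post, the true positive (ratio around 10^9) clears the 10^5.5 threshold while the false positive (ratio around 2) does not.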


Ann suggested that before I move on to some black-box ML packages, I get my feet wet with some other linear classifiers besides Naive Bayes. Out of her suggestions, I decided to write a perceptron.

I borrowed the basic code of the perceptron from the Wikipedia page on perceptrons. Since I'm classifying text, however, the simple version wouldn't work, as it was designed to deal with numeric vectors.

I already converted text into numeric data in the NB classifier by creating a hash table of the features and their frequency in the training data. I employed a similar technique for the perceptron by creating a hash table of all the features in the feature space and their weights. By iterating though all of the training data, I set the weights for each feature in the feature space.

One thing to note is that the perceptron can't say anything about features of test data that aren't in the feature space of the training data. NB has the same limitation, but in the perceptron this behavior has to be explicitly coded in, or else an error is raised. I just thought it was interesting.
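A minimal sketch of this setup (the class and method names are mine, not from my actual script):

```python
class TextPerceptron:
    def __init__(self, learning_rate=0.1):
        self.weights = {}  # feature -> weight, over the training feature space
        self.bias = 0.0
        self.lr = learning_rate

    def score(self, features):
        # Features absent from the training feature space contribute
        # nothing, mirroring the "unknown feature" behavior noted above.
        return sum(self.weights.get(f, 0.0) for f in features) + self.bias

    def predict(self, features):
        return 1 if self.score(features) > 0 else 0

    def train(self, data, epochs=10):
        # data: list of (token_list, label) pairs, label 1 = "guns".
        for _ in range(epochs):
            for features, label in data:
                error = label - self.predict(features)
                if error:
                    for f in features:
                        self.weights[f] = self.weights.get(f, 0.0) + self.lr * error
                    self.bias += self.lr * error
```

Iterating over the training data sets the weight for every feature in the feature space, just as with the frequency hash table in the NB classifier.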

Preliminary Results

Testing the perceptron is more difficult than testing the NB classifier, since the training data needs to be annotated. I haven't had a chance to do this in an automated fashion yet, so I only have around 130 training instances. Nevertheless, running against 'wild' data, this nascent perceptron has very high recall and moderate precision. I will post graphs once I play with the threshold and learning rate a little more.

You can find a version of my perceptron code here.

Tuesday, August 6, 2013

Classification pt. 2

Last week I got a little frustrated with the classifying script, so I spent some time updating some old crawling programs. Having finished that, I got back to work on classification today, with some encouraging results. Progress has been slowed by some hardware problems which arose as a result of a recent grid upgrade.

Updating old Scripts

For a while now, I had been meaning to update the scripts that I have been using to run the language crawls which were my first project when I started working with Ann. The scripts are not object-oriented, and there were programs submitting multiple jobs to the grid which in turn submitted other jobs... it was a bit of a mess.

Happily, the new crawling script uses object-oriented techniques to make it easy to customize. A python driver script imports the crawling script and creates a crawl object with certain parameters (output directory name, number of wget jobs to run at a time, the total amount of time the crawl should run for, etc.). A setup() method creates a file system for the crawl data and crawl() begins the process of submitting wget jobs to download the pages. Only a certain maximum number of wget jobs run at any given time, and a logfile keeps track of what URLs are being crawled and any errors that occur with timestamped entries.

The main advantage to the new script is that someone down the line who might want to run their own crawls will have a much easier time implementing what I've written. A number of options can be specified when the crawl instance is created in the driver script. Functions and variables in the script itself are easy to find (e.g. defined in __init__() or labeled appropriately).


Today I started looking at the classification script again. Last week, I became frustrated when the classifier seemed to be biased towards classifying everything as an example of gun violence. With Ann's help, I now have a better idea of what my training data should look like, and I'll soon be looking at refining my features list.

Preparing the Data

After some confusion over what exactly constitutes a "training instance", I began preparing my training data. First, I crawled a bunch of pages that I knew were about gun violence (see my last post). I then used an old cleaning script to strip away the html tags. Finally, I eliminated lines that contained more than 22% non-word characters.

"Non-word characters" included digits [0-9] and other non-letter characters (e.g. "|","_","-", [tab],[newline], etc.). It turns out most of the articles I wanted to keep had ratio of between 18:100 and 22:100 of these characters compared to total number of characters - a number I determined through trial and error given a fairly large set of sample data (more than 600,000 words).

I ended up with pretty clean data: long strings of text, each on its own line, with some shorter strings on their own lines, but very little of the ubiquitous web boilerplate (banners, nav-panel text, etc.). Since the Naive Bayes classifier I'm using counts each newline as an instance, this data was perfect for the "guns" category of training data.

I wanted to gather some non-gun-related data using the same method, but due to some hardware problems (the login nodes weren't automatically mounting the /export drives), I couldn't perform any crawls today. I did, however, compile a list of pages that contained no gun-related text - mostly articles from Wikipedia and the Stanford Encyclopedia of Philosophy. I'll crawl these later.

Instead, I took the "arts" and "sports" data that Hilary Mason used for her binary classification and concatenated them into one file of about 100 lines (instances). This then became the "notguns" training data.


Even though the "guns" category had over 20000 training instances, while the "notguns" category had only 198, the classifier did a petty good job.

Using a similar text-extraction technique as I used with the "guns" training data, I pulled 19 random articles from one of the more recent newspaper snapshots. After manually determining that none of these pertained to gun violence, I removed one article from the "guns" training data and added it to the testing instances.

After training from the "guns" and "notguns" data, I ran the classifier on the testing data and did a simple comparision of the magnitude of the category probabilities. I had the script write the articles to a "guns" file and a "notguns" file depending on the classification. Out of the 20 articles, the classifier successfully identified the single gun-related article, and returned one false positive, classifying the other 18 as "notguns". Here's what the "guns" file looked like (NB: the category probability at the bottom of each paragraph):

(01/03/13) - A Flint teenager has turned himself in, saying he accidentally shot and killed his best friend on New Year's Day. The victim's mother identified him as 15-year-old Gianni Herron. He was found shot in the basement of a home in the 1700 block of North Chevrolet, on the city's northwest side. We are not identifying the alleged shooter, because he is 16 and not charged. He confessed during a news conference, called by Flint pastors, Thursday afternoon. His family members and police were there too.
*pguns: 3.5784633995e-71
*pnotguns: 4.40449325276e-80

01/08/2013 04:08:42 PM MSTLogan County Commissioners Gene Meisner, left, and Rocky Samber were sworn into office by Judge Michael Singer, at the Justice Center on Tuesday. (Callie Jones/Journal-Advocate) STERLING — The new Board of Logan County Commissioners held its first regular meeting Tuesday with newly elected commissioners Gene Meisner and Rocky Samber. Prior to the meeting, both took part in a swearing-in ceremony with other officials, conducted by Chief District Judge Michael Singer, at the Justice Center.
*pguns: 4.47536425751e-62
*pnotguns: 2.47668276083e-62

Notice that the ratio between the probabilities for the false positive (second) instance is roughly 2, whereas for the true positive it is around 10^9. A slightly more sophisticated comparison (and more data) will hopefully yield more accurate results.

Interestingly, the false-positive does have a lot of police-related language. I think it will be challenging to discriminate between articles that are gun-related and merely police-related, since most articles that are gun-related are also police-related, but not vice versa.

Looking Ahead

In the remaining two weeks before I head off to become an RA, I'm going to try to improve the classification algorithm I'm currently using (Naive Bayes), and also explore some other possible classification schemes. I will also try to set up a system whereby articles that are downloaded each day are automatically classified - a somewhat ambitious goal, but I think I can manage it if I don't get bogged down too much with other things.

Monday, July 29, 2013


Preliminary Results Look Good

Using an adapted version of Hilary Mason's script, I've started experimenting with classification.

Training Data

I ran a crawl last week of pages in's database of articles about people who have been killed by guns since Newtown. Today, I used my old text-extraction script to scrape the text from those pages and use it as training data. This yielded 1548856 words of gun-related text.


I picked a fairly tricky sentence from one of the gun articles to classify: "Williams said that although she didn’t know Shuford well, he was friends with her son. Detectives do not know of a motive in the crimes".

At first, I copied the text of a few gun-related articles into a file called "guns" which was about the same size as the training data for my other two categories: "arts" and "sports" (provided by Hilary Mason). The classifier gave "guns" a higher probability, but within an order of magnitude of the other categories.

I then dumped all of the training data that I had collected into "guns" and ran the classifier again. This time, the test sentence was categorized into "guns" with a probability almost 3 orders of magnitude greater than the next closest probability (see below). Note: this classifier uses Porter stemming.

Update: Better Graphs

Better Graphs

These graphs show in a little better detail what our crawls have been yielding. While the frequency of new URLs per site does seem to decrease hyperbolically, the pattern is consistent, i.e. the total number of new URLs is about the same for each crawl, and the distribution looks about the same as well.

The graphs

For these graphs, I have omitted frequencies <= 5. There are an average of around 950 sites per crawl that have <= 5 new pages.

Why So Many Zeroes?

After digging through the crawls, I uncovered why so many sites have zero new pages. The reason is that any time there's a broken or moved link, wget either doesn't download anything, or else downloads an "index.html" file of the moved page and then stops. Thus, there are no pages downloaded for that site, meaning no new pages when the pages are checked for redundancy.

As for the pages with very few new URLs, I can't discover anything wrong with the crawls. It just seems that they don't update their content as regularly.

Wednesday, July 24, 2013

Graphs: Update


The following frequency histogram represents the frequency of new URLs in crawled websites for the crawl that started on 7/23. Critically, this table ignores 0 values for frequency. I will write more about this later. Also, apologies for using a different scale than before. I'm still learning how to do this.


The Data So Far

Below are three graphs which show the frequency of new URLs for the sites that we've crawled. The total number of URLs is 2478. A URL is considered "new" if it has not appeared in a previous crawl. Each subsequent crawl is compared to all previous crawls. Surprisingly, the number of sites with 0 new URLs is quite high for each graph. I will have to see whether this is an actual signal or due to some error in my code.

This graph shows that, ignoring the frequency of 0 new URLs, the distribution is fairly mound-shaped with a mean around 375. This is expected, since this was the first crawl, so all the URLs would have been new.

This graph shows that the frequency of sites with many new URLs declines rapidly, which is what we would expect.

This final graph shows an even steeper decline in the frequency of new URLs, which is what we would expect, since the database is growing.

These results are just preliminary, and I intend to spend some time now checking to see if I'm counting everything correctly. Particularly, I am going to investigate the high frequency of pages with 0 new URLs.

Tuesday, July 23, 2013

Update: Counting

Good News and Bad News

So far the page counting has been mostly successful. Mostly.

The Good News

As mentioned in my previous post, the script that counts new pages removes duplicate files. The good news is that I was looking at “new” files from each crawl, and about 80% of them look like articles. That means with minimal screening we should be able to start extracting data.

The Bad News

The bad news is that in testing this script, I accidentally removed the contents of all of the snapshots from before July 11th, and also the one from the 13th. To be fair, this only consisted of five crawls, and they were done with the previous version of the crawling script. That means they hadn’t run for as long, and so had many fewer pages compared to the newer crawls. I felt stupid, but I don’t think it’ll hurt us too much.

On the Bright Side

The database of all previously downloaded pages is a flat tsv file which contains full relative paths to each of the pages it indexes. Each line has the following format:

[Base website URL]    [Page path]    [Website Path (relative)]

So, for example, one line might look like this:    7_17_2013_npsnap/MD/Baltimore/

A simple query to this database would yield full paths for every file downloaded (combining the second and third fields, using the first field as a key). A script run from the directory containing all the crawls could easily retrieve all or selected pages for processing.

Once I’ve finished indexing the crawls, I’m going to start extracting pages. In the meantime, I’m learning about machine learning and Alchemy.

Monday, July 22, 2013


I spent a few hours this weekend going through Python tutorials on Codeacademy. While most of the stuff was old hat to me, it was useful to get a sense of what other people consider good style. I also learned a few neat tricks, such as list slicing and lambda functions. Anyway, on with the rest of the post.

Counting Pages

I spent much of last week developing a script that would count the number of new pages in each crawl compared to previous crawls of the same website. This approach is in answer to the mirroring problem previously discussed. Since it seems impractical to extract new content by mirroring each site, we will be continuing with the crawls every other day, from which we will create a list of files which appear in the newest crawl but not in previous crawls. We assume that, given crawls of sufficient depth, this method will yield mostly new articles.

So far, the results are encouraging: we seem to be gathering several hundred new URLs per site between crawls. Additionally, the script that counts new pages also removes redundant files to save disk space. A tsv file keeps a record of every URL the counting script has seen, and new crawls are compared against this database.

Looking Ahead

Now that it seems that our crawls are yielding usable data, we can start extracting the text. To do this, I'll be learning how to make calls to the Alchemy API using Python.

I will also begin teaching myself about machine learning. In the next few days, I plan to buy Hilary Mason's 'An Introduction to Machine Learning with Web Data'. That should keep me busy for a while.

Tuesday, July 16, 2013

Python List Comprehension!


while not all(x.condition for x in obj_array):
    do_something_so_powerful()

That is all.

Monday, July 15, 2013


The last week has seen some real challenges in setting up the daily crawls. Most of the challenge has come from the necessity of the following parameters:

  1. Capture New Content
    • We are only interested in capturing new content as it hits the web. Older pages will be picked up in separate Deep Crawls
  2. Timeliness
    • The snapshots need to capture new content daily (or every other day)

Wget: Not an ideal tool

The problem is that wget takes a long time to mirror a site, partly because it proceeds in a breadth-first fashion. Since the articles we're looking for are typically deep in the directory structure of the sites we're crawling, wget needs a long time to find them. This also means it scoops up a large number of pages we don't want, not to mention re-downloading pages we have previously downloaded already.

Additionally, many websites have auto-generated content, which causes wget to tie itself in knots (e.g. downloading each page of a calendar from now until eternity).

In order to combat these woes, I spent a considerable amount of time last week reading forums, advice boards and wget's own formidable man page in order to better understand the tool.

I discovered that wget was designed with time-stamping functionality that, in theory, allows it to only download pages that are new or have been updated since a local copy was created. Unfortunately, this only works if certain information is present in the file header, and since more than 98% of the pages I tested lacked this information, wget defaults to the time of download, which is not very useful.

On the bright side, I learned about a lot of useful options, such as the --spider flag which, when combined with the -r (recursive download) flag, will have wget create a file with all the URLs on a particular site without actually downloading any content. This got me thinking about a possible solution to our problem.

(As a side note, I did look into tools other than wget for downloads. I found that as poorly suited as wget is to the task at hand, every other tool would be even worse.)

Not Total Kludge...

The current plan is to:

  1. Run wget --spider -r on each website in our list
  2. Create a list of all links that point to files (excluding 'index.html')
  3. Compare this list to the list of files downloaded in the last crawl
  4. Run a non-recursive wget on each of the unique links
  5. Save output to a time-stamped directory system which preserves location information (city, state, etc.)
I worry about the inelegance of this solution, both from an efficiency standpoint and from the standpoint of someone else being able to implement it in the future. Since I couldn't think of any other viable solution, however, I began implementing this approach.
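Steps 2-4 above can be sketched in Python (the log-line format is what wget prints in my tests; treat it as an assumption):

```python
import re

def spider_urls(log_text):
    # wget --spider -r logs each URL it visits on a line like
    # "--2013-07-15 10:00:00--  http://example.com/news/a.html".
    urls = re.findall(r"--\d{4}-\d{2}-\d{2} [\d:]+--\s+(\S+)", log_text)
    # Step 2: keep links to files, excluding 'index.html', without duplicates.
    seen, out = set(), []
    for u in urls:
        if u.endswith("index.html") or u in seen:
            continue
        seen.add(u)
        out.append(u)
    return out

def links_to_fetch(log_text, previously_downloaded):
    # Step 3: compare against the last crawl's file list; step 4 would
    # then run a non-recursive wget on each of these unique links.
    return [u for u in spider_urls(log_text) if u not in previously_downloaded]
```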

I have been testing what I've written as I go along, and at this point I think I'm fairly close to a working script. There is a problem, however.

Wget Strikes Again!

In order for the approach that I have outlined to work, we will need one initial Deep Crawl from which to generate the first Blacklist of URLs. As a test, I started mirroring on Friday. It hasn't stopped yet.

Possible solutions involve using the -R option, which tells wget to reject files whose names match certain wildcard patterns.

Other Notes

In moments of frustration I have turned to other tasks, which I will briefly discuss now:

I made a major change to the way that jobs are scheduled in the crawling scripts I've been using. Previously, the script used qstat -j to check the job status of every wget job that had been submitted. The purpose of this was to limit the number of jobs running at any one time, but this method was causing delays of up to a minute as the list of jobs previously submitted increased to over 2000. Now, if a job has been completed, the script removes it from the list of job numbers to check, which has eliminated the delay.
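The fix boils down to pruning the tracking list instead of re-polling every job ever submitted (a sketch; in the real script, is_running wraps qstat -j and submit wraps qsub):

```python
def prune_and_submit(pending, running, max_jobs, is_running, submit):
    # Drop jobs that have finished, so each poll only touches jobs that
    # might still be in the queue, then top the pool back up to max_jobs.
    running[:] = [j for j in running if is_running(j)]
    while pending and len(running) < max_jobs:
        running.append(submit(pending.pop(0)))
    return running
```

Because completed jobs leave the list, the per-poll cost stays proportional to the number of currently running jobs rather than the total number ever submitted.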

I have started migrating the language crawls I did for Ann earlier this year away from the old scripts that have been managing them to the new, Object Oriented scripts. The new scripts aren't working just yet, but they're getting there. The purpose of this is to make it possible for someone down the road to use my scripts as a black box crawling utility. Currently, they would have to go in and edit numerous lines of code in the main script.

In the vain hope that they will be useful, I have been maintaining the daily snapshots that are currently running. I wonder, though, if I should begin removing files we probably aren't going to use.

Monday, July 8, 2013

Crawls, Crawls, Crawls

Last week, Ann and I met with Chris to discuss short-term and long-term goals. After spending a few days thinking about how I'm going to proceed, and trying out a few of the options, I'm ready to report.

Long Term Goals

The overall structure of the project moving forward is as follows:

Finish Setting up the Crawls

The first order of business is to make sure all the data is coming in as we want it. That means having deep, historical crawls for each of our newspapers, as well as daily snapshots which contain only newly-created content.

Text Extraction

Once we have the data rolling in, we have to make it usable. Chris suggested using an online service called Alchemy. Apparently, it's tailored to just the kind of job we're doing, since it specializes in extracting text from html pages, even if they aren't well- or consistently-written. Chris suggested using this approach over lxml and BeautifulSoup since the structure of the pages won't necessarily be known, and these tools are mostly parsers. Even though we may lose some content (short bits of text like bylines, for instance), Alchemy will probably be our best bet. I have yet to play around with it, however.

Classifying the Data: Using Machine Learning

Once we have workable data, Chris and Ann will help me use some black-box machine learning tools to automatically pick out which articles have to do with gun violence. This will involve picking good keywords, choosing training data and no doubt some hand-annotation. My favorite.

Mechanical Turk

Once we have a fairly reliable way of picking out articles concerning gun violence, the next step will be to set the Turkers loose on the data. This will require brushing up my Java skills, learning about xml and figuring out how Apache Ant works. The questions that we will be asking the Turkers will probably be simple at first (eg: 'Does this article concern an act of gun violence?'), but will, with luck, eventually seek to answer some of the questions I outlined in my previous post.

Short Term Goals

A journey of a thousand miles begins with a single step.

In-Depth Crawls

Using the Object Oriented Python scripts I have developed, I have begun some in-depth crawls. First, I mirrored 3 sample websites: Joe Nocera's 'Gun Report' blog, The Baltimore Sun, and The Oregonian, my hometown paper.

These crawls went pretty well, so I expanded the scope to the entire list of websites we got from Marcus. This proved less successful; only pages from a handful of states downloaded.

I think part of the problem is that the crawls didn't have enough time to run. Each website was given less than a minute to mirror, so I changed the crawl frequency to every 2 days instead of every day. I also added a logfile to the crawl which stores information such as the initial parameters of the crawl, and then information about each wget job that is submitted to qsub, including a timestamp. If increasing the time limit doesn't help, I'm hoping this will let me get to the bottom of the problem.


Setting up the snapshots may prove more challenging. Since we only care about the most recent content, time is a limiting factor. Mirroring each site completely would not only use up disk space, but would take too much time for daily or even every-other-day crawls.

I have read through the wget man page to see if we can use the timestamps of the external pages to determine whether we should download them. It looks like the mirroring functionality combined with the timestamping functionality may be useful; however, the timestamping seems to work in only a rudimentary fashion. wget -N will check the local version of a file and download the server version if either the local version doesn't exist or the server version has been modified more recently. This is all well and good, except we don't want to keep an up-to-date version of the sites we're mirroring - we want to capture *new* content. At this point I see several ways of proceeding:

  1. Decrease snapshot frequency to allow for full-site mirroring. Later, we'll use Python scripts and timestamps to get only new articles
  2. Screw around with wget for a while to see if we can get the functionality we're looking for, i.e. it does the timestamp-checking for us.
    • The problem here is that I don't yet know how to get wget to check remote timestamps against files in crawls/foo/ but download the new files in directory crawls/bar/.
  3. Use a different download tool, such as curl. Currently, I don't think curl is installed on the nodes to which I have access. I also haven't done any research yet on curl's functionality

Tuesday, July 2, 2013

Back In Baltimore

And Now for Something...

I'm back in Baltimore, and Ann has me working on a slightly different project today. I will be annotating a data set of Latin bird names to see if they translate well into their English common names. If this goes well and I finish by 5pm or so, I will do some work on the newspaper crawls I started a fortnight ago.

So far my favorite bird name is the "pyrohypogaster" which translates to "fire below the throat" from Latin.

Sunday, June 23, 2013

Looking at the Data So Far

Good News and Bad News

The crawls have been running for about a week now, and it's time to look at some of the data we've been gathering.

The Good News

The crawls seem to be running smoothly. The scripts are working, and all the output is getting to the right places. Some of the qsub output files showed segfault errors, but this is consistent with what has happened in the past, and they occur seemingly at random.

Additionally, from the sites I've looked at, it seems that having one file per article would be reasonable for most sites. However...

The Not So Good News

After poking through a bunch of crawled papers, it seems that the terminal branch of most directories has only an index.html file in it. Thinking that wget might have run out of time, I re-ran the crawl on one site without a time limit, and indeed many more pages showed up. I will increase the crawl time a little, but we may have to switch to crawling every other day.

One problem is that, for most newspapers, the actual articles are located at the furthest ends of directory tree. That means that nearly all the other pages, superfluous to us, must be downloaded before we get any of the content we're after.

Another problem is that I'm still not 100% certain how all the wget flags work. When we started, I copied the command from a source that Ann gave me. I think a better understanding of what the --level (recursion depth) flag does will help. Also using -R to filter some pages may speed things up.

Moving Forward

I'm going to be out of town next week, and then the following week I'll be back in Baltimore. When I get back, I aim to focus my attention on the following goals:

  1. Play with wget flags so that I really understand what they do
  2. Refine crawl times so that wget can run long enough to get the pages that we want
  3. Start adapting the cleaning script I've already written to clean up some of the articles we've crawled

Wednesday, June 19, 2013

Background pt. 2

Gun Violence Research from NAP Report

My summary

This information comes from a report issued by the National Academies Press entitled "Priorities for Research to Reduce the Threat of Firearm-Related Violence", published in 2013. Contributors include the Institute of Medicine (IOM) and the National Research Council (NRC).

After reading the article, here are my initial thoughts about what parameters we should look at:

  1. Characteristics of violence
    1. Homicide, suicide, fatal, non-fatal, accidental
    2. Role of controlled substances
    3. Type of firearm / ammunition used
  2. Location
    1. Rural vs. Urban
    2. Type of location
      1. In a home, park, school, etc.
    3. General geographic information
  3. Victim / Perpetrator information
    1. Age, sex, race
    2. Relationship of victim to perpetrator
    3. History of mental illness and other risk factors

My Notes

***(I include these as a sort of summary of parts of the report I thought would be relevant to our study. Page numbers refer to the page of the PDF document I viewed)***

"Applying Public Health Strategies to Reducing Firearm Violence" (p.29)
            This section describes how strategies can be implemented to prevent violence similar to those taken with tobacco/alcohol and motor vehicles.
            "Such strategies are designed to interrupt the connection between three essential elements: the “agent” (the source of injury [weapon or perpetrator]), the “host” (the injured person), and the “environment” (the conditions under which the injury occurred)" (p.29)
                        1. Agent - The source of injury
                        2. Host - The injured person
                        3. Environment - conditions under which injury occurred
            There are 5 areas where more information about gun violence is needed (p.33):
                        1. characteristics of firearm violence,
                        2. risk and protective factors,
                        3. interventions and strategies,
                        4. gun technology, and
                        5. influence of video games and other media.
            [For the purposes of our investigation, I suggest focusing on (1) and (2), which are discussed below.]

"Impact of Existing Federal Restrictions on Firearm Violence Research" (p.34)
            Information is lacking on:
                        1. Gun Sales, ownership, possession
                        2. Names of gun purchasers
"Policy makers need a wide array of information, including community-level data and data concerning the circumstances of firearm deaths, types of weapons used, victim–offender relationships, role of substance use, and geographic location of injury — none of which is consistently available" (p.35)
                        3. Circumstances of death
                        4. Types of weapons used
                        5. Victim-offender relationships
                        6. Role of substance use
                        7. Geographic information
"Basic information about gun possession, acquisition, and storage is lacking" (p.36), [however I don't think this is the kind of information we will be able to gather, so I won't write much about it]
"Data about the sources of guns used in crimes are important because the means of acquisition may reveal opportunities for prevention of firearm related violence" (p.36)
            Currently some information is collected by the ATF
                        Only after a gun is used in a crime, though, and the data do not track changes in ownership, so they are not representative of crimes overall
Possible source of information: Weapon-Related Injury Surveillance System (WRISS) which some municipalities use

            Basically, not much is known
            To Look Into:
                        1. Types and number of firearms that exist in the US
                                    "In general, there are three characteristics that define individual guns: gun type, firing action, and ammunition" (p.39)
            Types of Firearm Violence:
                        1. Broad level: fatal or non-fatal
                        2. Fatal: homicides, suicides, unintentional
                                    a. Mass-shootings sometimes another category
                        3. Non-fatal: unintentional vs. intentional, threats, defensive use
                                    Though there are cross-classifying characteristics, such as age, sex, etc., these categories are useful.

What is known / not known about the following occurrences:
                        Fairly well known:
                                    Urban vs. Rural
                                    Age, Sex, Race
                        Not well known:
                                    Premeditated or Impulsive?
                                    Use of firearm vs. other method
                        Fairly Well known:
                                    Victim-Offender relationship (though still important)
                                                Race, Sex, age, etc.
                                    Domestic violence related shootings
                                    Type of gun used
                        In general, more is known about homicides
            Unintentional Fatalities
                        Fairly well known:
                                    Self inflicted?
                                    Self Defense?
                                    Rural vs. urban
            Mass Shootings
                        Not well known:
                                    Characteristics of suicides associated with mass murders
                        Fairly well known
                                    Intentional vs. unintentional
                                    Self-inflicted vs. other-inflicted
                                    Use in assault (as a threat)

SUMMARY (p.45):
Characterize differences in nonfatal and fatal gun use across the United States. Examples of topics that could be examined:
            1. What are the characteristics of non-self-inflicted fatal and nonfatal gun injury?
                        o What attributes of guns, ammunition, gun users, and other circumstances affect whether a gunshot injury will be fatal or nonfatal?
                        o What characteristics differentiate mass shootings that were prevented from those that were carried out?
                        o What role do firearms play in illicit drug markets?
            2. What are the characteristics of self-inflicted fatal and nonfatal gun injury?
                        o What factors (e.g., storage practices, time of acquisition) affect the decision to use a firearm to inflict self-harm?
                        o To what degree can or would prospective suicidal users of firearms substitute other methods of suicide?
            3. What factors drive trends in firearm-related violence within subpopulations?
            4. What factors could bring about a decrease in unintentional firearm-related deaths?

Situational factors associated with firearm violence (p.48)
            1. Presence of drugs / alcohol
            2. Intent: to acquire money, or as an impulse
                        Need to protect personal status/property
                                    "Some social and psychological research suggests that the need to defend social status may increase the likelihood and severity of response to provocation in the presence of an audience"(Griffiths et al., 2011; Papachristos, 2009) (p.48)
            3. Gang involvement
            4. Other situational factors such as excessive heat (Anderson et al., 1995), the presence of community disorder (or “broken windows”)
            5. Specific locations, e.g.: house/apartment, public street, natural area, vehicle, parked car, athletic area, hotels/motels, commercial areas

Study-proposed research questions (p.50)
            Three important research topics were identified by the committee:
                        1) factors associated with youth having access to, possessing, and carrying guns;
                        2) the impact of gun storage techniques on suicide and unintentional injury, and
                        3) “high-risk” geographic/physical locations for firearm violence.
            Youth Gun Violence [probably can't tackle most of these]
                        Examples of topics that could be examined:
                                    o Which individual and/or situational factors influence the illegal acquisition, carrying, and use of guns by juveniles?
                                    o What types of weapons do youths obtain and carry?
                                    o How do youths acquire these weapons, e.g., through legal or illegal means?
                                    o What are key community-level risk and protective factors(such as the role of social norms), and how are these risk and protective factors affected by the social environment and neighborhood/community context?
                                    o What are key differences between urban and rural youth with regard to risk and protective factors for firearm-related violence?
            Gun Storage [research topic 2 above]
                        o What are the associated probabilities of thwarting a crime versus committing suicide or sustaining an injury while in possession of a firearm?
                        o What factors affect this risk/benefit relationship of gun ownership and storage techniques?
                        o What is the impact of gun storage methods on the incidence of gun violence—unintentional and intentional—involving both youths and adults?
                        o What is the impact of gun storage techniques on rates of suicide and unintentional injury?
            High-Risk Physical Locations [research topic 3 above]
                        1. What are the characteristics of high- and low-risk physical locations?
                        2. Are the locations stable or do they change?
                        3. What factors in the physical and social environment characterize neighborhoods or sub-neighborhoods with higher or lower levels of gun violence?
                        4. Which characteristics strengthen the resilience of specific community locations?
                        5. What is the effect of stress and trauma on community violence, especially firearm-related violence?
                        6. What is the effect of concentrated disadvantage on community violence, especially firearm-related violence?

More information is needed on the effectiveness of intervention programs. Is this something we'll be able to consider? (p. 61).
            Possible factors: Childhood education, poverty, substance use

More information is needed about the effectiveness of gun safety technology

Sunday, June 16, 2013

Changing Tack

No more Newspapermapping

The person behind the site got back to us with the database from his website. While I'm a little sad that 5+ hours of my time have been for naught, I'm glad I don't have to spend another 10+ hours pressing ctrl-c, ctrl-v...

I cleaned up the database by first eliminating all the non-English papers, and then adding "state" back in for about 50 entries that were lacking that field. I removed a handful of links whose connections timed out when I tried to visit their pages. As of right now, a crawl is running on the new URLs.
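The cleanup steps above (keeping only English papers, patching missing "state" fields, and dropping links that time out) could be scripted along these lines. This is a sketch, not the actual script: the column names, the `state_fixes` lookup, and the `link_alive` check are my own illustration.

```python
import urllib.request
from urllib.error import URLError


def link_alive(url, timeout=10):
    """Return True if the URL responds before the timeout."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (URLError, OSError, ValueError):
        return False


def clean_rows(rows, state_fixes, check=link_alive):
    """Keep English papers, patch missing states, drop dead links.

    rows:        list of dicts, e.g. from csv.DictReader
    state_fixes: hand-made {paper name: state} lookup for missing fields
    check:       link-liveness test (injectable so it can be stubbed out)
    """
    cleaned = []
    for row in rows:
        # 1. Eliminate all the non-English papers.
        if row.get("language", "").lower() != "english":
            continue
        # 2. Add "state" back in for entries lacking that field.
        if not row.get("state"):
            row = dict(row, state=state_fixes.get(row["name"], ""))
        # 3. Remove links whose connections time out.
        if not check(row["url"]):
            continue
        cleaned.append(row)
    return cleaned
```

Making the liveness check injectable keeps the filtering logic testable without hitting the network.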

Thursday, June 13, 2013

Newspapermap Update

States Finished

  1. Washington
  2. Oregon
  3. California
  4. Idaho
  5. Utah
  6. Arizona
  7. Nevada
  8. Montana
  9. Wyoming
  10. Colorado
  11. New Mexico
  12. North Dakota
  13. South Dakota
  14. Nebraska
  15. Kansas
  16. Oklahoma
  17. Texas

This last go-round I did 194 in an hour!

Also, I am updating the list on the CLSP nodes, so these pages are being crawled as they are added. We have 698 URLs so far.

Wednesday, June 12, 2013

Newspapermap Update

States Completed

  • Washington
  • Oregon
  • California
  • Idaho
  • Utah
  • Arizona

Total URLs: 324
Total Time spent: 2.5hrs

I find I can only really do this for an hour at a time, or else I start to go kind of nuts.

Saturday, June 8, 2013

Object Oriented Crawls

A quick update

Today I re-wrote the crawling script in an object-oriented fashion, which took me about 7 hours. I told Ann I would do this later, since the top priority is starting the newspaper crawls, but I figured if I could get a working version by the end of today, I'd have killed two birds with one stone. The old versions of the crawling scripts are still in use for the language crawls, but the way they are written would have made incorporating a new list of newspapers an extremely involved process. I am testing my new version on the newspapers I have thus far culled, and so far all is going well. I'm storing the data on a12, as per Carl's suggestion. Note that I have changed my dating convention for labeling newspaper crawls.
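A minimal sketch of what an object-oriented crawl might look like. The class name, method names, and date-based label here are all hypothetical placeholders, not the actual code:

```python
import datetime
import urllib.request


class NewspaperCrawl:
    """One crawl over a list of newspaper URLs (illustrative sketch)."""

    def __init__(self, urls, out_dir):
        self.urls = list(urls)
        self.out_dir = out_dir
        # Hypothetical labeling convention: prefix plus the crawl date.
        self.label = "newspapers-" + datetime.date.today().isoformat()

    def fetch(self, url):
        """Download one page; real code would add retries and throttling."""
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()

    def run(self):
        """Crawl every URL, returning {url: page bytes, or None on failure}."""
        pages = {}
        for url in self.urls:
            try:
                pages[url] = self.fetch(url)
            except OSError:
                pages[url] = None  # record the failure, keep crawling
        return pages
```

The point of the rewrite: swapping in a new newspaper list becomes a one-line change (a different `urls` argument) instead of editing the script itself.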

Time for bed now

Wednesday, June 5, 2013

Newspapermap Update

States completed

  • Oregon
  • Washington
  • Idaho

California is approx. 50% complete. So far, I'm working at about 150 newspapers per hour.

More to come!

Background & Newspapermap - Pt. 1


The fun begins

I started logging information for newspapers in Oregon. At present, I can log and classify 25 newspapers in 8 minutes (roughly 188 per hour). The most time-consuming part is copying the link from the balloon that pops up when I click on a newspaper's location. I'm thinking about better ways of doing this, but at present, I should be able to finish most of the west coast in a week or so. The east coast will take longer, as there are more newspapers and they're more densely packed - hence more zooming will be necessary. I'm still trying to think of an effective way of MTurking this.

Statistical Background on Gun Violence

So far

At this stage of the project, I have been focused on ascertaining what information about gun violence was previously collected by the CDC and other government agencies, and how these statistics are gathered.

What the CDC used to collect

I found a pre-1997 report which focuses on injuries and deaths related to firearms. I also found a table in a more general report from 2001 which lists causes of death by "mechanism" - "firearms" is one of the categories. That report also illustrates some of the "circumstances" of firearm injuries - e.g., whether they occur at work or due to "interpersonal violence" - and includes this handy graphic:

Handy graphic from the CDC's Surveillance for Fatal and Nonfatal Firearm-Related Injuries --- United States, 1993--1998, (Gotsch et al.)

While this report does speak to how firearms are used, it doesn't really say anything about the demographics of the people involved in these types of incidents. Additionally, I have found no information for any year more recent than 1998.

How they get the data

It turns out, the CDC and the Consumer Product Safety Commission team up to gather data from something called NEISS - the "National Electronic Injury Surveillance System" - a database of information from various hospitals around the country. To give an idea of the sample sizes, the 2001 report above included data from 100 hospitals. NEISS can ostensibly be queried from its website, but when I tried, there was a recurring JavaScript error.

The Bureau of Justice Statistics also uses data from NEISS to inform its reports. It also runs its own Firearm Inquiry Statistics (FIST) program, which covers 1994 - 2005 and includes "[d]ata ... collected directly from state agencies conducting background checks and from local checking agencies and [including] the number of firearm applications made to the agency, firearm applications rejected by the agency, and the reasons for rejection". (Example summary information from the 2005 report: only 1.6% of firearm applications were denied in 2005 - 46% because the requester had a previous felony conviction.)

Next Steps

The next steps for me will

Thursday, May 30, 2013

Getting Started

After having some time at home, I'm starting to get back to work. Chris, Ann and I "hung out" via Google+ on Tuesday, and we went over some of the goals for what I'll be doing over the summer.

The project as I understand it is this: use the language classifier to gather data about gun violence in the US from newspapers across the country, in order to compensate for the recent government policy of not collecting gun violence statistics. One of my jobs will be to provide the data - daily crawls of news articles to be fed into the classifier.

Here are my goals at the moment:

  1. Find out what statistics the government currently gathers on gun violence
    • Outcomes of home invasions where firearms are involved
    • Victim ages, race, gender, location, injuries, etc.
    • What agency gathers these statistics? (BJS, FBI, CDC?)
  2. Find out what statistics the CDC and other government agencies gathered before the ban
  3. Start manually cataloging newspapers from
    • Determine feasibility for the entire map
    • Consider MTurk options for compiling the list
  4. Maintain this blog to keep Chris and Ann informed of my progress

Ann also asked me to add a few newspapers to the language crawls I had previously started. In doing this, I realized that I had to update the URL counts for each language, which took a little extra time. At this point, though, I'm fairly certain I've added the URLs to the list successfully, and I'm running a crawl today to make sure.
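Updating the per-language URL counts amounts to re-counting the entries in each language's URL list. A small sketch, assuming one `<language>.txt` list of URLs per file (the file layout is my guess, not the actual setup):

```python
import os


def url_counts(list_dir):
    """Count non-blank lines in each <language>.txt URL list in a directory."""
    counts = {}
    for name in sorted(os.listdir(list_dir)):
        if not name.endswith(".txt"):
            continue  # skip anything that isn't a URL list
        lang = name[: -len(".txt")]
        with open(os.path.join(list_dir, name)) as f:
            counts[lang] = sum(1 for line in f if line.strip())
    return counts
```

Run after every edit to the lists, this removes the need to update the counts by hand.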