After having some time at home, I'm starting to get back to work. Chris, Ann and I "hung out" via Google+ on Tuesday, and we went over some of the goals for what I'll be doing over the summer.
The project, as I understand it, is this: use the language classifier to gather data about gun violence in the US from newspapers across the country, in order to compensate for the recent government policy of not collecting gun violence statistics. One of my jobs will be to provide the data: daily crawls of news articles to be fed into the classifier.
Here are my goals at the moment:
- Find out what statistics the government currently gathers on gun violence
  - Outcomes of home invasions where firearms are involved
  - Victim ages, race, gender, location, injuries, etc.
  - What agency gathers these statistics? (BJS, FBI, CDC?)
- Find out what statistics the CDC and other government agencies gathered before the ban
- Start manually cataloging newspapers from newspapermap.com
  - Determine feasibility for the entire map
  - Consider MTurk options for compiling the list
Ann also asked me to add a few newspapers to the language crawls I had previously started. In doing this, I realized that I had to update the URL counts for each language, which took a little extra time. At this point, though, I'm fairly certain I've added the URLs to the list successfully, and I'm running a crawl today to make sure.
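
One way to avoid updating the per-language counts by hand would be to recompute them from the URL list itself whenever it changes. The snippet below is only a minimal sketch of that idea: the `lang<TAB>url` line format, the function name `count_urls_by_language`, and the sample URLs are all assumptions made for illustration, not the actual format of the crawl list.

```python
from collections import Counter


def count_urls_by_language(lines):
    """Count distinct URLs per language.

    Assumes a hypothetical format of one 'lang<TAB>url' entry per line;
    blank lines and lines starting with '#' are ignored.
    """
    seen = set()
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        lang, _, url = line.partition("\t")
        if not url:
            continue  # skip malformed lines with no URL field
        if (lang, url) in seen:
            continue  # do not count the same URL twice for a language
        seen.add((lang, url))
        counts[lang] += 1
    return counts


# Made-up sample entries, including one duplicate.
sample = [
    "en\thttp://example.com/news",
    "en\thttp://example.org/daily",
    "es\thttp://example.net/noticias",
    "en\thttp://example.com/news",
]

print(dict(count_urls_by_language(sample)))  # {'en': 2, 'es': 1}
```

Deriving the counts this way means a newly added newspaper only has to be entered once, in the list, and a duplicate entry does not inflate the total.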