Monday, July 8, 2013

Crawls, Crawls, Crawls

Last week, Ann and I met with Chris to discuss short-term and long-term goals. After spending a few days thinking about how I'm going to proceed, and trying out a few of the options, I'm ready to report.

Long Term Goals

The overall structure of the project moving forward is as follows:

Finish Setting up the Crawls

The first order of business is to make sure all the data is coming in as we want it. That means having deep, historical crawls for each of our newspapers, as well as daily snapshots which contain only newly-created content.

Text Extraction

Once we have the data rolling in, we have to make it usable. Chris suggested using an online service called Alchemy. Apparently, it's tailored to just the kind of job we're doing, since it specializes in extracting text from html pages, even if they aren't well- or consistently-written. Chris suggested using this approach over lxml and BeautifulSoup since the structure of the pages won't necessarily be known, and these tools are mostly parsers. Even though we may lose some content (short bits of text like bylines, for instance), Alchemy will probably be our best bet. I have yet to play around with it, however.
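
Just to fix ideas before I try it, here is roughly what I expect a call to look like in Python. The endpoint name and the response fields are only my reading of the docs Chris links in the comments, and the API key and example URL are placeholders, so treat this as an untested sketch.

    import requests

    # Untested sketch of the text-extraction step: the endpoint name and the
    # response fields ("status", "text") are my reading of the AlchemyAPI docs,
    # and the API key is a placeholder.
    ALCHEMY_ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetText"
    API_KEY = "YOUR_API_KEY"

    def extract_text(article_url):
        """Return the extracted article text for one URL, or None on failure."""
        params = {"apikey": API_KEY, "url": article_url, "outputMode": "json"}
        resp = requests.get(ALCHEMY_ENDPOINT, params=params)
        data = resp.json()
        if data.get("status") != "OK":
            return None
        return data.get("text")

    # Example call with a made-up URL:
    # print(extract_text("http://www.oregonlive.com/news/some-article.html"))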

Classifying the Data: Using Machine Learning

Once we have workable data, Chris and Ann will help me use some black-box machine learning tools to automatically pick out which articles have to do with gun violence. This will involve picking good keywords, choosing training data and no doubt some hand-annotation. My favorite.
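
I don't yet know which tools Chris and Ann have in mind, but as a sketch of the kind of pipeline I'm picturing, here is a toy version using scikit-learn as a stand-in, with made-up training sentences standing in for hand-annotated articles.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy sketch of the classification step; the two training "articles" are
    # made up purely for illustration, and scikit-learn is only a stand-in.
    train_texts = [
        "Two men were shot outside a downtown bar late Friday night.",
        "The city council approved a new budget for road repairs.",
    ]
    train_labels = [1, 0]  # 1 = gun violence, 0 = not

    vectorizer = TfidfVectorizer(stop_words="english")
    clf = LogisticRegression()
    clf.fit(vectorizer.fit_transform(train_texts), train_labels)

    # Score unlabeled crawl text and hand-check the most confident hits.
    new_texts = ["Police say a shooting left one person injured downtown."]
    print(clf.predict_proba(vectorizer.transform(new_texts))[:, 1])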

Mechanical Turk

Once we have a fairly reliable way of picking out articles concerning gun violence, the next step will be to set the Turkers loose on the data. This will require brushing up my Java skills, learning about XML, and figuring out how Apache Ant works. The questions that we will be asking the Turkers will probably be simple at first (e.g., 'Does this article concern an act of gun violence?'), but will, with luck, eventually seek to answer some of the questions I outlined in my previous post.
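
The real pipeline will go through the Java/Ant command-line tools, but just to pin down what the Turk input might contain, here is a throwaway Python sketch that writes a question file for a batch of candidate articles. The field names and the example URL are my own placeholders.

    import csv

    # Throwaway sketch (not the Java/Ant tools we'll actually use): write a
    # simple input file pairing each candidate article with the question we
    # want to ask. The field names and the example URL are placeholders.
    candidates = ["http://example.com/some-candidate-article.html"]
    question = "Does this article concern an act of gun violence?"

    with open("hit_input.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["article_url", "question"])
        for url in candidates:
            writer.writerow([url, question])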

Short Term Goals

A journey of a thousand miles begins with a single step.

In-Depth Crawls

Using the object-oriented Python scripts I have developed, I have begun some in-depth crawls. First, I mirrored three sample websites: Joe Nocera's 'Gun Report' blog, The Baltimore Sun, and The Oregonian, my hometown paper.

These crawls went pretty well, so I expanded the scope to the entire list of websites we got from Marcus at Newspapermap.com. This proved less successful; only pages from a handful of states were downloaded.

I think part of the problem is that the crawls didn't have enough time to run. Each website was given less than a minute to mirror, so I changed the crawl frequency to every 2 days instead of every day, giving each crawl a longer window in which to run. I also added a logfile which stores the initial parameters of the crawl, plus information about each wget job that is submitted to qsub, including a timestamp. If increasing the time limit doesn't help, I'm hoping this will let me get to the bottom of the problem.
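
The scripts themselves are more involved than this, but stripped down, each crawl run now does roughly the following. The site list, paths, and wget flags here are simplified stand-ins, not the real configuration.

    import datetime
    import subprocess

    # Stripped-down sketch of the crawl driver: each site gets its own wget
    # mirror job submitted through qsub, and every submission is logged with a
    # timestamp. Paths, the site list, and the wget flags are stand-ins.
    SITES = ["www.baltimoresun.com", "www.oregonlive.com"]  # example subset
    CRAWL_DIR = "/path/to/crawls"          # placeholder
    LOGFILE = "/path/to/crawls/crawl.log"  # placeholder

    def submit_mirror_job(site):
        wget_cmd = "wget --mirror --no-parent -P {0}/{1} http://{1}/".format(CRAWL_DIR, site)
        job_name = "crawl_" + site.replace(".", "_")
        # qsub reads the job script from stdin when no script file is given
        subprocess.run(["qsub", "-N", job_name], input=wget_cmd, text=True, check=True)
        with open(LOGFILE, "a") as log:
            log.write("{} submitted: {}\n".format(
                datetime.datetime.now().isoformat(), wget_cmd))

    for site in SITES:
        submit_mirror_job(site)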

Snapshots

Setting up the snapshots may prove more challenging. Since we only care about the most recent content, time is a limiting factor: mirroring each site completely would not only use up disk space but would also take too much time for daily or even every-other-day crawls.

I have read through the wget man page to see if we can use the timestamps of the external pages to determine if we should download them. It looks like wget's mirroring functionality combined with its timestamping functionality may be useful; however, the timestamping seems to work in only a rudimentary fashion. wget -N www.foo.com/bar will check for a local version of www.foo.com/bar and download the server version if either the local version doesn't exist or the server version has been modified more recently. This is all well and good, except we don't want to keep an up-to-date version of the sites we're mirroring; we want to capture *new* content. At this point I see several ways of proceeding:

  1. Decrease snapshot frequency to allow for full-site mirroring. Later, we'll use Python scripts and timestamps to pull out only the new articles (a rough sketch of that step follows this list).
  2. Screw around with wget for a while to see if we can get the functionality we're looking for, i.e., have it do the timestamp-checking for us.
    • The problem here is that I don't yet know how to get wget to check remote timestamps against files in crawls/foo/ but download the new files in directory crawls/bar/.
  3. Use a different download tool, such as curl. Currently, I don't think curl is installed on the nodes to which I have access. I also haven't done any research yet on curl's functionality.
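
To make option 1 a little more concrete, the post-crawl step could be as simple as walking the mirrored directory and keeping only the files modified since the previous snapshot. The crawl path and the two-day cutoff below are placeholders; a real version would read the previous crawl's timestamp from the logfile. As far as I can tell from the man page, wget preserves the server's Last-Modified time on downloaded files, which is what would make this worth trying.

    import os
    import time

    # Sketch of option 1's post-processing: after a full mirror, keep only the
    # files whose modification time is newer than the cutoff. Placeholders below.
    CRAWL_DIR = "/path/to/crawls/foo"
    cutoff = time.time() - 2 * 24 * 3600  # stand-in for "since the last snapshot"

    new_files = []
    for dirpath, _, filenames in os.walk(CRAWL_DIR):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > cutoff:
                new_files.append(path)

    print("{} files modified since the cutoff".format(len(new_files)))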

4 comments:

  1. Here is the Alchemy web site: http://www.alchemyapi.com/

    They do all sorts of things. In addition to cleanly extracting text from HTML, they also have some cool NLP technology. Here's some info on how they do "content scraping": http://www.alchemyapi.com/api/scrape

    And here's a demo of their NLP tools:
    http://www.alchemyapi.com/api/demo.html

    Replies
    1. Early results are very promising. Here's a screenshot:
      https://jshare.johnshopkins.edu/jlangfu1/public_html/alchemytest.png

      I will look more into this!

  2. How many pages did each of your web crawls yield? You can use the unix command

    find directoryName | wc -l

    to get a count of all of the files and directories that you collected.

    Replies
    1. A crawl I ran before I increased the crawl time had 94063 files. I've been tinkering with the scripts, so new crawls haven't run in a few days. I will get back to you with an updated number soon.
