Monday, July 29, 2013

Classification

Preliminary Results Look Good

Using an adapted version of Hilary Mason's script, I've started experimenting with classification.

Training Data

I ran a crawl last week of pages in Slate.com's database of articles about people who have been killed by guns since Newtown. Today, I used my old text-extraction script to scrape the text from those pages and use it as training data. This yielded 1,548,856 words of gun-related text.

Classification

I picked a fairly tricky sentence from one of the gun articles to classify: "Williams said that although she didn’t know Shuford well, he was friends with her son. Detectives do not know of a motive in the crimes".

At first, I copied the text of a few gun-related articles into a file called "guns" which was about the same size as the training data for my other two categories: "arts" and "sports" (provided by Hilary Mason). The classifier gave "guns" a higher probability, but within an order of magnitude of the other categories.

I then dumped all of the training data that I had collected into "guns" and ran the classifier again. This time, the test sentence was categorized into "guns" with a probability almost 3 orders of magnitude greater than the next closest probability (see below). Note: this classifier uses Porter stemming.
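For reference, here is a minimal sketch of the kind of bag-of-words Naive Bayes scoring with Porter stemming described above. This is not Hilary Mason's script; the use of NLTK's PorterStemmer, the add-one smoothing, and the category file names are my own assumptions.

import math
from collections import Counter
from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def tokenize(text):
    # lowercase, split on whitespace, and Porter-stem each token
    return [stemmer.stem(w) for w in text.lower().split()]

def train(category_files):
    # category_files maps a category name to a file of training text
    counts = {}
    for category, path in category_files.items():
        with open(path) as f:
            counts[category] = Counter(tokenize(f.read()))
    return counts

def log_scores(sentence, counts):
    # add-one smoothed log probability of the sentence under each category
    vocab = set().union(*counts.values())
    scores = {}
    for category, words in counts.items():
        total = sum(words.values()) + len(vocab)
        scores[category] = sum(
            math.log((words[token] + 1.0) / total) for token in tokenize(sentence))
    return scores

counts = train({'guns': 'guns', 'arts': 'arts', 'sports': 'sports'})
print(log_scores("Detectives do not know of a motive in the crimes.", counts))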

Update: Better Graphs

Better Graphs

These graphs show in a little more detail what our crawls have been yielding. While the frequency of new URLs per site does seem to decrease hyperbolically, the pattern is consistent, i.e. the total number of new URLs is about the same for each crawl, and the distribution looks about the same as well.

The graphs

For these graphs, I have omitted frequencies <= 5. On average, around 950 sites per crawl have <= 5 new pages.
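Something like the following matplotlib sketch reproduces this kind of histogram; the new_url_counts list is just a placeholder for the real per-site counts.

import matplotlib.pyplot as plt

# number of new URLs found for each crawled site (placeholder data)
new_url_counts = [0, 0, 3, 12, 47, 110, 256, 375, 410, 620]

# drop the low-frequency sites (<= 5 new pages), as in the graphs above
filtered = [n for n in new_url_counts if n > 5]

plt.hist(filtered, bins=20)
plt.xlabel('New URLs per site')
plt.ylabel('Number of sites')
plt.title('New URLs per site (counts <= 5 omitted)')
plt.savefig('new_url_histogram.png')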


Why So Many Zeroes?

After digging through the crawls, I uncovered why so many sites have zero new pages. The reason is that any time there's a broken or moved link, wget either doesn't download anything, or else downloads an "index.html" file of the moved page and then stops. Thus, there are no pages downloaded for that site, meaning no new pages when the pages are checked for redundancy.
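One way to spot these cases is to flag site directories that contain nothing, or only a lone index.html. A rough sketch (the snapshot directory path is a placeholder):

import os

def flag_empty_sites(crawl_root):
    # report site directories with no downloaded pages, or only a stray index.html
    for site in sorted(os.listdir(crawl_root)):
        site_dir = os.path.join(crawl_root, site)
        if not os.path.isdir(site_dir):
            continue
        pages = [os.path.join(dirpath, name)
                 for dirpath, _, filenames in os.walk(site_dir)
                 for name in filenames]
        if not pages:
            print('%s: nothing downloaded (broken link?)' % site)
        elif len(pages) == 1 and os.path.basename(pages[0]) == 'index.html':
            print('%s: only index.html (moved link?)' % site)

flag_empty_sites('7_29_2013_npsnap/MD')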

As for the sites with very few new URLs, I can't find anything wrong with those crawls. It just seems that those sites don't update their content as regularly.

Wednesday, July 24, 2013

Graphs: Update

UPDATE:

The following frequency histogram shows the frequency of new URLs across the crawled websites for the crawl that started on 7/23. Critically, this histogram ignores 0 values for frequency. I will write more about this later. Also, apologies for using a different scale than before. I'm still learning how to do this.

Graphs

The Data So Far

Below are three graphs which show the frequency of new URLs for the sites that we've crawled. The total number of URLs is 2478. A URL is considered "new" if it has not appeared in a previous crawl. Each subsequent crawl is compared to all previous crawls. Surprisingly, the number of sites with 0 new URLs is quite high for each graph. I will have to see whether this is an actual signal or due to some error in my code.

This graph shows that, ignoring the frequency of 0 new URLs, the distribution is fairly mound-shaped with a mean around 375. This is expected, since this was the first crawl, so all the URLs would have been new.

This graph shows that the frequency of sites with many new URLs declines rapidly, which is what we would expect.

This final graph shows an even steeper decline in the frequency of new URLs, which is what we would expect, since the database is growing.

These results are just preliminary, and I intend to spend some time now checking to see if I'm counting everything correctly. Particularly, I am going to investigate the high frequency of pages with 0 new URLs.

Tuesday, July 23, 2013

Update: Counting

Good News and Bad News

So far the page counting has been mostly successful. Mostly.

The Good News

As mentioned in my previous post, the script that counts new pages removes duplicate files. The good news is that I was looking at “new” files from each crawl, and about 80% of them look like articles. That means with minimal screening we should be able to start extracting data.

The Bad News

The bad news is that in testing this script, I accidentally removed the contents of all of the snapshots from before July 11th, and also the one from the 13th. To be fair, this only consisted of five crawls, and they were done with the previous version of the crawling script. That means they hadn’t run for as long, and so had many fewer pages compared to the newer crawls. I felt stupid, but I don’t think it’ll hurt us too much.

On the Bright Side

The database of all previously downloaded pages is a flat tsv file which contains full relative paths to each of the pages it indexes. Each line has the following format:

[Base website URL]    [Page path]    [Website Path (relative)]

So, for example, one line might look like this:

www.baltimoresun.com    www.baltimoresun.com/classified/realestate/whosonmyblock/index.html    7_17_2013_npsnap/MD/Baltimore/

A simple query to this database would yield full paths for every file downloaded (combining the second and third fields, using the first field as a key). A script run from the directory containing all the crawls could easily retrieve all or selected pages for processing.
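A minimal sketch of such a query, run from the directory containing the crawls (the database file name pages.tsv and the function name are placeholders):

import csv
import os

def full_paths(db_path, base_url=None):
    # yield the full relative path of every indexed page,
    # optionally restricted to a single base website URL
    with open(db_path) as db:
        for site, page_path, crawl_dir in csv.reader(db, delimiter='\t'):
            if base_url is None or site == base_url:
                yield os.path.join(crawl_dir, page_path)

for path in full_paths('pages.tsv', base_url='www.baltimoresun.com'):
    print(path)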

Once I’ve finished indexing the crawls, I’m going to start extracting pages. In the meantime, I’m learning about machine learning and Alchemy.

Monday, July 22, 2013

Counting

I spent a few hours this weekend going through Python tutorials on Codecademy. While most of the stuff was old hat to me, it was useful to get a sense of what other people consider good style. I also learned a few neat tricks, such as list slicing and lambda functions. Anyway, on with the rest of the post.

Counting Pages

I spent much of last week developing a script that would count the number of new pages in each crawl compared to previous crawls of the same website. This approach is in answer to the mirroring problem previously discussed. Since it seems impractical to extract new content by mirroring each site, we will be continuing with the crawls every other day, from which we will create a list of files that appear in the newest crawl but not in previous crawls. We assume that, given the crawls are of sufficient depth, this method will yield mostly new articles.

So far, the results are encouraging: we seem to be gathering several hundred new URLs per site between crawls. Additionally, the script that counts new pages also removes redundant files to save disk space. A tsv file keeps a record of every URL the counting script has seen, and new crawls are compared against this database.
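At its core, the comparison is a set difference against the URLs already recorded in that database. A simplified, serial sketch (not the actual counting script; file and directory names are placeholders):

import os

def load_seen_urls(db_path):
    # the second tab-separated field of each line is the page path/URL
    with open(db_path) as db:
        return set(line.split('\t')[1] for line in db if line.strip())

def new_pages(crawl_dir, seen):
    # walk a freshly downloaded crawl and keep only pages not seen in earlier crawls
    fresh = []
    for dirpath, _, filenames in os.walk(crawl_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            url = os.path.relpath(path, crawl_dir)
            if url in seen:
                os.remove(path)   # redundant copy: reclaim the disk space
            else:
                fresh.append(url)
    return fresh

seen = load_seen_urls('pages.tsv')
print(len(new_pages('7_22_2013_npsnap/MD/Baltimore/www.baltimoresun.com', seen)))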

Looking Ahead

Now that it seems that our crawls are yielding usable data, we can start extracting the text. To do this, I'll be learning how to make calls to the Alchemy API using Python.
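From the little I've read so far, a call might look roughly like the sketch below. The endpoint name, parameters, and response field are assumptions to be checked against the Alchemy documentation, you need to register for an API key, and the requests library is assumed to be available.

import requests  # assumed to be installed

API_KEY = 'YOUR_ALCHEMY_API_KEY'  # placeholder
ENDPOINT = 'http://access.alchemyapi.com/calls/url/URLGetText'  # assumed endpoint

def extract_text(page_url):
    # ask Alchemy for the cleaned article text of a page
    response = requests.get(ENDPOINT, params={
        'apikey': API_KEY,
        'url': page_url,
        'outputMode': 'json',
    })
    return response.json().get('text', '')

print(extract_text('http://www.baltimoresun.com/news/'))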

I will also begin teaching myself about machine learning. In the next few days, I plan to buy Hilary Mason's 'An Introduction to Machine Learning with Web Data'. That should keep me busy for a while.

Tuesday, July 16, 2013

Python List Comprehension!

This:

while not all([x.condition for x in obj_array]):
    do_something()  # keep working until every object's condition is met


...is so powerful!
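For instance, a polling loop over a batch of jobs (a made-up Job class, just to show the idiom):

import random

class Job(object):
    def __init__(self):
        self.condition = False   # e.g. "has this job finished?"

    def poll(self):
        # pretend to check the queue; flip to done at random
        if random.random() < 0.3:
            self.condition = True

obj_array = [Job() for _ in range(5)]

while not all([x.condition for x in obj_array]):
    for x in obj_array:
        if not x.condition:
            x.poll()

print('all jobs done')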

That is all.

Monday, July 15, 2013

Mirroring

The last week has seen some real challenges in setting up the daily crawls. Most of the challenge comes from having to satisfy the following requirements:

  1. Capture New Content
    • We are only interested in capturing new content as it hits the web. Older pages will be picked up in separate Deep Crawls
  2. Timeliness
    • The snapshots need to capture new content daily (or every other day)

Wget: Not an ideal tool

The problem is that wget takes a long time to mirror a site, partly because it proceeds in a breadth-first fashion. Since the articles we're looking for are typically deep in the directory structure of the sites we're crawling, wget needs a long time to find them. This also means it scoops up a large number of pages we don't want, not to mention re-downloading pages we already have.

Additionally, many websites have auto-generated content, which causes wget to tie itself in knots (e.g. downloading each page of a calendar from now until eternity).

In order to combat these woes, I spent a considerable amount of time last week reading forums, advice boards and wget's own formidable man page in order to better understand the tool.

I discovered that wget was designed with time-stamping functionality that, in theory, allows it to only download pages that are new or have been updated since a local copy was created. Unfortunately, this only works if certain information is present in the file header, and since more than 98% of the pages I tested lacked this information, wget defaults to the time of download, which is not very useful.
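As a sanity check on that header requirement, one can fire a HEAD request at a handful of pages and look for Last-Modified. A sketch, assuming the requests library; the URLs are placeholders:

import requests  # assumed to be installed

urls = [
    'http://www.baltimoresun.com/',
    'http://www.oregonlive.com/',
]

for url in urls:
    headers = requests.head(url, allow_redirects=True, timeout=10).headers
    # wget -N can only compare timestamps when the server sends Last-Modified
    print('%s: %s' % (url, headers.get('Last-Modified', 'no Last-Modified header')))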

On the bright side, I learned about a lot of useful options, such as the --spider flag which, when combined with the -r (recursive download) flag, will have wget create a file with all the URLs on a particular site without actually downloading any content. This got me thinking about a possible solution to our problem.

(As a side note, I did look into tools other than wget for the downloads. I found that, as poorly suited as wget is to the task at hand, every other tool would be even worse.)

Not Total Kludge...

The current plan is to:

  1. Run wget --spider -r on each website in our list
  2. Create a list of all links that point to files (excluding 'index.html')
  3. Compare this list to the list of files downloaded in the last crawl
  4. Run a non-recursive wget on each of the unique links
  5. Save output to a time-stamped directory system which preserves location information (city, state, etc.)
I worry about the inelegance of this solution, both from an efficiency standpoint and from the standpoint of someone else being able to implement it in the future. Since I couldn't think of any other viable solution, however, I began implementing this approach.
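In rough outline, the per-site logic looks something like this simplified, serial sketch (the real script hands the wget jobs to qsub and writes into the time-stamped, location-preserving directory layout; the function names and depth limit here are mine):

import re
import subprocess

def spider_urls(site):
    # wget --spider -r walks the site and logs the URLs it finds without downloading them
    subprocess.call(['wget', '--spider', '-r', '-l', '5', '-o', 'spider.log', site])
    with open('spider.log') as log:
        found = set(re.findall(r'https?://\S+', log.read()))
    return set(u for u in found if not u.endswith('index.html'))

def load_previous(blacklist_path):
    # URLs already downloaded in earlier crawls, one per line
    with open(blacklist_path) as f:
        return set(line.strip() for line in f if line.strip())

def fetch_new(site, blacklist_path, out_dir):
    new_urls = spider_urls(site) - load_previous(blacklist_path)
    for url in new_urls:
        # non-recursive wget on each unique link, saved under the snapshot directory
        subprocess.call(['wget', '-P', out_dir, url])
    return new_urls

fetch_new('http://www.baltimoresun.com/', 'blacklist.txt', '7_15_2013_npsnap/MD/Baltimore')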

I have been testing what I've written as I go along, and at this point I think I'm fairly close to a working script. There is a problem, however.

Wget Strikes Again!

In order for the approach that I have outlined to work, we will need one initial Deep Crawl from which to generate the first Blacklist of URLs. As a test, I started mirroring www.oregonlive.com on Friday. It hasn't stopped yet.

Possible solutions involve using the -R option, which tells wget to reject files whose names match certain patterns.

Other Notes

In moments of frustration I have turned to other tasks, which I will briefly discuss now:

I made a major change to the way that jobs are scheduled in the crawling scripts I've been using. Previously, the script used qstat -j to check the job status of every wget job that had been submitted. The purpose of this was to limit the number of jobs running at any one time, but this method was causing delays of up to a minute as the list of jobs previously submitted increased to over 2000. Now, if a job has been completed, the script removes it from the list of job numbers to check, which has eliminated the delay.
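The change amounts to something like the following sketch (assuming an SGE-style queue where qstat -j on a finished job exits non-zero; the helper names and job numbers are placeholders):

import os
import subprocess

def job_finished(job_id):
    # qstat -j <id> exits non-zero once the scheduler no longer knows about the job
    with open(os.devnull, 'w') as devnull:
        return subprocess.call(['qstat', '-j', str(job_id)],
                               stdout=devnull, stderr=devnull) != 0

def prune_finished(active_jobs):
    # drop completed jobs so later status checks only touch jobs still in the queue
    return [job_id for job_id in active_jobs if not job_finished(job_id)]

active_jobs = [123401, 123402, 123403]   # placeholder job numbers
active_jobs = prune_finished(active_jobs)
print('%d jobs still running' % len(active_jobs))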

I have started migrating the language crawls I did for Ann earlier this year away from the old scripts that have been managing them to the new, Object Oriented scripts. The new scripts aren't working just yet, but they're getting there. The purpose of this is to make it possible for someone down the road to use my scripts as a black box crawling utility. Currently, they would have to go in and edit numerous lines of code in the main script.

In the vain hope that they will be useful, I have been maintaining the daily snapshots that are currently running. I wonder, though, if I should begin removing files we probably aren't going to use.

Monday, July 8, 2013

Crawls, Crawls, Crawls

Last week, Ann and I met with Chris to discuss short-term and long-term goals. After spending a few days thinking about how I'm going to proceed, and trying out a few of the options, I'm ready to report.

Long Term Goals

The overall structure of the project moving forward is as follows:

Finish Setting up the Crawls

The first order of business is to make sure all the data is coming in as we want it. That means having deep, historical crawls for each of our newspapers, as well as daily snapshots which contain only newly-created content.

Text Extraction

Once we have the data rolling in, we have to make it usable. Chris suggested using an online service called Alchemy. Apparently, it's tailored to just the kind of job we're doing, since it specializes in extracting text from html pages, even if they aren't well- or consistently-written. Chris suggested using this approach over lxml and BeautifulSoup since the structure of the pages won't necessarily be known, and these tools are mostly parsers. Even though we may lose some content (short bits of text like bylines, for instance), Alchemy will probably be our best bet. I have yet to play around with it, however.

Classifying the Data: Using Machine Learning

Once we have workable data, Chris and Ann will help me use some black-box machine learning tools to automatically pick out which articles have to do with gun violence. This will involve picking good keywords, choosing training data and no doubt some hand-annotation. My favorite.

Mechanical Turk

Once we have a fairly reliable way of picking out articles concerning gun violence, the next step will be to set the Turkers loose on the data. This will require brushing up on my Java skills, learning about XML and figuring out how Apache Ant works. The questions that we will be asking the Turkers will probably be simple at first (e.g., 'Does this article concern an act of gun violence?'), but will, with luck, eventually seek to answer some of the questions I outlined in my previous post.

Short Term Goals

A journey of a thousand miles begins with a single step.

In-Depth Crawls

Using the Object Oriented Python scripts I have developed, I have begun some in-depth crawls. First, I mirrored 3 sample websites: Joe Nocera's 'Gun Report' blog, The Baltimore Sun, and The Oregonian, my hometown paper.

These crawls went pretty well, so I expanded the scope to the entire list of websites we got from Marcus at Newspapermap.com. This proved less successful; only pages from a handful of states downloaded.

I think part of the problem is that the crawls didn't have enough time to run. Each website was given less than a minute to mirror, so I changed the crawl frequency to every 2 days instead of every day. I also added a logfile to the crawl which stores information such as the initial parameters of the crawl, and then information about each wget job that is submitted to qsub, including a timestamp. If increasing the time limit doesn't help, I'm hoping this will let me get to the bottom of the problem.
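The logging itself is nothing fancy; it amounts to something like the sketch below, using Python's logging module (the field values are illustrative placeholders):

import logging

logging.basicConfig(filename='crawl.log', level=logging.INFO,
                    format='%(asctime)s %(message)s')

# record the initial parameters of the crawl once, at startup
logging.info('crawl started: frequency=every %d days, time limit per site=%ds', 2, 120)

# then one line per wget job handed to qsub
logging.info('submitted qsub job %s for %s', '123401', 'www.oregonlive.com')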

Snapshots

Setting up the snapshots may prove more challenging. Since we only care about the most recent content, time is a limiting factor. Mirroring each site completely would not only use up disk space, but would take too much time for daily or even every-other-day crawls.

I have read through the wget man page to see if we can use the timestamps of the external pages to determine whether we should download them. It looks like wget's mirroring functionality combined with its timestamping may be useful; however, the timestamping seems to work in only a rudimentary fashion. wget -N www.foo.com/bar will check a local version of www.foo.com/bar and download the server version if either the local version doesn't exist or the server version has been modified more recently. This is all well and good, except that we don't want to keep an up-to-date copy of the sites we're mirroring; we want to capture *new* content. At this point I see several ways of proceeding:

  1. Decrease snapshot frequency to allow for full-site mirroring. Later, we'll use Python scripts and timestamps to get only new articles
  2. Screw around with wget for a while to see if we can get the functionality we're looking for, i.e. it does the timestamp-checking for us.
    • The problem here is that I don't yet know how to get wget to check remote timestamps against files in crawls/foo/ but download the new files in directory crawls/bar/.
  3. Use a different download tool, such as curl. Currently, I don't think curl is installed on the nodes to which I have access. I also haven't done any research yet on curl's functionality.

Tuesday, July 2, 2013

Back In Baltimore

And Now for Something...

I'm back in Baltimore, and Ann has me working on a slightly different project today. I will be annotating a data set of Latin bird names to see if they translate well into their English common names. If this goes well and I finish by 5pm or so, I will do some work on the newspaper crawls I started a fortnight ago.

So far my favorite bird name is the "pyrohypogaster" which translates to "fire below the throat" from Latin.