Wednesday, July 24, 2013

Graphs

The Data So Far

Below are three graphs that show, for each crawl, how many sites had a given number of new URLs. The total number of URLs is 2478. A URL is considered "new" if it has not appeared in any previous crawl; each subsequent crawl is compared against all previous crawls. Surprisingly, the number of sites with 0 new URLs is quite high in each graph. I will have to see whether this is an actual signal or due to some error in my code.
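To make the counting concrete, here is a minimal sketch of the idea (in Python, not the actual counting script; the data layout is made up for illustration): each crawl's URLs for a site are compared against the union of that site's URLs from all earlier crawls.

    # Illustrative sketch only -- assumes each crawl is a dict mapping site -> set of URLs.
    def count_new_urls(crawls):
        """For each crawl, count per-site URLs not seen in any earlier crawl."""
        seen = {}            # site -> all URLs observed in earlier crawls
        new_counts = []      # one dict per crawl: site -> number of new URLs
        for crawl in crawls:
            counts = {}
            for site, urls in crawl.items():
                previous = seen.setdefault(site, set())
                counts[site] = len(urls - previous)  # URLs not in any earlier crawl
                previous |= urls                     # fold this crawl into the history
            new_counts.append(counts)
        return new_counts

    # Example with two toy crawls of one site:
    crawl1 = {"example.com": {"example.com/a", "example.com/b"}}
    crawl2 = {"example.com": {"example.com/b", "example.com/c"}}
    print(count_new_urls([crawl1, crawl2]))  # [{'example.com': 2}, {'example.com': 1}]

In the first crawl every URL counts as new, which matches the first graph below.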

The first graph shows that, ignoring the spike at 0 new URLs, the distribution is fairly mound-shaped with a mean of around 375. This is expected: since this was the first crawl, all of the URLs would have been new.

The second graph shows that the frequency of sites with many new URLs declines rapidly, which is what we would expect.

The final graph shows an even steeper decline in the frequency of new URLs, which is what we would expect, since the database of previously seen URLs keeps growing.

These results are just preliminary, and I intend to spend some time now checking that I'm counting everything correctly. In particular, I am going to investigate the high frequency of sites with 0 new URLs.

2 comments:

  1. Too bad about all of the 0s, though this is exactly what I expected but hoped we wouldn't see!

    Are the last two plots the number of new URLs given all of the previous crawls? This is discouraging: it means we're getting new content from very few sites. Could you pick some of the 0s, investigate their websites, and try to understand why wget doesn't find anything new (e.g. maybe new content is too deep in the tree, or maybe there is no new content)?

  2. I have been doing some investigating. Part of the reason for all the 0s, I think, is that wget doesn't have enough time to crawl all the sites; I'm inferring this from the fact that the sites with 0 new pages are the same from crawl to crawl. Also, there are some problems with the list of URLs: the same domain will sometimes be listed for multiple cities, with sub-pages for each city ('www.statenews.com/thiscity' and 'www.statenews.com/thatcity'). This wreaks a mild form of havoc with the counting script at the moment (see the sketch at the end of this comment).

    Also, I played around in R for a while yesterday, and there seem to be 2 groups of sites: ones with < 10 new pages and ones with > 100 new pages. A lot of the sites in the first bin of the histogram have somewhere between 1 and 20 new pages. These graphs were quick and dirty - I will post some more representative ones soon.

    Additionally, while it seems like we have a lot of 0s, the actual number of new URLs is quite high: it looks like over 500 sites have over 100 new pages each. So that's good news.

    I'll keep you posted.
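    For what it's worth, one fix I'm considering for the shared-domain problem is to key sites by domain plus the first path segment instead of by domain alone. A quick sketch (Python, not the actual counting script):

        from urllib.parse import urlsplit

        def site_key(url):
            """Key a URL by (domain, first path segment) so that
            www.statenews.com/thiscity and www.statenews.com/thatcity
            count as two different sites rather than one."""
            parts = urlsplit(url if "//" in url else "//" + url)
            path = parts.path.strip("/")
            first_segment = path.split("/")[0] if path else ""
            return (parts.netloc, first_segment)

        print(site_key("www.statenews.com/thiscity/article1"))  # ('www.statenews.com', 'thiscity')
        print(site_key("www.statenews.com/thatcity/article2"))  # ('www.statenews.com', 'thatcity')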
