The Data So Far
Below are three graphs which show the frequency of new URLs for the sites that we've crawled. The total number of URLs is 2478. A URL is considered "new" if it has not appeared in a previous crawl. Each subsequent crawl is compared to all previous crawls. Surprisingly, the number of sites with 0 new URLs is quite high for each graph. I will have to see whether this is an actual signal or due to some error in my code.
This graph shows that, ignoring the frequency of 0 new URLs, the distribution is fairly mound-shaped with a mean around 375. This is expected, since this was the first crawl, so all the URLs would have been new.
These results are just preliminary, and I intend to spend some time now checking to see if I'm counting everything correctly. Particularly, I am going to investigate the high frequency of pages with 0 new URLs.