
Nova Spivack said that the crawls have been going for several years. There's a good chance that many of the pages in the archive are unacceptably outdated for indexing purposes.


Hi. I work for Common Crawl. We are about to start an improved recrawl and will be doing this more frequently going forward. In the process we will also consolidate our data on S3 to keep it relevant. But, as with any crawl of the Internet, there is a lot of noise in there. We spent most of 2011 tweaking the algorithms to improve the freshness and quality of the crawl, and hopefully this work starts to show results in 2012.


I'm not sure whether major portions of their archive are unacceptably outdated.

But I am sure that it would be a logic failure to conclude that it must be out of date simply because they've been indexing for several years. By that logic, Google would be even further out of date, having indexed for over a decade.



