Ask YC: Help with how to save data from crawler
5 points by groovyone on Nov 8, 2008 | 6 comments
Hi all!

We are creating a web spider/crawler for experimenting with classification of sites. We're at the point where we can't make a decision, and I was hoping someone out there who has done this kind of thing before might help us.

Here's what we are doing as an experiment:

1. Taking 1-2 million domain names and crawling the index page plus approx. 10 internal pages, based on whatever links we get

2. The above will be done politely, so as not to overload the target servers, and the spider will have a link back to us.

3. We want to run 3-4 downloaders on separate machines, and we are using either Twisted or Pyro to do this (rough sketch below)
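
For concreteness, here's roughly what one downloader looks like (a minimal sketch, not our actual code -- the user-agent link, delay, and timeout are placeholders):

    # Minimal polite-fetch sketch using Twisted's getPage; values are placeholders.
    from twisted.internet import defer, reactor, task
    from twisted.web.client import getPage

    AGENT = 'experiment-crawler (+http://example.com/crawler)'  # hypothetical link back to us
    DELAY = 2.0   # seconds between requests to one host, for politeness

    @defer.inlineCallbacks
    def fetch_site(domain, internal_paths, max_pages=10):
        """Fetch the index page plus up to max_pages internal pages."""
        pages = {}
        for path in ['/'] + internal_paths[:max_pages]:
            url = 'http://%s%s' % (domain, path)
            pages[url] = yield getPage(url, agent=AGENT, timeout=30)
            yield task.deferLater(reactor, DELAY, lambda: None)  # wait before the next request
        defer.returnValue(pages)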

The above bit we're OK with, and it's done. We think we have two options for the next stage. Either:

- Push all downloaded data into MySQL, for our parser machine to access and parse/classify

or

- Each downloader saves the data as files on its own disk, and our parsing machine then pulls this information across the network (sketched below)
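
To make the second option concrete, each downloader would do something like this (the spool path and naming scheme are just for illustration):

    # Sketch of option two: spool pages to local disk; the parser box pulls them later.
    import os

    SPOOL_DIR = '/var/spool/crawler'   # hypothetical local disk on each downloader

    def save_page(domain, page_no, url, body):
        site_dir = os.path.join(SPOOL_DIR, domain)
        if not os.path.isdir(site_dir):
            os.makedirs(site_dir)
        with open(os.path.join(site_dir, '%02d.html' % page_no), 'wb') as f:
            f.write(body)                                  # raw page bytes
        with open(os.path.join(site_dir, 'urls.txt'), 'a') as idx:
            idx.write('%02d\t%s\n' % (page_no, url))       # remember which URL is which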

I can't find much information about how search engines save their data (into a database, or just locally on the filesystem), and we feel that deciding on the correct path here is fundamental.

Any help or advice appreciated. Even criticism :)

John



When you are working on big problems, it's sometimes easy to let yourself get stuck on some unimportant decision. Usually it's a sign that you are unsure of something more important but you don't want to think about it.

If you just want to run an experiment on 10M pages, then use whatever you feel comfortable with. The important thing is NOT files vs sql but whether your classification idea is worth spending time on. Who cares if it's inefficient? That's not what your experiment is about.


Smells like 500GB of data. I'd keep the crawled data in filesystems on the crawling boxes. Then you can load your MySQL database, and when it fails because <<insert-unforeseeable-circumstance>> you can take another shot at loading it from your data.

After you resign yourself to working with a subset of the data in MySQL, you will learn how to compute what you really want to know, and you can write a fast processor that just scans the spooled data on your search machines and puts that into the database instead of the raw data.

[[edit: maybe 500GB instead of 5TB, got a little crazy on my zero key in bc]]
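
Rough math behind that guess (the ~25KB average page size is my assumption, not a number from the thread):

    # Back-of-the-envelope estimate of total crawl size.
    domains = 2e6
    pages_per_domain = 11        # index page + ~10 internal pages
    avg_page_bytes = 25e3        # assumed average HTML page size
    print(domains * pages_per_domain * avg_page_bytes / 1e9)   # ~550 GB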


I agree that initially you should dump it to a local filesystem. Since this is an experiment you don't want to get bogged down in DB performance details.

Also, if HD space is a concern, occasionally tar/zip up a bunch of the data. HTML is very redundant and I'd bet you could squeeze 500GB of HTML down to < 50GB, even more if you have a lot of pages from the same site.
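
For example, rolling a finished batch directory into one archive (the path is made up):

    # Compress one spooled batch into a single .tar.gz; gzip squeezes HTML well.
    import tarfile

    with tarfile.open('/var/spool/crawler/batch_0001.tar.gz', 'w:gz') as tar:
        tar.add('/var/spool/crawler/batch_0001')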

Really, a lot of this depends on what resources you have available and how you want to process the data later on. If you are classifying pages independently of one another then why bother pooling them to a centralized DB? Just run your classifier on each node and pool those results instead.

An alternative solution is S3, which I've used for crawl storage before. It's not ideal for data processing, since you have to constantly pull data over the network, but it's an easy way to get centralized storage.
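
e.g. with the boto library's S3 API (the bucket name and credentials are made up):

    # Sketch of centralizing raw pages in S3 using boto.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection('ACCESS_KEY', 'SECRET_KEY')
    bucket = conn.create_bucket('crawl-experiment-pages')   # hypothetical bucket

    def store_page(url, body):
        k = Key(bucket)
        k.key = url                        # use the URL itself as the object key
        k.set_contents_from_string(body)

    def load_page(url):
        k = bucket.get_key(url)
        return k.get_contents_as_string() if k else None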


Take a look at what is out there.

If you run a simple crawl with Heritrix (for example), you'll notice it stores everything in 'ARC' files, which are basically compressed (gzip) archives with an index for accessing individual records (via offsets).
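
You can do something similar yourself without a database -- append compressed records to one big file and keep an offset index (just the idea, not the real ARC format):

    # Append zlib-compressed pages to one file; the index maps url -> (offset, length).
    import zlib

    def append_record(data_path, index, url, body):
        blob = zlib.compress(body)
        with open(data_path, 'ab') as f:
            offset = f.tell()
            f.write(blob)
        index[url] = (offset, len(blob))

    def read_record(data_path, index, url):
        offset, length = index[url]
        with open(data_path, 'rb') as f:
            f.seek(offset)
            return zlib.decompress(f.read(length))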

I would avoid sticking everything in a database, although you could probably get away with it -- but I agree with aristus that it probably doesn't matter at this point.

Another idea: you could look at a static HTML dump of Wikipedia and see how they structure their tree (three-letter prefixes).
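
For example (hashed prefixes rather than Wikipedia's literal article-name prefixes, so no single directory ends up with millions of files):

    # Fan pages out across directories by hash prefix.
    import os
    from hashlib import md5

    def page_path(root, url):
        h = md5(url.encode('utf-8')).hexdigest()
        return os.path.join(root, h[:3], h[3:6], h + '.html')   # e.g. root/1a2/b3c/1a2b3c....html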

On the flip side, having it in the DB will probably be easier in terms of managing it (one place to back up), and possibly an easier way of splitting the workload across multiple boxes -- e.g. three boxes could query the DB, suck down all pages for a couple hundred domains, do the processing, and insert when done.
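
Something like this (table and column names are invented, and I'm skipping the locking you'd need so two boxes don't claim the same batch):

    # Sketch of one worker box pulling a batch of unprocessed pages from MySQL.
    import MySQLdb

    conn = MySQLdb.connect(host='dbhost', user='crawler', passwd='secret', db='crawl')

    def process_batch(classify, batch_size=200):
        cur = conn.cursor()
        cur.execute("SELECT url, body FROM pages WHERE processed = 0 LIMIT %s",
                    (batch_size,))
        for url, body in cur.fetchall():
            label = classify(body)      # your classifier, whatever it is
            cur.execute("UPDATE pages SET label = %s, processed = 1 WHERE url = %s",
                        (label, url))
        conn.commit()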


> I can't find much information about how search engines save their data (into a database, or just locally on the filesystem), and we feel that deciding on the correct path here is fundamental.

There are other possibilities, including both local and remote datastores that aren't really databases.

However, their approach doesn't matter because your problem is significantly (>1000x) smaller and different (for one, you're not running continuously).


I don't know what the big search engines do, but storing the data in a database for the single purpose of parsing it later sounds a bit unnecessary. If you are only using the database as storage, the filesystem will do a better job.



