Ask YC: Help with how to save data from crawler
5 points by groovyone on Nov 8, 2008 | 6 comments
Hi all!

We are creating a web spider/crawler for experimenting with classification of sites. We're at the point where we can't make a decision, and I was hoping someone out there who has done this kind of thing before might help us.

Here's what we are doing as an experiment:

1. Taking 1-2 million domain names and crawling the index page plus approx. 10 internal pages, based on whatever links we get

2. The above will be done politely, so as not to overload the target servers, and the spider will have a link back to us.

3. We want to run 3-4 downloaders on separate machines, and we are using either Twisted or Pyro to do this (rough sketch below)
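
For concreteness, here's roughly what one downloader looks like (a minimal sketch, not our actual code -- the user-agent link, delay, and timeout are placeholders):

    # Minimal polite-fetch sketch using Twisted's getPage; values are placeholders.
    from twisted.internet import defer, reactor, task
    from twisted.web.client import getPage

    AGENT = 'experiment-crawler (+http://example.com/crawler)'  # hypothetical link back to us
    DELAY = 2.0   # seconds between requests to one host, for politeness

    @defer.inlineCallbacks
    def fetch_site(domain, internal_paths, max_pages=10):
        """Fetch the index page plus up to max_pages internal pages."""
        pages = {}
        for path in ['/'] + internal_paths[:max_pages]:
            url = 'http://%s%s' % (domain, path)
            pages[url] = yield getPage(url, agent=AGENT, timeout=30)
            yield task.deferLater(reactor, DELAY, lambda: None)  # wait before the next request
        defer.returnValue(pages)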

The above bit we're OK with, and it's done. We think we have two options for the next stage. Either:

- Push all downloaded data into MySQL, for our parser machine to access and parse/classify

or

- Each downloader saves the data as files on its own disk, and our parsing machine then pulls this information across the network (sketched below)
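
To make the second option concrete, each downloader would do something like this (the spool path and naming scheme are just for illustration):

    # Sketch of option two: spool pages to local disk; the parser box pulls them later.
    import os

    SPOOL_DIR = '/var/spool/crawler'   # hypothetical local disk on each downloader

    def save_page(domain, page_no, url, body):
        site_dir = os.path.join(SPOOL_DIR, domain)
        if not os.path.isdir(site_dir):
            os.makedirs(site_dir)
        with open(os.path.join(site_dir, '%02d.html' % page_no), 'wb') as f:
            f.write(body)                                  # raw page bytes
        with open(os.path.join(site_dir, 'urls.txt'), 'a') as idx:
            idx.write('%02d\t%s\n' % (page_no, url))       # remember which URL is which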

I can't find much information about how search engines save their data (into a database, or just locally on the filesystem), and we feel that deciding on the correct path here is fundamental.

Any help or advice appreciated. Even criticism :)

John



When you are working on big problems, it's sometimes easy to let yourself get stuck on some unimportant decision. Usually it's a sign that you are unsure of something more important but you don't want to think about it.

If you just want to run an experiment on 10M pages, then use whatever you feel comfortable with. The important thing is NOT files vs sql but whether your classification idea is worth spending time on. Who cares if it's inefficient? That's not what your experiment is about.


Smells like 500GB of data. I'd keep the crawled data in filesystems on the crawling boxes. Then you can load your MySQL database, and when it fails because <<insert-unforeseeable-circumstance>> you can take another shot at loading it from your data.

After you resign yourself to working with a subset of the data in MySQL, you will learn how to compute what you really want to know, and you can write a fast processor that just scans the spooled data on your search machines and puts that into the database instead of the raw data.

[[edit: maybe 500GB instead of 5TB, got a little crazy on my zero key in bc]]
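
Rough math behind that guess (the ~25KB average page size is my assumption, not a number from the thread):

    # Back-of-the-envelope estimate of total crawl size.
    domains = 2e6
    pages_per_domain = 11        # index page + ~10 internal pages
    avg_page_bytes = 25e3        # assumed average HTML page size
    print(domains * pages_per_domain * avg_page_bytes / 1e9)   # ~550 GB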


I agree that initially you should dump it to a local filesystem. Since this is an experiment you don't want to get bogged down in DB performance details.

Also, if HD space is a concern, occasionally tar/zip up a bunch of the data. HTML is very redundant and I'd bet you could squeeze 500GB of HTML down to < 50GB, even more if you have a lot of pages from the same site.
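
For example, rolling a finished batch directory into one archive (the path is made up):

    # Compress one spooled batch into a single .tar.gz; gzip squeezes HTML well.
    import tarfile

    with tarfile.open('/var/spool/crawler/batch_0001.tar.gz', 'w:gz') as tar:
        tar.add('/var/spool/crawler/batch_0001')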

Really, a lot of this depends on what resources you have available and how you want to process the data later on. If you are classifying pages independently of one another then why bother pooling them to a centralized DB? Just run your classifier on each node and pool those results instead.

An alternative solution is S3, which I've used for crawl storage before. It's not ideal for data processing, since you have to constantly pull data over the network, but it's an easy way to get centralized storage.
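
e.g. with the boto library's S3 API (the bucket name and credentials are made up):

    # Sketch of centralizing raw pages in S3 using boto.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection('ACCESS_KEY', 'SECRET_KEY')
    bucket = conn.create_bucket('crawl-experiment-pages')   # hypothetical bucket

    def store_page(url, body):
        k = Key(bucket)
        k.key = url                        # use the URL itself as the object key
        k.set_contents_from_string(body)

    def load_page(url):
        k = bucket.get_key(url)
        return k.get_contents_as_string() if k else None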


Take a look at what is out there.

If you run a simple crawl with Heritrix (for example), you'll notice it stores everything in 'ARC' files, which are basically compressed (gzip) archives with an index for accessing individual records (via offsets).
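
You can do something similar yourself without a database -- append compressed records to one big file and keep an offset index (just the idea, not the real ARC format):

    # Append zlib-compressed pages to one file; the index maps url -> (offset, length).
    import zlib

    def append_record(data_path, index, url, body):
        blob = zlib.compress(body)
        with open(data_path, 'ab') as f:
            offset = f.tell()
            f.write(blob)
        index[url] = (offset, len(blob))

    def read_record(data_path, index, url):
        offset, length = index[url]
        with open(data_path, 'rb') as f:
            f.seek(offset)
            return zlib.decompress(f.read(length))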

I would avoid sticking everything in a database, although you could probably get away with it -- but I agree with aristus that it probably doesn't matter at this point.

Another idea: you could look at a static HTML dump of Wikipedia and see how they structure their tree (three-letter prefixes).
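
For example (hashed prefixes rather than Wikipedia's literal article-name prefixes, so no single directory ends up with millions of files):

    # Fan pages out across directories by hash prefix.
    import os
    from hashlib import md5

    def page_path(root, url):
        h = md5(url.encode('utf-8')).hexdigest()
        return os.path.join(root, h[:3], h[3:6], h + '.html')   # e.g. root/1a2/b3c/1a2b3c....html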

On the flip side, having it in the DB will probably be easier in terms of managing it (one place to back up), and possibly an easier way of splitting the workload across multiple boxes -- e.g. three boxes could query the DB, suck down all pages for a couple hundred domains, do the processing, and insert when done.
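
Something like this (table and column names are invented, and I'm skipping the locking you'd need so two boxes don't claim the same batch):

    # Sketch of one worker box pulling a batch of unprocessed pages from MySQL.
    import MySQLdb

    conn = MySQLdb.connect(host='dbhost', user='crawler', passwd='secret', db='crawl')

    def process_batch(classify, batch_size=200):
        cur = conn.cursor()
        cur.execute("SELECT url, body FROM pages WHERE processed = 0 LIMIT %s",
                    (batch_size,))
        for url, body in cur.fetchall():
            label = classify(body)      # your classifier, whatever it is
            cur.execute("UPDATE pages SET label = %s, processed = 1 WHERE url = %s",
                        (label, url))
        conn.commit()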


> I can't find much information about how search engines save their data (into a database, or just locally on the filesystem), and we feel that deciding on the correct path here is fundamental.

There are other possibilities, including both local and remote datastores that aren't really databases.

However, their approach doesn't matter because your problem is significantly (>1000x) smaller and different (for one, you're not running continuously).


I don't know what the big search engines do, but storing the data in a database for the single purpose of parsing it later sounds a bit unnecessary. If you are only using the database as storage, the filesystem will do a better job.



