Hi all!
We are building a web spider/crawler to experiment with classifying sites. We've reached a point where we can't make a decision, and I was hoping someone out there who has done this kind of thing before could help us.
Here's what we are doing as an experiment:
1. Take 1-2 million domain names and crawl the index page plus approximately 10 internal pages per site, based on whatever links we find.
2. The crawling will be done politely, so as not to overload the target servers, and the spider's User-Agent will include a link back to us (a rough sketch of what we mean by "polite" follows after this list).
3. We want to run 3-4 downloaders on separate machines, using either Twisted or Pyro to do this.
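To make point 2 concrete, here is a minimal, sequential sketch of the politeness rules for one downloader, using only the standard library (the real downloaders sit inside Twisted/Pyro, and the spider-info URL and delay value are just placeholders, not our actual settings):

```python
# Sketch of per-domain politeness: check robots.txt, identify ourselves
# with a User-Agent that links back to us, and pause between requests.
import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "ClassifierSpider/0.1 (+http://example.com/spider-info)"  # placeholder URL
CRAWL_DELAY = 2.0  # seconds between requests to the same domain (assumed value)

def allowed(url):
    """Check robots.txt before fetching anything from the domain."""
    parts = urllib.parse.urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False
    return rp.can_fetch(USER_AGENT, url)

def fetch(url):
    """Download one page, identifying ourselves in the User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

def crawl_domain(index_url, internal_urls):
    """Fetch the index page plus up to 10 internal pages, politely."""
    pages = {}
    for url in [index_url] + internal_urls[:10]:
        if allowed(url):
            pages[url] = fetch(url)
        time.sleep(CRAWL_DELAY)  # be polite: pause between requests
    return pages
```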
We're OK with the above bit and it's done. We think we have two options for the next stage. Either:
- Push all downloaded data into MySQL for processing, so our parser machine can access and parse/classify it
or
- Each downloader saves the data as files on its own disk, and our parsing machine then pulls that data across the network (a rough sketch of this option follows below)
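For the second option, this is one possible shape of "save to local disk": each downloader writes gzip-compressed pages under a per-domain directory, named by a hash of the URL, plus a small tab-separated index so the parser machine knows which file belongs to which URL. The spool path and layout here are just assumptions for illustration, not a settled design:

```python
# Sketch of the file-per-page storage option on a downloader machine.
import gzip
import hashlib
import os

SPOOL_DIR = "/var/spool/crawler"  # placeholder path on the downloader box

def save_page(domain, url, html_bytes):
    """Write one crawled page to disk and record it in a per-domain index."""
    domain_dir = os.path.join(SPOOL_DIR, domain)
    os.makedirs(domain_dir, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(domain_dir, name + ".html.gz")
    with gzip.open(path, "wb") as f:
        f.write(html_bytes)
    # Append to a per-domain index so the parser can map files back to URLs.
    with open(os.path.join(domain_dir, "index.tsv"), "a", encoding="utf-8") as idx:
        idx.write(f"{name}\t{url}\n")
    return path
```

The parser machine would then pull the spool directories across the network (rsync or similar). The MySQL option would be the same idea with an INSERT per page instead of a file write.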
I can't find much accessible information about how search engines store their crawled data (in a database or just on the filesystem), and we feel that picking the correct path here is a fundamental decision.
Any help or advice appreciated. Even criticism :)
John
If you just want to run an experiment on 10 million pages, then use whatever you feel comfortable with. The important thing is NOT files vs. SQL but whether your classification idea is worth spending time on. Who cares if it's inefficient? That's not what your experiment is about.