
You can build a generic crawler that pulls pages from sites quickly and then process the pages offline with whatever language you'd like. It's better to do the crawling in a distributed way. Plus, there are standards that you need to comply with when crawling someone's website, like not crawling them too fast and checking their robots.txt file to make sure that you're only crawling "allowable" pages. Then once you've pulled their data down, you process it offline and do whatever you need to do with it. It's not a simple procedure, but it's doable if you're willing to spend some time doing it properly.
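
For what it's worth, here is a rough sketch of the "polite" part in Python, using only the standard library. It is just one way to do it, not a finished crawler: the names `RateLimiter`, `crawl`, and `example-crawler` are made up for illustration, the `fetch` callable is something you'd supply yourself (an HTTP client, a queue worker, whatever), and the two-second default delay is an arbitrary assumption.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse


def make_robot_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Hypothetical helper: enforces a minimum gap between hits to one host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time of the previous request

    def wait(self, url):
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_hit[host] = self._clock()


def crawl(urls, parser, fetch, user_agent="example-crawler",
          default_delay=2.0, limiter=None):
    """Fetch only the URLs robots.txt allows, pausing between requests.

    Returns {url: raw page}; parsing is left for a later offline step.
    """
    if limiter is None:
        # Use the site's Crawl-delay if it declares one, else our assumed default.
        limiter = RateLimiter(parser.crawl_delay(user_agent) or default_delay)
    pages = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, skip it
        limiter.wait(url)
        pages[url] = fetch(url)
    return pages
```

The idea is that `crawl` only collects raw pages, so you can dump them to disk and do the actual processing later in any language. For a distributed setup you'd run something like this per worker and shard URLs by host so each host's rate limit lives in one place.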

