
Well, having a few websites of my own, I really do think that point 1 is the worst. I can't filter bots that disguise themselves as users out of my access logs, and they actually hurt my work (i.e. figuring out what people read).

Totally agree with the rest, though. Maybe I'd adapt the "large delay" of 1 second to the kind of website I'm scraping.

Thanks for your feedback!



> I can't filter bots that disguise themselves as users out of my access logs, and they actually hurt my work (i.e. figuring out what people read).

If the bots aren't querying from residential IPs, you could match their IPs to ASNs and then filter on that to separate residential from data center origins.
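
A rough sketch of that lookup, in case it helps. Everything in the two tables is made up for illustration: the prefixes and ASNs come from the ranges reserved for documentation, and `PREFIX_TO_ASN`, `DATACENTER_ASNS`, `asn_for` and `is_datacenter` are hypothetical names. In practice you would fill the tables from whatever prefix-to-ASN source you trust.

```python
import ipaddress

# Hypothetical prefix -> ASN table (documentation ranges and ASNs only).
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64496,
    ipaddress.ip_network("198.51.100.0/24"): 64497,
}

# Hypothetical set of ASNs you have decided belong to hosting providers.
DATACENTER_ASNS = {64496}


def asn_for(ip):
    """Return the ASN of the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [
        (net.prefixlen, asn)
        for net, asn in PREFIX_TO_ASN.items()
        if addr in net
    ]
    # The longest prefix wins when several networks contain the address.
    return max(matches)[1] if matches else None


def is_datacenter(ip):
    """True if the IP maps to an ASN on the data center list."""
    return asn_for(ip) in DATACENTER_ASNS
```

Requests for which `is_datacenter` returns true can then be dropped from the statistics. A linear scan like this is fine for a handful of prefixes but would be slow against a full routing table.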


Ha, that's a good idea! Is there a list somewhere of the CIDR blocks that are assigned to residential networks vs server farms? I mean, how can I tell whether an IP is residential?


The other way around may be easier, i.e. excluding known datacenter ranges. There are some commercial databases for that; I'm not sure if there are any free ones. But you can also do this manually by running a whois on an IP, extracting the ranges from the whois response and caching them. Then you can look at the OrgName or something like that. You can also download the whois databases from the RIRs, but they don't tell you what kind of entity holds a given range.

    $ dig +short reddit.com
    151.101.1.140


    $ whois 151.101.1.140

    NetRange:       151.101.0.0 - 151.101.255.255
    CIDR:           151.101.0.0/16
    OrgName:        Fastly
    [...]

So if you see a known hosting provider here, you can exclude its range from your statistics.
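
Pulling those fields out of the response could look roughly like this. It is only a sketch that assumes ARIN-style output with the `CIDR:` and `OrgName:` labels shown above; other registries label things differently (RIPE uses `inetnum:`, for example), so the pattern would need adjusting. `parse_whois` and `SAMPLE` are names I made up.

```python
import ipaddress
import re

# The relevant lines from the whois response quoted above.
SAMPLE = """\
NetRange:       151.101.0.0 - 151.101.255.255
CIDR:           151.101.0.0/16
OrgName:        Fastly
"""


def parse_whois(text):
    """Extract the OrgName and CIDR blocks from ARIN-style whois text."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^\s*(CIDR|OrgName):\s*(.+?)\s*$", line)
        if m:
            # Keep the first occurrence of each field.
            fields.setdefault(m.group(1), m.group(2))
    # A CIDR line may list several comma-separated blocks.
    cidrs = [
        ipaddress.ip_network(part.strip())
        for part in fields.get("CIDR", "").split(",")
        if part.strip()
    ]
    return fields.get("OrgName"), cidrs


org, cidrs = parse_whois(SAMPLE)
```

With the sample above, `org` is the organization name and `cidrs` holds one network, so `ipaddress.ip_address("151.101.1.140") in cidrs[0]` tells you whether a logged IP falls inside it.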


What I've done in the past is to pull down all the IPs of the requests I see, filter them down to unique ones, run a whois on each of them (you're going to need a backoff/rate limit here, as whois services are usually rate limited) and save the organization name, ASN and CIDR blocks. Then filter for uniqueness again, create a new list with the organizations of interest and match it against the CIDR blocks. Now you have an allow/blocklist you can use.
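
Sketched in code, that workflow might look something like the following. This is not the exact script, just an illustration: `build_blocklist` is a made-up name, and `lookup` is a hypothetical callable you would supply (for example a wrapper around the `whois` command) that returns an organization name and a list of networks for an IP. The ASN is left out to keep it short, and the fixed delay stands in for proper backoff.

```python
import ipaddress
import time


def build_blocklist(ips, lookup, orgs_of_interest, delay=1.0, sleep=time.sleep):
    """Return the networks belonging to the organizations of interest.

    lookup(ip) is a hypothetical callable returning
    (org_name, [ip_network, ...]) for one IP address.
    """
    known = {}  # ip_network -> org name; doubles as a cache
    for ip in sorted(set(ips)):  # filter by unique
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in known):
            continue  # already covered by a cached block, skip the whois call
        org, nets = lookup(ip)
        for net in nets:
            known[net] = org
        sleep(delay)  # crude fixed delay between whois queries
    selected = [net for net, org in known.items() if org in orgs_of_interest]
    # Sort by IP version first so IPv4 and IPv6 networks never get compared.
    return sorted(selected, key=lambda net: (net.version, net))
```

Checking cached blocks before each lookup matters more than it looks: once one IP from a /16 has been resolved, every other IP in that range costs no whois query at all.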


There are some GeoIP databases that will denote whether an IP belongs to an end-user network, and whether that network is fixed (DSL/cable) or mobile.



