Well, having a few websites of my own, I really do think that point 1 is the worst. I can't filter bots that disguise themselves as users out of my access logs, and they actually hurt my work (i.e. figuring out what people read).
Totally agree with the rest, though. I'd maybe adapt the "large delay" of 1 second to the kind of website I'm scraping.
> I can't filter bots that disguise themselves as users out of my access logs, and they actually hurt my work (i.e. figuring out what people read).
If the bots aren't querying from residential IPs, you could match their IPs to ASNs and then filter on that to separate residential from data center origins.
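Roughly what I mean, as a minimal sketch: it assumes you already have some prefix-to-ASN table and a hand-picked set of hosting ASNs. `PREFIX_TO_ASN`, `DATACENTER_ASNS` and the function names are made up for illustration, and the prefixes and ASNs are documentation-only placeholders, not real data.

```python
import ipaddress

# Hypothetical prefix-to-ASN table; real data would come from an
# IP-to-ASN dataset. These are RFC 5737 documentation prefixes and
# RFC 5398 documentation ASNs, used purely as placeholders.
PREFIX_TO_ASN = {
    "192.0.2.0/24": 64496,
    "198.51.100.0/24": 64497,
    "203.0.113.0/24": 64498,
}

# ASNs that we decided (by hand) belong to hosting providers.
DATACENTER_ASNS = {64496, 64497}

NETWORKS = [(ipaddress.ip_network(p), asn) for p, asn in PREFIX_TO_ASN.items()]


def asn_for(ip):
    """Return the ASN of the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [(net.prefixlen, asn) for net, asn in NETWORKS if addr in net]
    return max(matches)[1] if matches else None


def is_datacenter(ip):
    """True if the IP maps to an ASN in our data center set."""
    return asn_for(ip) in DATACENTER_ASNS


def filter_log_ips(ips):
    """Keep only IPs that do not map to a known data center ASN."""
    return [ip for ip in ips if not is_datacenter(ip)]
```

IPs with no known ASN are kept here, which errs on the side of not throwing away real readers; flip that if you'd rather be strict.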
Ha, that's a good idea! Is there a list somewhere of the CIDR blocks that are assigned to residential ISPs vs. server farms? I mean, how can I tell whether an IP is residential?
The other way around may be easier, i.e. excluding known datacenter ranges. There are some commercial databases for that; I'm not sure if there are any free ones. But you can also do this manually by running a whois on an IP, extracting the ranges from the whois response, and caching them. Then you can look at the orgname or something like that. You can also download the whois databases from the RIRs, but they don't tell you what kind of entity each holder is.
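A minimal sketch of the extract-and-cache part, assuming ARIN-style field names (`CIDR:`, `OrgName:`); other registries label these fields differently, so the parser would need adjusting. `parse_whois`, `WhoisCache` and the sample response are made up for illustration.

```python
import ipaddress


def parse_whois(text):
    """Extract the org name and CIDR blocks from ARIN-style whois text."""
    org, cidrs = None, []
    for line in text.splitlines():
        # Split on the first colon only, so IPv6 values stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "orgname" and org is None:
            org = value
        elif key == "cidr":
            # The CIDR field may hold several comma-separated blocks.
            cidrs += [ipaddress.ip_network(c.strip()) for c in value.split(",")]
    return org, cidrs


class WhoisCache:
    """Remembers which network belongs to which org to avoid repeat lookups."""

    def __init__(self):
        self._entries = []  # list of (network, org) pairs

    def add(self, text):
        org, cidrs = parse_whois(text)
        for net in cidrs:
            self._entries.append((net, org))

    def lookup(self, ip):
        """Return the cached org for this IP, or None if it is not covered."""
        addr = ipaddress.ip_address(ip)
        for net, org in self._entries:
            if addr in net:
                return org
        return None


# Made-up response using an RFC 5737 documentation range.
SAMPLE = """\
NetRange:       192.0.2.0 - 192.0.2.255
CIDR:           192.0.2.0/24
OrgName:        Example Hosting Inc.
"""
```

Checking the cache before each whois call means one response covers every other IP in the same block.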
What I've done in the past is to pull down all the IPs of the requests I see, filter them down to unique ones, and do a whois for each of them (you're gonna need a backoff/rate limit here, as whois services are usually rate limited), saving the organization name, ASN and CIDR blocks. Filter those for uniqueness again, then create a new list with the organizations of interest and match it against the CIDR blocks. Now you have an allow/blocklist you can use.
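Something along these lines, as a rough sketch rather than the exact script I used. The whois call itself is left out: `lookup` is a placeholder for whatever function you write that returns an `(org, asn, cidr)` tuple for an IP and raises `RateLimited` when the server throttles you. All names here are hypothetical.

```python
import time


class RateLimited(Exception):
    """Raised by the lookup function when the whois server throttles us."""


def lookup_with_backoff(lookup, ip, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call lookup(ip), doubling the wait after each rate-limit error."""
    for attempt in range(retries):
        try:
            return lookup(ip)
        except RateLimited:
            sleep(base_delay * 2 ** attempt)
    return None  # gave up on this IP


def build_blocklist(request_ips, lookup, orgs_of_interest, sleep=time.sleep):
    """Return the sorted, de-duplicated CIDR blocks of the chosen orgs."""
    records = set()
    for ip in sorted(set(request_ips)):  # one lookup per unique IP
        result = lookup_with_backoff(lookup, ip, sleep=sleep)
        if result is not None:
            records.add(result)  # (org, asn, cidr); the set drops repeats
    return sorted({cidr for org, asn, cidr in records if org in orgs_of_interest})
```

The `sleep` parameter is only there so you can swap in a no-op while testing; in real use the default `time.sleep` applies.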
Thanks for your feedback!