That's nice, but I don't see much value in learning about sockets for scraping; ...

Toxygene · on May 15, 2022

> as regex is a lot faster than parsing html

This person would like a word with you -- https://stackoverflow.com/a/1732454

:D

pedrovhb · on May 15, 2022

Well, yes - he's saying "regex is not appropriate for parsing html", and I'm saying "regex is faster than parsing html" - they're not contradictory statements, and both are true :)

To be clear, I'm not talking about building a syntax tree or a way to generically extract elements based on a CSS path selector. I'm saying if you're only interested in a couple of data points in a 3 MB HTML document, and you're sure they're always between some other specific text or even tags, then it's more efficient to use a simple regex than it is to parse the entire thing, which is computationally expensive when running over a large number of large files.
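
A minimal sketch of that trade-off, assuming a made-up page where the value of interest always sits inside `<span id="price">`. The page contents, the `price` id, and both helper names are hypothetical, and the parser side uses the standard library's `html.parser` as a stand-in for whatever parser you'd actually reach for:

```python
import re
from html.parser import HTMLParser

# Hypothetical document: a lot of irrelevant markup around one data point.
FILLER = "<div class='row'><span>filler</span></div>\n" * 1000
PAGE = (
    "<html><body>" + FILLER
    + '<span id="price">19.99</span>'
    + FILLER + "</body></html>"
)

# Only valid while the surrounding markup stays exactly this stable.
PRICE_RE = re.compile(r'<span id="price">([^<]+)</span>')


def price_via_regex(html):
    """Scan for the known delimiters and stop at the first match."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None


class PriceParser(HTMLParser):
    """Tokenizes the whole document just to find one span."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("id", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and self.price is None:
            self.price = data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def price_via_parser(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.price
```

Both helpers return the same string for this page; wrapping each call in `timeit.timeit(...)` is one way to measure the difference on your own documents. The regex version breaks silently if the attribute order, quoting, or whitespace changes, which is the "constant enough structure" caveat.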

hashmush · on May 15, 2022

There's a big difference between parsing HTML and

> using regex to parse data when the data you're scraping has a constant enough structure

Regex is fine, just don't parse the HTML itself.

harshreality · on May 15, 2022

What percentage of web scraper routines resort to regex when they should at least start with xpath or some equivalent parser?

melenaboija · on May 15, 2022

The first comment says a lot about it:

> I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death

matheusmoreira · on May 15, 2022

I love this answer so much. I'm surprised it hasn't been deleted yet like many of my other favorites.

pfranz · on May 15, 2022

For personal projects I generally follow the steps this blog lays out: start with the light, low-level APIs and work my way up as needed. I do usually skip raw sockets when I start, but I think knowing them is worthwhile for troubleshooting and optimizing. I often find myself jumping between levels when scraping, from HTTP headers to JavaScript rendering. While you can touch most of those things with requests, I find it easier to reproduce exactly what I see my browser doing with lower-level APIs. The backend might be tightly coupled to the front end, so you might get stuck on a specific header, user-agent string, or something else, often related to sessions or login.
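
As a rough illustration of the lowest level mentioned above, here is a sketch that sends a hand-built HTTP/1.1 request over a plain socket so every header is under your control. The function names, default header values, and the example cookie are assumptions for illustration; it covers plain HTTP on port 80 only (HTTPS would additionally need the socket wrapped with the `ssl` module):

```python
import socket


def build_request(host, path="/", headers=None):
    """Return the exact bytes of a GET request, header by header."""
    merged = {
        "User-Agent": "Mozilla/5.0 (placeholder; copy the real one from your browser)",
        "Accept": "*/*",
        "Connection": "close",  # ask the server to close so recv() hits EOF
    }
    merged.update(headers or {})
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    lines += [f"{name}: {value}" for name, value in merged.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_raw(host, path="/", headers=None, port=80, timeout=10.0):
    """Send the request and return the raw response bytes (status line, headers, body)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path, headers))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Hypothetical session cookie copied from the browser's dev tools.
    raw = fetch_raw("example.com", "/", {"Cookie": "session=PLACEHOLDER"})
    print(raw.split(b"\r\n\r\n", 1)[0].decode("latin-1"))
```

Printing `build_request(...)` next to the request shown in the browser's network tab makes it easy to spot which header the backend is actually keying on.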