That's nice, but I don't see much value in learning about sockets for scraping; it's way too low-level. The lowest level I've found useful is making the requests with requests/httpx and parsing the data with regex, when the data you're scraping has a consistent enough structure and you're scraping a large number of pages, as regex is a lot faster than parsing HTML.
I'd add that it's often worth spending some time looking around the website for ways of getting the data you're after other than the obvious one. sitemap.xml sometimes gives useful hints.
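As a rough illustration of the sitemap.xml idea, here's a minimal sketch that pulls every `<loc>` URL out of a sitemap you've already downloaded. It assumes the site follows the standard sitemaps.org schema; `sitemap_urls` is just a name I made up, and fetching the file is left to whatever HTTP client you already use.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the standard sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    # ".//sm:loc" matches <loc> under both <urlset> and <sitemapindex>.
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The URL patterns in that list are often the hint: they can reveal listing pages, ID schemes, or whole sections that aren't linked from the obvious navigation.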
Another golden trick is to learn to reverse engineer mobile app APIs with mitmproxy or something like it. Nowadays it's kind of a pain to do since Android has been locking things down more and more, but it's still quite possible. Apps very often provide endpoints that give you structured data when the web version is server-rendered HTML only, have fewer anti-scraping measures and less rate limiting, and even provide data that isn't available at all in the web version.
Well, yes - he's saying "regex is not appropriate for parsing html", and I'm saying "regex is faster than parsing html" - they're not contradictory statements, and both are true :)
To be clear, I'm not talking about building a syntax tree or a way to generically extract elements based on a CSS path selector. I'm saying if you're only interested in a couple of data points in a 3 MB HTML document, and you're sure they're always between some other specific text or even tags, then it's more efficient to use a simple regex than it is to parse the entire thing, which is computationally expensive when running over a large number of large files.
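To make that concrete, here's a minimal sketch of the kind of thing I mean. The `<span class="price">` marker is purely hypothetical; the whole approach assumes you've checked that the value always sits between the same surrounding text on every page you scrape.

```python
import re
from typing import Optional

# Hypothetical marker: assumes the value is always wrapped in exactly this tag.
# Compiled once so it can be reused across a large number of documents.
PRICE_RE = re.compile(r'<span class="price">\s*([^<]+?)\s*</span>')


def extract_price(html: str) -> Optional[str]:
    """Return the text between the known markers, or None if they are absent."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None
```

No tree gets built here; the regex just scans the text until the first match. The None case is worth logging, since it's usually the first sign that the site's markup changed and the assumption no longer holds.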
> I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death
For personal projects I generally follow the steps this blog lays out: start with the light, low-level APIs and work my way up as needed. I do usually skip over raw sockets when I start, but I think knowing them is worthwhile for troubleshooting and optimizing. I often find myself jumping between levels when scraping, from HTTP headers to JavaScript rendering. While you can touch most of those things with requests, I find it easier to reproduce exactly what I see my browser doing with lower-level APIs. The backend might be tightly coupled to the front end, so you might get stuck on a specific header, user-agent string, or something else, often related to sessions or login.
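For what it's worth, here's a rough sketch of what I mean by reproducing the browser at a lower level: build the request bytes yourself from the headers you copied out of the browser's network tab, then send them over a TLS socket. The function names are my own, and this assumes plain HTTP/1.1 with `Connection: close`, so no keep-alive, chunked-encoding handling, or HTTP/2.

```python
import socket
import ssl
from typing import Dict


def build_raw_request(host: str, path: str, headers: Dict[str, str]) -> bytes:
    """Assemble a GET request with headers in exactly the order given."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    lines.append("Connection: close")  # ask the server to close when done
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_raw(host: str, raw: bytes, port: int = 443) -> bytes:
    """Send the raw bytes over TLS and return the unparsed response."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(raw)
            chunks = []
            while True:
                chunk = tls.recv(65536)
                if not chunk:  # server closed the connection
                    break
                chunks.append(chunk)
    return b"".join(chunks)
```

Because nothing is added or reordered behind your back, you can drop headers one at a time until the response changes, which is usually how I find the one the backend actually cares about.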