Web Scraping with Python

kaycebasques · on May 15, 2022

I think the overall software architecture approach of this post is fundamentally backwards. Given how much of the web is rendered client-side these days you need to start out with a headless option. Headless means that you fire up a true browser and then automate the actions that you need to perform. It's indistinguishable from a real person using a browser. If you try to use urllib3 on a webpage that does heavy client-side rendering then you're going to get incomplete HTML returned from the server (because the website intends to use JavaScript to complete the rendering of the page). On the more rare occasions when you are dealing with static HTML (i.e. there is no rendering on the client; the HTML returned from the server is the complete content) then you can use something like urllib3.

Re: which headless library to use this post mentions Selenium which was one of the first headless libs but from what I've heard probably not the best (in terms of developer experience or reliability or robustness) but that's only hearsay... I've never used Selenium myself. Playwright seems like the best option in town if you want to use Python. Built by the former Chrome DevTools team (meaning those people really know how browser internals work). https://playwright.dev/python/docs/intro

samwillis · on May 15, 2022

From my experience headless browser scraping is on the order of 100x slower due to increased bandwidth, CPU and memory usage. I would seriously suggest the opposite: start with traditional scraping and if you can’t make it work then go headless. The difference in speed is to the point that going a longer route (maybe having to scrape 10x more pages, for example) to reach the content without headless will probably still end up quicker.

On the memory side, with headless you end up with far more memory leaks, having to manage stopping and starting new browser instances while maintaining scraping state. The devops overhead is probably 10x more with headless.

taosx · on May 15, 2022

Agreed (especially on the devops overhead). Some options to lower the bandwidth usage using request blocking:

- block ads
- block videos, images
- block CSS stylesheets
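A minimal sketch of the request-blocking idea above, assuming the headless library exposes each request's resource type and URL. The `should_block` helper, the set of blocked types, and the ad-host suffixes are hypothetical choices for illustration, not something the comment specifies.

```python
from urllib.parse import urlparse

# Hypothetical choices for illustration; tune them per site.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}
BLOCKED_HOST_SUFFIXES = ("doubleclick.net", "googlesyndication.com")


def should_block(resource_type: str, url: str) -> bool:
    """Return True if a request should be aborted to save bandwidth."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    host = urlparse(url).hostname or ""
    # Match the ad host itself or any of its subdomains.
    return any(host == s or host.endswith("." + s) for s in BLOCKED_HOST_SUFFIXES)
```

With Playwright's sync API this could be wired up roughly as `page.route("**/*", lambda route: route.abort() if should_block(route.request.resource_type, route.request.url) else route.continue_())`.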

In the past I've built scraper infrastructure (headless pools, credential stores, proxy managers, agent profiles) and managed to get a pretty efficient service by tracking CPU, memory, and network usage per job and writing specialized versions. I got pretty far trying to automatically generate specialized scrapers from previous requests but I moved to other projects.

Scraping becomes boring really fast if you don't use the data in meaningful ways.

elorant · on May 16, 2022

If you don't load images, and you disable 3rd party JavaScript you can save a lot of loading time with a headless. I'm using a preconfigured profile in Firefox with uMatrix installed and sites load at least 50% faster. Add another 20-30% from not loading images, and suddenly using a headless becomes affordable.

MonaroVXR · on May 16, 2022

What is traditional?

datalopers · on May 15, 2022

Headless scraping quickly becomes a very expensive approach when you try and scale the effort. I only employ it when absolutely necessary. And it’s most definitely distinguishable by any modern (incapsula, perimeterx, cloudflare) WAF.

apienx · on May 15, 2022

The Apify library tries to address most of these issues (I'd say quite successfully). Apify.com provides a platform you can deploy the scrapers on. And yes, at scale.

There's also a software marketplace where you can order custom scrapers. 98% of the projects ran thru it have a 5-star rating (Disclaimer: I moderate that marketplace). Pro-tip: submit your project with a Gmail address to skip sales and reach me directly.

pocket_cheese · on May 15, 2022

A federated marketplace for scrapers is an idea I have thought considerably about. If you have time, I would love to chat to a) discuss being a paying user and b) to talk about the industry and see how we could submit some scrapers.

Let me know how I can reach out to you!

daolf · on May 15, 2022

We've decided to go this route because, based on billions of web-scraped pages, headless-based scraping is still a minority. And it's way harder and more expensive to do at scale.

lapser · on May 15, 2022

Or you could use the API that the web app has to inevitably use.

matheusmoreira · on May 15, 2022

Love this approach. We can just bypass all the normal web scraping and get the structured data straight from the source. These APIs are usually no less stable than the ever changing HTML structure anyways.

Case study: YouTube.js

https://news.ycombinator.com/item?id=31021611

https://github.com/LuanRT/YouTube.js

chrsig · on May 15, 2022

this is assuming that it's documented...otherwise you're just hand evaluating javascript to figure out what it would call...and then you get to thinking that you should just embed a javascript interpreter and evaluate it...and at that point, you've gone down the path of implementing a headless browser.

hombre_fatal · on May 15, 2022

Almost everything uses simple JSON APIs which are far more trivial than html scraping. You also don't need to evaluate the Javascript to figure out what it's doing (something I can't even imagine doing, do you really do this? and where have you done it?), just browse the website normally with your network tab open and look at the endpoints and you're basically done.
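A minimal sketch of what consuming such an endpoint can look like once it has been spotted in the network tab. The endpoint URL in the comment and the payload shape (`items`, `title`, `price`) are assumptions made up for illustration.

```python
import json

# Hypothetical response body, standing in for what an endpoint seen in the
# network tab (say https://example.com/api/products?page=1) might return.
SAMPLE_BODY = '{"items": [{"title": "Widget", "price": 9.5}, {"title": "Gadget", "price": 12}]}'


def extract_items(body: str) -> list[tuple[str, float]]:
    """Pull (title, price) pairs out of the assumed JSON structure."""
    data = json.loads(body)
    return [(item["title"], float(item["price"])) for item in data.get("items", [])]


print(extract_items(SAMPLE_BODY))
```

No HTML parsing or selectors are involved; the structure comes straight from the JSON keys.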

Obfuscated APIs like Pokemon GO and Netflix are in the tiny minority.

camgunz · on May 15, 2022

I just wrote ~20 scrapers and maybe 3 ended up being able to grab data from a JSON API. Mostly what I ran into was (an HTTP API that returns) templated HTML, and wacky Sharepoint stuff. For the Sharepoint stuff, I found I often had to grab tokens out of script tags. Sometimes they were in hidden inputs, but either way is kind of the same thing. I was ready to break out a JS parser, but fortunately I didn't need to.
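A rough sketch of the two token-grabbing cases mentioned above, using only the standard library. The field name `csrf_token`, the variable name `apiToken`, and the sample markup are hypothetical; real pages will need their own names and possibly a sturdier pattern.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden">."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def token_from_script(html: str, var_name: str) -> Optional[str]:
    """Find `var_name = "..."` or `var_name: "..."` in the page source.

    Assumes the token is a plain quoted string literal.
    """
    pattern = rf'{re.escape(var_name)}\s*[:=]\s*["\']([^"\']+)["\']'
    match = re.search(pattern, html)
    return match.group(1) if match else None


# Hypothetical page fragment for illustration.
SAMPLE = """
<form><input type="hidden" name="csrf_token" value="tok-123"></form>
<script>var apiToken = "tok-456";</script>
"""

parser = HiddenInputParser()
parser.feed(SAMPLE)
print(parser.hidden, token_from_script(SAMPLE, "apiToken"))
```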

I did run into Cloudflare DDoS protection and Incapsula, which I will say is pretty irritating and IMO antithetical to the web. Incapsula is so bad I get captcha'd just browsing around in a Firefox private window. If I were polling every few seconds or something I'd get it, but denylisting all AWS IPs or looking for "headless" in the User Agent (or looking at navigator params, testing TLS fingerprints, etc.) is bonkers. It's the laziest kind of upselling from web developers where you're making the site harder to use, but not actually keeping real scrapers out, because they're doing even more JavaScript interventions ahead of the HTTP request and using residential IP proxies.

hombre_fatal · on May 16, 2022

Interesting, I also have a big scraping project and the breakdown is probably like 70% HTML parsing, 25% JSON APIs, 5% weird APIs. I can of course imagine that this pie chart simply depends on the sites/genre/industry you're scraping.

DDoS protection does throw a wrench into the mix, though I don't blame anyone for using it. DDoS protection might seem antithetical to the web, but... so is DDoS and abuse.

Kind of like how being an asshole is antithetical to getting along as a society but you still have to address the reality that there will always be abusers and bad actors. I also think being able to do what you want with your service is a fundamental part of the web incl putting it behind a captcha. It's just part of the beautiful chaos.

camgunz · on May 17, 2022

In the large I don't disagree: people have the right to protect themselves. But I think you've made an argument for caching, not for captchas. I'd even be fine with changing cache-control interpretations to "cache this, because if you come back before it's changed we won't serve it to you again". But this stuff is obvious web developer upsell.

For what it's worth, it didn't even work. Headless Chrome and some editing of the JavaScript environment was all it took. So it's definitely a ripoff.

jjeaff · on May 15, 2022

>Almost everything uses simple JSON APIs

I wish that were true. Maybe most new web projects do. Unfortunately, most web projects are not new.

hombre_fatal · on May 16, 2022

I meant of the websites that aren't server-rendered HTML. If a website is a Javascript application where you can't just curl + xpath into it, I've found that it almost always has a simple JSON API.

lapser · on May 15, 2022

Not really. The API these days tends to be JSON so you can just figure out how it works and what represents what.

For example I've been able to reimplement xmltv scrapers for several sources in fewer than 100 lines with Scrapy. It's not hard, just requires a little discretion.

chrsig · on May 15, 2022

The difficulty isn't in making a scraper for a single site, but rather in the general case.

That is, making a scraper that can be pointed at an arbitrary site not known at the time of development.

vertere · on May 16, 2022

I assumed most people were talking about dealing with single sites. From your previous comment about API documentation and "hand evaluating" Javascript I gathered that you were too. How would those things help one solve the general case?

elorant · on May 15, 2022

From my experience news sites are the one category that requires a headless browser the most. With e-commerce sites, or anything else, it's like 80% of the cases will work with a normal http request.

1vuio0pswjnm7 · on May 15, 2022

"Given how much of the web is rendered client-side these days you need to start out with a headless option."

What does "rendered client-side" mean.

Assuming that "rendered client-side" means interpretation and execution of Javascript is necessary to read a site's textual content, then how much of the web is rendered client-side.

If the focus is on textual content, e.g., someone is primarily "scraping" text as opposed to images and video, I would guess that only a minority of the web is "rendered client-side". How would we prove otherwise.

This guess I am making would not be an uneducated one. I have been accessing the web without using Javascript for over 30 years. Today, I still use a text-only browser to render HTML as text/hypertext. This allows me to read the site's textual content, quickly and easily. I initiate most HTTP requests with TCP clients, not the browser. All requests, whether from TCP client, browser, or otherwise, are made through a localhost forward proxy. If most websites were truly dependent on Javascript, it stands to reason I would not be able to read much of the web. In other words, another web user who reads the web with a Javascript-enabled browser should be able to read websites that I could not read. This has not been the case. In fact, I often see commenters on HN complaining that they cannot read a site that I am having no trouble reading. The culprit is often Javascript.

The truth is that I rarely encounter a site that cannot be read with the text-only browser. For example, I can read the content of almost every site submitted to HN. A very small minority of sites I find are, more or less, empty shells with links to some Javascripts but no textual content for the visitor to read. These "landing pages" expect a Javascript-enabled browser that automatically follows links in the page (e.g., to remote Javascript files), and that retrieves, interprets and executes Javascript automatically and indiscriminately.[FN1] In what some might see as a Rube Goldberg design pattern, the scripts then make HTTP requests to the "real" site. In such cases it generally only takes me a few minutes to find the "real" site, often what some refer to as a "JSON endpoint".[FN2] However this process has not led me to rely on a "headless" browser to read websites.
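One way to put the "empty shell" observation to a test on a given page is to measure how much readable text the server-sent HTML contains before any Javascript runs. The sketch below is a heuristic only; the `looks_like_empty_shell` name and the 200-character threshold are arbitrary assumptions for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Accumulate text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Guess whether the raw HTML carries too little text to be the real content."""
    parser = VisibleTextParser()
    parser.feed(html)
    return len(" ".join(parser.chunks)) < min_chars
```

A page that fails this check is a candidate for hunting down the underlying endpoint rather than proof that a headless browser is required.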

Honestly, if a majority of sites adopted the "JSON endpoint" approach to serving textual content it would make reading websites even easier for me. I could just retrieve JSON and reformat it to a uniform brand of simple HTML that I prefer, as I already do for some sites. I could make the format of all websites 100% identical. IME, a web of uniformly-formatted content is much easier and faster to digest. I would imagine it would be easier for machines to digest as well. The text-only browser I use currently makes the format of all sites look almost the same, since it only uses a single font and so many websites use similar designs. Because it does not automatically follow links or execute Javascript, it also tends to make the "load" time of all sites very similar. For me, this uniformity speeds up the ability to digest web content as compared to using a graphical browser for the same purpose.

FN1. Today we see "modern" browsers incorporating an ever-changing array of "features" and options to try to mitigate the risks of this behaviour.

FN2. Generally, IME, these "endpoints" serve the textual content with minimal markup or sometimes no markup at all. Thus, the end user is free to format the text into whatever design suits their personal tastes. As a website visitor, this is relatively more efficient IMO than trying to read an infinite number of possible "web designs" which is the approach we currently see on today's www. It is more predictable. With the latter approach, visiting a new website with a Javascript-enabled, graphical browser is always a "surprise". It might be easy to read or it might not. Visiting "endpoints" generally does not suffer from this problem.

ohyoutravel · on May 15, 2022

Why do these types of low quality, seemingly spam / SEO articles always rise to the top on weekend mornings? Is it lack of competition? Easier to manipulate votes?

daolf · on May 15, 2022

"low quality".

I'm hurt :(

PS: we spend tens of hours writing those pieces of content and even pay a technical editor to spot typos and make them more readable, since we're not native English speakers. You might not like this post, but I can assure you that genuine care was put into writing this!

degenerate · on May 15, 2022

Scrolling through your article I disagree, it's high quality content. What converts it to "low quality" is the bait-n-switch title. This is not "everything you need to know" -- this is "how to get started from scratch".

Metaphor would be "Everything you need to know about fixing cars" and the article shows you how to check the engine light, change oil, rotate tires, and replace spark plugs. There's just no way to make a promise that large and have your article be considered high quality.

vertere · on May 16, 2022

The title is problematic but I don't see how that justifies someone calling the whole article low quality or spam.

daolf · on May 15, 2022

fair point!

wintercarver · on May 15, 2022

I thought it was a nice summary, concise, organized, with examples and references. Will revisit it should I need a reminder on scraping. Would not call it low quality at all.

Would recommend you ignore passing comments with no constructive criticism. The title is going to be a point of contention as it’s a big claim and probably being misinterpreted as not “everything you need to know [to get started]” but rather “everything you need to know [ever is in this one article and you’ll need not read anything else]”.

ohyoutravel · on May 15, 2022

I don’t think it’s very good and many other highly-rated top level comments seem to agree that not only does it have a scammy SEO “top ten best ${X} in ${CURRENT_YEAR}” title, but there is also a mismatch between what the article is attempting to do and how it is attempting to explain and do it.

While I’m glad it’s not GPT-3 level spam, or outsource to third world country for copy level spam, in my opinion the article fails in several fundamental ways, noted above. Putting “genuine care” into something is commendable, but is not a substitute for quality, relevant content.

OTOH you’re getting lots of clicks and views for whatever product you’re selling, and even my comments help the “traction” HN gives it, so it doesn’t actually matter what I think.

tsukurimashou · on May 15, 2022

change the title to "everything you need to get started..." instead of "everything you need to know" and try to go easy on SEO optimization

most of the negative comments will go away

sixhobbits · on May 15, 2022

It can take 100 hours or more to put together a guide like this. It is painful to do but helps many people, so it's ultimately very rewarding. Every developer I know has learned more from free guides written for free than they have from paid courses, bootcamps, and often university degrees.

But thanks for your contribution I guess..

is_true · on May 15, 2022

I think it's a nice article, just needs an "almost" and everyone should be happy. Did you write it?

sixhobbits · on May 15, 2022

No I believe @daolf did, but as a fellow writer I know how hard negative feedback hurts (and how rare it is for happy readers to comment).

pahn · on May 15, 2022

The article is an overview aimed at beginners, but as such actually pretty good and helpful. It does not seem to overtly promote their product. Classifying this as spam / SEO only because one is not the target group is not fair.

Luc · on May 15, 2022

The article was obviously written for SEO purposes.

pahn · on May 15, 2022

Yes it was, but is it automatically bad because of this? It would be if it were unhelpful or low-quality, or if it contained deceitful content pushing their product – but I do not see any of this.

is_true · on May 15, 2022

Probably a tiny bit of manipulation: https://news.ycombinator.com/item?id=30025005

pfranz · on May 15, 2022

I end up needing to do something like this every few years. Even when I use Python at my day job every day, it's easy to miss changing best practices when it's slightly outside your domain. So, like I did with this article, skim it to see how credible it seems and take note of things that look like they've changed since I last had to do it (and take a mental note in case it comes up in the near future and I need to get something up relatively quickly).

mywaifuismeta · on May 15, 2022

I'm pretty sure there is some manipulation going on to get this article to the top. It's extremely basic and "baity" compared to what you usually find here. I hope HN doesn't turn into the next Medium. It has been getting worse recently, which makes me think people have found better and better ways to create and manage spam/upvote accounts. Perhaps GPT-3 allows them to automate the karma generation for new accounts :)

idk1 · on May 15, 2022

What makes you think this is low quality. I code ruby and not much python but it looked really good to me.

mateuszbuda · on May 15, 2022

Maybe they use their product to generate upvotes O.O

pedrovhb · on May 15, 2022

That's nice, but I don't see much value in learning about sockets for scraping; it's way too low a level. The lowest level I found useful was using requests/httpx for requests and using regex to parse data when the data you're scraping has a constant enough structure and you're scraping a large number of pages, as regex is a lot faster than parsing html.

I'd add that it's often worth spending some time looking at the website for alternate ways than the obvious one of getting the data you're after. sitemap.xml sometimes gives useful hints.
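A minimal sketch of reading page URLs out of a sitemap with the standard library. It assumes a plain `urlset` sitemap in the standard sitemaps.org namespace; the sample document and URLs are made up, and sitemap index files would need one more level of fetching.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap body for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc><lastmod>2022-05-01</lastmod></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


print(sitemap_urls(SAMPLE_SITEMAP))
```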

Another golden trick is to learn reverse engineering mobile app APIs with mitmproxy or something like it. Nowadays it's kind of a pain to do since Android has been locking things down more and more, but it's still quite possible. Apps very often provide endpoints that give you structured data when the web version is server-rendered HTML only, have fewer anti-scraping measures and rate limiting, and even provide data that isn't available at all for the web version.

Toxygene · on May 15, 2022

> as regex is a lot faster than parsing html

This person would like a word with you -- https://stackoverflow.com/a/1732454

:D

pedrovhb · on May 15, 2022

Well, yes - he's saying "regex is not appropriate for parsing html", and I'm saying "regex is faster than parsing html" - they're not contradictory statements, and both are true :)

To be clear, I'm not talking about building a syntax tree or a way to generically extract elements based on a CSS path selector. I'm saying if you're only interested in a couple of data points in a 3 MB HTML document, and you're sure they're always between some other specific text or even tags, then it's more efficient to use a simple regex than it is to parse the entire thing, which is computationally expensive when running over a large number of large files.
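A small sketch of that approach: pull one data point out of a large document with a precompiled pattern instead of building a tree. The `<span class="price">` markup and the `extract_price` helper are hypothetical, and the pattern only holds as long as the surrounding text stays constant.

```python
import re
from typing import Optional

# Hypothetical markup: the price always sits between these two fixed strings.
PRICE_RE = re.compile(r'<span class="price">\s*\$([0-9]+(?:\.[0-9]{2})?)\s*</span>')


def extract_price(html: str) -> Optional[float]:
    """Return the first price found, or None if the markers are absent."""
    match = PRICE_RE.search(html)
    return float(match.group(1)) if match else None
```

Returning `None` on a miss makes it easy to notice when the site's markup has changed and the pattern needs updating.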

hashmush · on May 15, 2022

There's a big difference between parsing HTML and

> using regex to parse data when the data you're scraping has a constant enough structure

Regex is fine, just don't parse the HTML itself.

harshreality · on May 15, 2022

What percentage of web scraper routines resort to regex when they should at least start with xpath or some equivalent parser?

melenaboija · on May 15, 2022

The first comment says a lot about it:

> I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death

matheusmoreira · on May 15, 2022

I love this answer so much. I'm surprised it hasn't been deleted yet like many of my other favorites.

pfranz · on May 15, 2022

For personal projects I generally follow the steps this blog lays out; start with the light and low-level APIs and work my way up as needed. I do usually skip over sending raw sockets when I start, but I think knowing them is worthwhile for troubleshooting and optimizing. I often find myself jumping to different levels when navigating scraping--from http headers to javascript rendering. While you can touch most of those things with requests, I find it easier to reproduce exactly what I see my browser doing with lower level APIs. The backend might be tightly-coupled with the front end. So you might get stuck on a specific header, user-agent string, or something often related to sessions or login.

f311a · on May 15, 2022

That's just a basic introduction. I would not call this article "everything you need to know".

is_true · on May 15, 2022

I think it's actually not that bad. Scraping is a topic that is as broad as the number of sites on the web, a minefield of corner cases.

daolf · on May 15, 2022

Thank you!

stingraycharles · on May 15, 2022

Yeah, it’s as if someone posted an article “everything you need to know about cooking” and just explained the concepts of plates, pans, and some of the kitchen appliances.

I guess the fact that it’s currently very high on the front page of HN kind of confirms this type of post works, though, which is unfortunate.

daolf · on May 15, 2022

Hi there, co-author here.

Always looking to improve the content we're writing here. What else would you have expected to read in such an article?

edent · on May 15, 2022

I think it is good. You have to remember that some of the loudest voices on here take things extremely literally. They have no concept of hyperbole for emphasis. Or, indeed, anything which makes writing interesting to read.

Is your guide everything someone needs to know? No. But anyone literate in the ways of modern English understands what you mean.

It is an excellent guide and I think you should consider expanding it & perhaps creating a book.

Please don't be discouraged by the people on here who don't have the skill or courage to write or submit anything.

CJefferson · on May 15, 2022

Nowadays I would jump straight to selenium (or similar), as most websites feature AJAX or similar, so need a full browser.

Then, you don't actually do anything with selenium, click a button / link, or anything interesting.

pfranz · on May 15, 2022

While it's true I often end up needing something like selenium, it's way more heavy handed and I usually reach for it last. It doesn't scale as well, harder to troubleshoot IMHO, and more libraries and dependencies to deal with in a language where that's already not great.

daolf · on May 15, 2022

Agreed, this is why in the Selenium paragraph we link to this article that goes much more in depth https://www.scrapingbee.com/blog/selenium-python/

danmur · on May 15, 2022

I think it's a pretty good article personally, sounds like the complaint is just about the title :P

chasd00 · on May 15, 2022

What I’ve done is pay very close attention to the network traffic in the browser's dev tools. The data has to get to the browser somehow. Once you’re able to get a session token/cookie then you can figure out what GETs or POSTs you need to get the data you want by watching the requests your browser makes.
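A minimal sketch of replaying a request observed in dev tools, assuming the site authenticates with a cookie named `session` and accepts a JSON body. The cookie name, URL, and payload are hypothetical; the function only builds the request so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_replay_request(url: str, session_cookie: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST that mirrors one seen in the network tab."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Cookie": f"session={session_cookie}",  # assumed cookie name
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


req = build_replay_request("https://example.com/api/search", "abc123", {"q": "shoes"})
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would actually send it.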

mynameismon · on May 15, 2022

Perhaps the only issue I would have with this blogpost is using Postgres. By all means, SQLite can do the exact same thing, just easier for a beginner, since they don't have to wade through a mess of networking. Just add the binary to the PATH and one is good to go.

taosx · on May 15, 2022

Just don't forget to optimize it for writes (WAL-mode...etc) when having lots of sources.
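For reference, a minimal sketch of switching a SQLite database to WAL mode from Python. The temporary file path is only for illustration, and pairing WAL with `synchronous=NORMAL` is a common choice rather than something the comment prescribes.

```python
import os
import sqlite3
import tempfile

# WAL needs an on-disk database; an in-memory one reports "memory" instead.
db_path = os.path.join(tempfile.mkdtemp(), "scrape.db")
conn = sqlite3.connect(db_path)

# The pragma returns the journal mode that is actually in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
# Common pairing with WAL: fewer fsyncs, at some cost in durability.
conn.execute("PRAGMA synchronous=NORMAL")
print(mode)
conn.close()
```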

shahidkarimi · on May 15, 2022

Scrapy is there to make all of this happen under a single framework.

bschne · on May 15, 2022

Aside: The last time I had to scrape a lot of data from the web, I additionally used SQLite, which is a breeze to use with Python (basically one import statement and you're set). It might be overkill for some cases, but I found it a huge boon for keeping track of which pages were scraped, which failed, and doing subsequent data processing and parsing "offline". It made it so much easier to recover from the inevitable random error or different markup somewhere deep in your list of pages to scrape etc.
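A minimal sketch of that bookkeeping with the standard-library `sqlite3` module. The `pages` table, its columns, and the helper names are hypothetical; the idea is only that every URL carries a status so a crashed run can resume and parsing can happen later from the stored HTML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    """CREATE TABLE pages (
           url TEXT PRIMARY KEY,
           status TEXT NOT NULL DEFAULT 'pending',  -- pending | done | failed
           html TEXT,
           error TEXT
       )"""
)


def enqueue(urls):
    """Add URLs to the queue; ones already known are left untouched."""
    conn.executemany("INSERT OR IGNORE INTO pages (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def mark_done(url, html):
    conn.execute("UPDATE pages SET status = 'done', html = ? WHERE url = ?", (html, url))
    conn.commit()


def mark_failed(url, error):
    conn.execute("UPDATE pages SET status = 'failed', error = ? WHERE url = ?", (error, url))
    conn.commit()


def remaining():
    """URLs that still need work: never fetched, or failed last time."""
    rows = conn.execute("SELECT url FROM pages WHERE status != 'done' ORDER BY url")
    return [row[0] for row in rows]
```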

srvmshr · on May 15, 2022

We should have some community guidelines to keep out Medium/Towards Data Science and similar low-effort article sources from HN.

Genuinely in favor of fewer submissions vs. increased noise in submissions. Beginner articles are not taboo, but they go against having high quality insights in general.

PS: Flagging is a mechanism to filter by community effort. Guidelines set some general preconditions on the quality of articles for larger dissemination.

edent · on May 15, 2022

You can either hit the "flag" link, or submit something better.

TBurette · on May 15, 2022

Is there a good way to combine Scrapy framework (retry, rate limiting,..) with a headless browser such as selenium (to get full js-loaded client-side data)?

When I had to do it I ended up duplicating each page request twice. Once for scrapy and once again with selenium.

ihartley · on May 15, 2022

You can use something like scrapy-playwright[0] to run a headless browser framework as your download handler. I think there are versions for some of the other headless systems, if you prefer those.

[0] https://github.com/scrapy-plugins/scrapy-playwright
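A configuration sketch for the scrapy-playwright suggestion above, based on that project's README as I understand it; check it against the version you install. It goes in the Scrapy project's `settings.py`.

```python
# settings.py (fragment): route downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright relies on the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Individual requests then opt in with `meta={"playwright": True}`, so Scrapy's own scheduling, retry, and rate limiting keep applying and pages no longer need to be requested twice.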

samwillis · on May 15, 2022

scrapy-playwright is good, and Playwright is awesome. However due to the architecture of Playwright it just keeps accumulating memory until it crashes. You will want to set up your scraper to save its state regularly, cleanly shut down and restart. But once you have that working it does work well.
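A rough sketch of the "save state regularly, then restart" part, independent of any scraping library. The state layout (`pending` and `done` lists) and the function names are hypothetical; the write goes through a temporary file so an interrupted save cannot leave a half-written checkpoint.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write the checkpoint atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def load_state(path: str) -> dict:
    """Return the last checkpoint, or a fresh state if none exists yet."""
    if not os.path.exists(path):
        return {"pending": [], "done": []}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A supervising loop could call `save_state` every N pages, shut the browser down, and start a new process that begins from `load_state`.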

inshadows · on May 15, 2022

How do scrapers deal with being nice to a website these days? I'm talking multiple IPs, request rate, exponential backoff. Is there any body of knowledge for this?

ducktective · on May 15, 2022

Consider pup https://github.com/EricChiang/pup

dmortin · on May 15, 2022

How do scrapers deal with randomized classes in web pages which is more and more common these days?

Relying on the page structure only is not a robust alternative.

edmundsauto · on May 15, 2022

I've had some success with running a meta-scraper that will search for known value on a page, then back out the page structure from there. It won't help with randomly generated class names, but 95% of tasks I've written aren't this complex.
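A minimal sketch of the "search for a known value, then back out the structure" idea using only the standard library. The `PathFinder` class, the path notation, and the sample markup are hypothetical; a real meta-scraper would likely turn the recovered path into a selector and validate it on more pages.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathFinder(HTMLParser):
    """Record the tag path (with class names) of every text node containing `needle`."""

    def __init__(self, needle: str) -> None:
        super().__init__()
        self.needle = needle
        self.stack: list[str] = []
        self.paths: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = dict(attrs).get("class") or ""
        self.stack.append(tag + "".join("." + c for c in classes.split()))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.needle in data:
            self.paths.append(" > ".join(self.stack))


def find_paths(html: str, needle: str) -> list[str]:
    finder = PathFinder(needle)
    finder.feed(html)
    return finder.paths


# Hypothetical page where the price is already known from another source.
SAMPLE_PAGE = ('<html><body><div class="product card"><h2 class="name">Blue Widget</h2>'
               '<span class="price">$19.99</span></div></body></html>')
print(find_paths(SAMPLE_PAGE, "$19.99"))
```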

For sites that are hard to scrape (usually bigger sites that get scraped a lot), I pivot towards buying a data feed. Economies of scale incentivize these data companies towards putting someone on maintaining the feed full-time.

holografix · on May 15, 2022

How do people get around browser finger printing by “Sign in with Google” these days?

All I get is “your browser is not safe” etc which blocks me completely.

photochemsyn · on May 15, 2022

One problematic thing that pops out immediately for a Python-centric approach is that they don't mention that this is all best done in some kind of Python virtual environment, like miniconda or virtualenv. They just suggest 'pip install package', which is not a good approach for anyone (and particularly not beginners) - unless you want to end up with this:

https://xkcd.com/1987/

Looking around a bit with the requirement that the online tutorial mention this rather important fact, I found this alternative option, which helpfully notes:

We want to run all our scraping projects in a virtual environment, so we will set that up first.

https://python-adv-web-apps.readthedocs.io/en/latest/scrapin...

Compare and contrast that discussion with the one presented in this post - the above is far superior. Also, I don't understand why one would suggest PostgreSQL to a beginner when sqlite3 is included already in Python, and is going to be easier to use for small databases. Towardsdatascience seems to have a nice intro-to-sqlite3 tutorial.

stall84 · on May 15, 2022

This is great.. Mainly because the very first thing he does is explain the network requests themselves, focussing on the (somehow often left-out) fact that you are going to have to spoof a browser (or headers associated with it) almost always these days to get around bot-protections.
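A small sketch of the header point, using the standard library. By default `urllib` announces itself with a `Python-urllib/…` User-Agent, which is trivially recognizable. The header values below are hypothetical stand-ins; real ones are best copied from your own browser's network tab, and headers alone will not get past more thorough bot protection.

```python
import urllib.request

# Hypothetical values modeled on a desktop Firefox; replace with your own.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def browser_like_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET carrying browser-style headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


req = browser_like_request("https://example.com/")
print(req.header_items())
```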