mogoz

Scraping

tags: System Design, Archival

FAQ

Good resources?

https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero
https://github.com/lorien/awesome-web-scraping/tree/master : Awesome list of tools

Legal?

Headless vs. non-headless

headless: no GUI (e.g. web scraping)
non-headless: GUI with visual rendering (e.g. if the user needs to keep seeing what the automation does)

What are the different kinds of scraping bots?

I’ll keep updating this list.

Sneaker bot: also called a “shoe bot”; software designed to help individuals quickly purchase limited-availability stock.

Tools

Web Scraping projects

Tool Name	Description	Use Case	Links
BrightData	Developer-focused proxy network and scraping infrastructure	Custom scraping solutions	Website
Diffbot	AI-powered structured data extraction API	Market research	Website
ScrapingBee	Headless browser management service	Browser automation	Website
Apify	Cloud-based platform for web scraping and automation	Large-scale data extraction, automation workflows	Website
Octoparse	No-code web scraping tool with a user-friendly interface	Non-technical data collection	Website
Zyte	Formerly Scrapinghub; provides Scrapy framework and managed scraping services	Structured data extraction	Website
SerpAPI	API for accessing Google search results programmatically	Search engine data collection	Website

Web Discovery & Mining & Text Processing

Tool Name	Description	Use Case	Links
Trafilatura	Advanced web scraping library with metadata extraction	Content harvesting	GitHub
Minet	Python webmining toolkit with CLI interface	Large-scale scraping	GitHub
postlight/parser	Mercury parser for web content extraction	Article extraction	GitHub
crawl4ai	Open-Source LLM-Friendly Web Crawler & Scraper
Firecrawl	Open-source tool for extracting clean, LLM-ready data from websites	Web scraping for AI apps	Website
LLM Scraper	TypeScript library for structured web scraping using LLMs	Web data extraction	GitHub
OmniParser	Computer vision tool for parsing UI screenshots into structured data	GUI automation agents	GitHub
simonw/shot-scraper	Takes pixel-perfect screenshots; can be used for change detection
files-to-prompt	Concatenates multiple files into a single prompt for LLM usage	Prepping text for LLM prompts	GitHub
Markitdown	Converts files and office documents (PDF, Word, Excel, PowerPoint, etc.) to Markdown	Preparing documents for LLMs	GitHub
defuddle-cli	CLI tool that extracts the main content of a web page and strips the surrounding clutter	Readable content extraction	GitHub
repomix	Packs a code repository into a single AI-friendly file	Feeding a codebase to an LLM	GitHub

Browser automation

Tool Name	Description	Use Case	Links
vimGPT/browserGPT	AI-powered automation tools for editors/browsers	Workflow automation	(Community projects)
Stagehand	AI-assisted browser automation framework	Web testing	GitHub

Change Detection

Tool Name	Description	Use Case	Links
urlwatch	Website change monitoring with multiple notification channels	Content tracking	GitHub
changedetection.io	Self-hosted visual change detection platform	Website monitoring	GitHub
Changd	Open-source web monitoring tool for visual changes, XPath, and API data	Website change monitoring	GitHub
Visualping	Commercial service for monitoring webpage changes with alerts and reports	Business intelligence, compliance	Website

Post-Processing

Tool Name	Description	Use Case	Links
strip-tags	HTML tag stripping utility	Text cleanup	GitHub
mailparser	Advanced email parsing library	Email processing	GitHub

Tool Name	Description	Use Case	Links
twarc2	Twitter archiving and analysis toolkit (from Documenting the Now)	Social media research	Docs
snscrape	Social media scraping toolkit (multiple platforms)	Public data collection	GitHub
PMAW	Pushshift wrapper for Reddit data	Reddit analysis	GitHub

Miscellaneous Tools

Tool Name	Description	Use Case	Links
browser_cookie3	Browser cookie extraction library	Authentication automation	GitHub
pdf2htmlEX	PDF to HTML converter	Document processing	GitHub

Enumeration & Brute-Force

Tool Name	Description	Use Case	Links
Legba	Advanced network protocol brute-forcing tool	Security testing	Blog

Checklist & Best Practices

Checklist

Use something like Wappalyzer to find out the tech stack, the protection in use, etc.
Does the website have an API (internal or exposed)?
Does it have some JSON inside the HTML? E.g. the site might preload JSON payloads into the initial HTML for hydration.
Think beyond DOM scraping
- Does it even need scraping, or can I just make an API call?
- Does it include a static session header?
- Does it include a dynamic session header?
- Does it dump things to the heap, so that we can use objects from it?
If it’s DOM-based scraping and we’re using Playwright, can we get by with codegen?
Is the data being served via an iframe? In that case we check the source of the frame.
Does it make certain requests only from the mobile app? TODO: How do we catch these?
Is the data being rendered via canvas, so no DOM at all? Maybe tools like shot-scraper, ishan0102/vimGPT, OpenAdapt, or mayt/BrowserGPT can help?
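For the “JSON inside the HTML” check above, here is a minimal sketch of pulling a hydration payload out of raw HTML without a browser. The function name extractEmbeddedJson is made up for illustration, and the script id __NEXT_DATA__ (the one Next.js uses) is only an assumed example; other sites use other ids or assign the payload to a global variable instead.

```typescript
// Pull a JSON payload out of a <script id="..."> tag in raw HTML.
// Assumption: the payload sits in a script tag with a known id and is plain JSON.
function extractEmbeddedJson(html: string, scriptId: string): unknown {
  // Escape regex metacharacters so the id is matched literally.
  const id = scriptId.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `<script[^>]*\\bid=["']${id}["'][^>]*>([\\s\\S]*?)</script>`,
    "i",
  );
  const match = pattern.exec(html);
  if (match === null) return null; // no such script tag
  try {
    return JSON.parse(match[1] ?? "");
  } catch {
    return null; // tag found, but its body was not valid JSON
  }
}
```

A regex is enough for a quick check like this; for anything long-lived, a real HTML parser is likely the more robust choice.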

Best practices

Sites with dynamic sessions

These usually need a complex combination of temporary auth token headers, which are difficult to produce outside the context of the app, expire, etc.
In these cases, we more or less need to automate the task of “inspecting the network tab”. Application context can help. (See Page.setRequestInterception() in Puppeteer and Network Events | Playwright.)
Sometimes the tokens may even be predictable in some way.
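A rough sketch of automating the “network tab” inspection: listen to outgoing requests and remember the session headers the app itself sends. It relies only on page.on("request"), request.url() and request.headers(), which Playwright provides; the small interfaces are stand-ins so the snippet compiles without Playwright installed, and the function name, apiPrefix and headerNames are hypothetical, site-specific choices.

```typescript
// Minimal structural types mirroring the parts of Playwright's Page/Request used here.
interface RequestLike {
  url(): string;
  headers(): Record<string, string>;
}
interface PageLike {
  on(event: "request", listener: (request: RequestLike) => void): unknown;
}

// Record the latest value of selected headers seen on requests to an API prefix.
// Returns a live Map that fills up as the page makes requests.
function captureSessionHeaders(
  page: PageLike,
  apiPrefix: string,
  headerNames: string[],
): Map<string, string> {
  const captured = new Map<string, string>();
  page.on("request", (request) => {
    if (!request.url().startsWith(apiPrefix)) return; // ignore unrelated traffic
    const headers = request.headers(); // Playwright lower-cases header names
    for (const name of headerNames) {
      const value = headers[name.toLowerCase()];
      if (value !== undefined) captured.set(name.toLowerCase(), value);
    }
  });
  return captured;
}
```

The idea would be to call this before navigating, let the app log in or load normally, and then reuse the captured headers for direct API calls until they expire.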

Sites with data in the runtime Heap

E.g. find the Apollo Client instance in memory and use it to get the data. Profit? (See adriancooney/puppeteer-heap-snapshot; this will work with Playwright as well because it uses CDP.)
This can be slow, but it is nice because even if the UI changes frequently, the underlying data structure that stores the data might not.

DOM based scraping

We try using Playwright codegen if possible.
Don’t use XPath or CSS selectors at all (unless you have no choice). Rely on more generic locators instead, e.g. “the button that has ‘Sign in’ on it”: await page.getByRole('button', { name: 'Sign in' }).click();

Other ideas

Crawlee Primer

Crawlee currently supports 3 main crawlers: CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler.
Crawlee also offers Request and RequestQueue. These are low-level building blocks.
Every crawler has an implicit RequestQueue instance, and you can add requests to it with the crawler.addRequests() method.

Playwright notes

Injecting scripts

https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code

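import path from "node:path";

// Runs dismissDialog.js inside the page before any of the page's own scripts.
// injectionsDir is assumed to be defined elsewhere.
await page.addInitScript({
  path: path.join(injectionsDir, "dismissDialog.js"),
});

// or: expose a Node-side function to the page as window.isShown
await page.exposeFunction(isShown.name, isShown);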

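I think the benefit of exposeFunction is that we get type safety for the function: it stays in our TypeScript codebase and runs on the Node side when the page calls it. With addInitScript and a path, the injected code has to be a plain JavaScript (non-TS) file.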

Bot detection

https://github.com/apify/crawlee-python/issues/684 (Camoufox)

Waiting for items to appear

networkidle is discouraged. See https://github.com/microsoft/playwright/issues/22897

Resources

War stories

So… I built a Browser Extension to grab the data at a speed that is usually under their detection rate. Basically created a distributed scraper and passed it out to as many people in the league as I could.
- I found that tampermonkey is often much easier to deal with in most cases and also much quicker to develop for
- some sites can block ‘self’ origin scripts by leaving it out of the CSP and only allowing scripts they control served by a CDN

Others

Antibot stuff

Antibot Protection

If the anti-bot system detects your fingerprint or you raise suspicion, you get a captcha. The idea is to detect which anti-bot mechanism is at play and then use the matching bypass techniques when scraping. With some anti-bot tools you may not even need a headless browser; rotating proxies alone may solve it.
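A minimal sketch of the “rotating proxies” idea as plain round-robin over a fixed list. The function name createProxyRotator is made up for illustration, and real proxy URLs would come from whichever provider is in use; commercial rotating proxies usually handle rotation on their side, so treat this as a picture of the concept only.

```typescript
// Round-robin proxy picker: each call returns the next proxy URL in the list,
// wrapping around at the end. Proxy URLs are supplied by the caller.
function createProxyRotator(proxies: readonly string[]): () => string {
  if (proxies.length === 0) throw new Error("need at least one proxy");
  let index = 0;
  return () => {
    const proxy = proxies[index % proxies.length] as string;
    index += 1;
    return proxy;
  };
}
```

With Playwright, the returned URL could then be passed as the proxy server when creating a browser context, e.g. browser.newContext({ proxy: { server } }), so each context goes out through a different IP.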

Fingerprinting

See Anonymity

Passive

This is usually not under your control. You can try changing devices, etc.
- TCP/IP: IPv4 and IPv6 headers, TCP headers, the dynamics of the TCP handshake, and the contents of application-level payloads. (See p0f.)
- TLS: The start of the TLS handshake (the ClientHello) is not encrypted and can be used for fingerprinting.
- HTTP: For HTTP/2, frames such as SETTINGS, WINDOW_UPDATE and PRIORITY differ between clients, so they can be used to fingerprint the client.

Active

In this case, the website runs certain tests on your client to check whether your fingerprint matches, and takes whatever action it wants based on that info.
- Canvas fingerprinting: the site asks the browser to render something that may come out differently on a personal computer vs. a VM, etc. WebGL fingerprinting works similarly.

Products offering protection

Datadome
PerimeterX
Kasada
Cloudflare
- You could also get creative, e.g. by figuring out the origin IP somehow (DNS leak, logs, subdomains, etc.). But this only works if the site admin forgot to add firewall rules that allow traffic only from Cloudflare.
OSS
- Open-source JavaScript Bot Detection Library
- omrilotan/isbot

Antibot solutions

Proxy services

I’ll just say that firefox still runs tampermonkey, and that includes firefox mobile, so depending on how often you need a different IP and how much data you’re getting, you might be able to do away with the whole idea of proxies and just have a few mobile phones that can be configured as workers that take requests through a tampermonkey script. Or that a laptop tethers to that does the same, or that runs puppeteer itself. It depends on whether a worker needs a new IP every few minutes, hours or days as to whether a real mobile phone works (as some manual interaction is often required to actively change the IP). - kbenson

Residential/Mobile
- How IPs For Web Scraping Are Sourced | Scraping Fish
- Build Your Own Mobile Proxy for Web Scraping | Scraping Fish
4G rotating proxies??

Captcha solvers

Obfuscate fingerprint

May require playing with JS
Manage cookies/headers
Crack backend APIs and so on.

Other configs

There is always some site-specific config that you’ll have to find by trial and error, e.g. some sites might not like headless, so you have to scrape in non-headless mode or something similar.

Pre-made solutions

These usually do the job of Proxy services + Obfuscating fingerprints
Bright Data, Zyte API, Smartproxy, and Oxylabs Web Unlocker