mogoz

Scraping

tags
System Design, Archival

FAQ

Good resources?

Headless vs. non-headless

  • headless: no GUI; the browser runs without a visible window (e.g. web scraping)
  • non-headless (headed): GUI with visual rendering (e.g. when the user needs to keep watching what the automation does)

What are the different kinds of scraping bots?

I'll keep updating this list.

  • Sneaker bot: commonly called a “shoe bot”; software designed to help individuals quickly purchase limited-availability stock.

Tools

Web Scraping projects

Tool Name | Description | Use Case | Links
BrightData | Developer-focused proxy network and scraping infrastructure | Custom scraping solutions | Website
Diffbot | AI-powered structured data extraction API | Market research | Website
ScrapingBee | Headless browser management service | Browser automation | Website
Apify | Cloud-based platform for web scraping and automation | Large-scale data extraction, automation workflows | Website
Octoparse | No-code web scraping tool with a user-friendly interface | Non-technical data collection | Website
Zyte | Formerly Scrapinghub; provides Scrapy framework and managed scraping services | Structured data extraction | Website
SerpAPI | API for accessing Google search results programmatically | Search engine data collection | Website

Web Discovery & Mining & Text Processing

Tool Name | Description | Use Case | Links
Trafilatura | Advanced web scraping library with metadata extraction | Content harvesting | GitHub
Minet | Python webmining toolkit with CLI interface | Large-scale scraping | GitHub
postlight/parser | Mercury parser for web content extraction | Article extraction | GitHub
crawl4ai | Open-source, LLM-friendly web crawler and scraper | - | -
Firecrawl | Open-source tool for extracting clean, LLM-ready data from websites | Web scraping for AI apps | Website
LLM Scraper | TypeScript library for structured web scraping using LLMs | Web data extraction | GitHub
OmniParser | Computer vision tool for parsing UI screenshots into structured data | GUI automation agents | GitHub
simonw/shot-scraper | Takes pixel-perfect screenshots | Change detection | -
files-to-prompt | Concatenates multiple files into a single prompt for LLM usage | Prepping text for LLM prompts | GitHub
Markitdown | Converts files such as PDF and Office documents to Markdown | Content formatting | GitHub
defuddle-cli | CLI tool that extracts the main content from web pages and strips clutter | Content cleanup | GitHub
repomix | Packs an entire code repository into a single AI-friendly file | Prepping a codebase for LLM prompts | GitHub

Browser automation

Tool Name | Description | Use Case | Links
vimGPT/browserGPT | AI-powered browser automation tools | Workflow automation | (Community projects)
Stagehand | AI-assisted browser automation framework | Web testing | GitHub

Change Detection

Tool Name | Description | Use Case | Links
urlwatch | Website change monitoring with multiple notification channels | Content tracking | GitHub
changedetection.io | Self-hosted visual change detection platform | Website monitoring | GitHub
Changd | Open-source web monitoring tool for visual changes, XPath, and API data | Website change monitoring | GitHub
Visualping | Commercial service for monitoring webpage changes with alerts and reports | Business intelligence, compliance | Website
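
The core idea behind text-based change detection can be sketched in a few lines: normalize the fetched content, store a fingerprint, and compare it on the next run. This is only a rough sketch, not how any of the tools above work internally. The names `normalize`, `fingerprint`, and `hasChanged` are made up for illustration, and the hash is a simple non-cryptographic FNV-1a variant chosen to avoid dependencies (a real setup would more likely use SHA-256).

```typescript
// Collapse whitespace so cosmetic reformatting does not count as a change.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// 32-bit FNV-1a over UTF-16 code units (an assumption made for brevity;
// the canonical algorithm works on bytes). Returns a hex string.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Compare two snapshots of the same page by fingerprint.
function hasChanged(previous: string, current: string): boolean {
  return fingerprint(normalize(previous)) !== fingerprint(normalize(current));
}

// Hypothetical snapshots of a price element on two different days.
const changedOnlyInWhitespace = hasChanged("Price: 10", "Price:   10\n");
const changedInValue = hasChanged("Price: 10", "Price: 12");
```

Storing only the fingerprint keeps the state small; the trade-off is that you can tell that something changed but not what.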

Post-Processing

Tool Name | Description | Use Case | Links
strip-tags | HTML tag stripping utility | Text cleanup | GitHub
mailparser | Advanced email parsing library | Email processing | GitHub

Social Media Tools

Tool Name | Description | Use Case | Links
twarc2 | Twitter archiving and analysis toolkit (community-maintained) | Social media research | Docs
snscrape | Social media scraping toolkit (multiple platforms) | Public data collection | GitHub
PMAW | Pushshift wrapper for Reddit data | Reddit analysis | GitHub

Miscellaneous Tools

Tool Name | Description | Use Case | Links
browser_cookie3 | Browser cookie extraction library | Authentication automation | GitHub
pdf2htmlEX | PDF to HTML converter | Document processing | GitHub

Enumeration & Brute-Force

Tool Name | Description | Use Case | Links
Legba | Advanced network protocol brute-forcing tool | Security testing | Blog

Checklist & Best Practices

Checklist

  • Use something like Wappalyzer to find out the tech stack, protection used, etc.
  • Does the website have an API (internal or exposed)?
  • Does it have some JSON inside the HTML? E.g. the site might preload JSON payloads into the initial HTML for hydration.
  • Think beyond DOM scraping
    • Does it even need scraping, or can I just make an API call?
    • Does it include a static session header?
    • Does it include a dynamic session header?
    • Does it dump objects to the heap that we can use?
  • If it's DOM-based scraping and we're using Playwright, can we get by with codegen?
  • Is the data served via an iframe? In that case, check the source of the frame.
  • Does it make certain requests only from the mobile app? TODO: How do we catch these?
  • Is the data rendered via canvas, so no DOM at all? Maybe tools like shot-scraper, ishan0102/vimGPT, OpenAdapt, or mayt/BrowserGPT can help?
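
For the "JSON inside the HTML" item, a minimal sketch of pulling embedded payloads out of a page without touching the DOM might look like this. It assumes the payload sits in a `<script type="application/json">` (or `application/ld+json`) tag, which is what Next.js does with `__NEXT_DATA__`. The function name `extractEmbeddedJson` and the sample page are made up for illustration, and the regex approach is a shortcut; a proper HTML parser is more robust.

```typescript
// Collect JSON payloads embedded in <script> tags of an HTML string.
function extractEmbeddedJson(html: string): unknown[] {
  const results: unknown[] = [];
  // Matches <script ... type="application/json"> or "application/ld+json" and captures the body.
  const scriptRe =
    /<script\b[^>]*type=["']application\/(?:ld\+)?json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  while ((match = scriptRe.exec(html)) !== null) {
    try {
      results.push(JSON.parse(match[1]));
    } catch {
      // Body is not valid JSON (or is HTML-escaped); skip this tag.
    }
  }
  return results;
}

// Hypothetical page that preloads its state for hydration.
const sampleHtml = `
<html><body>
  <div id="app"></div>
  <script id="__NEXT_DATA__" type="application/json">{"props":{"items":[1,2,3]}}</script>
</body></html>`;

const payloads = extractEmbeddedJson(sampleHtml);
```

If this returns what you need, you can often skip the headless browser entirely and fetch the raw HTML with a plain HTTP client.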

Best practices

Sites with dynamic sessions

Sites with data in the runtime Heap

DOM based scraping

  • Try Playwright codegen first, if possible.
  • Avoid XPath and CSS selectors unless you have no choice. Rely on more generic, user-facing locators instead, e.g. “the button that has ‘Sign in’ on it”: await page.getByRole('button', { name: 'Sign in' }).click();

Other ideas

Crawlee Primer

  • Currently supports 3 main crawlers: CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler.
  • Crawlee also offers Request and RequestQueue. These are the low-level building blocks.
  • Every crawler has an implicit RequestQueue instance, and you can add requests to it with the crawler.addRequests() method.
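
To get a feel for what a request queue does conceptually, here is a toy model: FIFO order plus de-duplication by a unique key derived from the URL. This is not Crawlee's implementation or API. `ToyRequestQueue` and `uniqueKeyFor` are hypothetical names, and the normalization (only dropping the fragment) is a simplifying assumption.

```typescript
// Derive a de-duplication key from a URL by dropping the fragment,
// since fragments do not change the fetched resource.
function uniqueKeyFor(url: string): string {
  const parsed = new URL(url);
  parsed.hash = "";
  return parsed.toString();
}

class ToyRequestQueue {
  private pending: string[] = [];
  private seen = new Set<string>();

  // Returns true if the URL was enqueued, false if its key was already seen.
  add(url: string): boolean {
    const key = uniqueKeyFor(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.pending.push(url);
    return true;
  }

  // Returns the next URL to fetch, or undefined when the queue is empty.
  next(): string | undefined {
    return this.pending.shift();
  }
}

const queue = new ToyRequestQueue();
queue.add("https://example.com/a");
queue.add("https://example.com/a#section"); // duplicate after normalization, ignored
queue.add("https://example.com/b");
```

The point of the model is that adding the same page twice is harmless, which is what lets a crawler enqueue every link it finds without tracking visited pages itself.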

Playwright notes

Injecting scripts

https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code

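import path from "node:path";

// Runs the given script in the page before the page's own scripts.
await page.addInitScript({
  path: path.join(injectionsDir, "dismissDialog.js"),
});

// or: expose a Node-side function to the page as window.isShown
await page.exposeFunction(isShown.name, isShown);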

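I think the benefit of exposeFunction is that we get type safety for the function; with addInitScript and a path, the file has to be plain JavaScript (not TypeScript). Note that a function registered with exposeFunction runs in the Node.js process, not in the page, so it cannot touch the DOM directly.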

Bot detection

Waiting for items to appear

Resources

War stories

  • So… I built a Browser Extension to grab the data at a speed that is usually under their detection rate. Basically created a distributed scraper and passed it out to as many people in the league as I could.
    • I found that Tampermonkey is often much easier to deal with and also much quicker to develop for.
    • Some sites can block ‘self’ origin scripts by leaving ‘self’ out of the CSP and only allowing scripts they control, served by a CDN.
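
Staying under a detection rate mostly comes down to pacing. A minimal sketch of planning request start times with a minimum gap plus random jitter is below. `planRequestTimes`, `minGapMs`, and `jitterMs` are hypothetical names and knobs; the right values depend entirely on the site and are found by trial and error.

```typescript
// Returns start offsets in milliseconds for `count` requests.
// Each gap is minGapMs plus a random extra in [0, jitterMs).
// `random` is injectable so the plan can be made deterministic in tests.
function planRequestTimes(
  count: number,
  minGapMs: number,
  jitterMs: number,
  random: () => number = Math.random,
): number[] {
  const offsets: number[] = [];
  let t = 0;
  for (let i = 0; i < count; i++) {
    offsets.push(t);
    t += minGapMs + Math.floor(random() * jitterMs);
  }
  return offsets;
}

// Deterministic example: three requests, 1s minimum gap, up to 500ms jitter.
const plan = planRequestTimes(3, 1000, 500, () => 0.5);
```

Each offset can then be fed to setTimeout (or an equivalent sleep) before firing the request.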

Others

Antibot stuff

Antibot Protection

If the anti-bot system flags your fingerprint or you otherwise raise suspicion, you get a captcha. The idea is to detect which anti-bot mechanism is at play and then use the matching bypass techniques when scraping. With some anti-bot tools you may not even need a headless browser; rotating proxies alone may solve it.
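
A rough sketch of the "detect which anti-bot mechanism is at play" step is below: look at response headers and cookie names for vendor markers. The marker list reflects commonly reported signals (`cf-ray` header for Cloudflare, `datadome` cookie, `_px*` cookies, `_abck`/`bm_sz` cookies) and should be treated as an assumption, since vendors change these. The `ResponseInfo` type and `guessAntiBotVendor` are hypothetical names.

```typescript
type ResponseInfo = {
  headers: Record<string, string>; // header names assumed to be lower-cased
  cookieNames: string[]; // names of cookies set by the response
};

// Returns a best-guess vendor name, or null if no known marker is present.
// A match only suggests the vendor sits in front of the site; it says
// nothing about how strictly bot protection is configured.
function guessAntiBotVendor(res: ResponseInfo): string | null {
  const headers = res.headers;
  const cookies = res.cookieNames;
  if ("cf-ray" in headers || headers["server"] === "cloudflare") return "Cloudflare";
  if (cookies.includes("datadome")) return "DataDome";
  if (cookies.some((name) => name.startsWith("_px"))) return "PerimeterX (HUMAN)";
  if (cookies.includes("_abck") || cookies.includes("bm_sz")) return "Akamai Bot Manager";
  return null;
}

// Hypothetical response metadata captured from a first probe request.
const vendor = guessAntiBotVendor({
  headers: { server: "cloudflare", "cf-ray": "0000000000000000-XXX" },
  cookieNames: [],
});
```

Knowing the vendor narrows down what to try first, e.g. whether rotating proxies alone are likely to be enough.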

Fingerprinting

See Anonymity

  • Active

    In this case, the website runs tests against your browser to check whether your fingerprint matches, and acts on that information however it sees fit.

    • Canvas Fingerprinting: The site renders something that may come out differently on a personal computer vs. a VM, etc. WebGL fingerprinting works similarly.

Products offering protection

Antibot solutions

Proxy services

I’ll just say that firefox still runs tampermonkey, and that includes firefox mobile, so depending on how often you need a different IP and how much data you’re getting, you might be able to do away with the whole idea of proxies and just have a few mobile phones that can be configured as workers that take requests through a tampermonkey script. Or that a laptop tethers to that does the same, or that runs puppeteer itself. It depends on whether a worker needs a new IP every few minutes, hours or days as to whether a real mobile phone works (as some manual interaction is often required to actively change the IP). - kbenson

Captcha solvers

Obfuscate fingerprint

  • May require playing with JS
  • Manage cookies/headers
  • Crack backend APIs and so on.

Other configs

  • There is always some site-specific config that you'll need to find by trial and error, e.g. some sites might not like headless mode, so you have to scrape non-headless or something similar.

Pre-made solutions

  • These usually do the job of proxy services + obfuscating fingerprints
  • Bright Data, Zyte API, Smartproxy, and Oxylabs Web Unblocker