OCR

tags: Machine Learning, Image Compression, Computer Vision, Deploying ML applications (applied ML)

Comparison

Type	Name	Description
Service	Claude/OpenAI/AWS	They have APIs

LSTM-CNN	Tesseract
PP-OCR (DB+CRNN)	PaddleOCR	Handles rotated text
	EasyOCR
Toolbox, modular models	doctr	Some people report it works better than PaddleOCR and Tesseract.
PyTorch + OpenMMLab	MMOCR	Might be nice if you already use MMDetection.
	surya	Documents only; doesn't work on handwriting. Faster than Tesseract, has language support, and tries to guess the proper reading order.
VLM	MGP-STR	new kid (2024)
VLM	GOT	new kid (2024)
VLM	olmOCR	olmOCR: Open-Source OCR for Accurate Document Conversion (includes a comparison with GOT)
VLM	RolmOCR	A better and faster olmOCR
VLM	TrOCR
VLM	DONUT
VLM	InternVL
VLM	Idefics2

olmOCR introduces a technique they call “Document Anchoring”, where the quality of the extracted text is enhanced with any text and metadata present in the PDF file.
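A rough sketch of the idea, not olmOCR's actual code: text blocks and their positions are pulled out of the PDF and placed in the prompt next to the page image, so the model can lean on them instead of reading pixels alone. The `TextBlock` type, the `build_anchored_prompt` function, the prompt wording and the `[x, y]` line format are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """A piece of text already embedded in the PDF, with its page position."""
    x: float
    y: float
    text: str


def build_anchored_prompt(
    blocks: List[TextBlock],
    page_width: float,
    page_height: float,
    max_chars: int = 4000,
) -> str:
    """Build a text prompt that carries PDF-extracted text as anchors.

    The format below is hypothetical; it only illustrates passing
    position-tagged text to a VLM alongside the rendered page image.
    """
    lines = [f"Page dimensions: {page_width:.1f}x{page_height:.1f}"]
    for block in blocks:
        lines.append(f"[{block.x:.0f}, {block.y:.0f}] {block.text}")
    # Truncate the anchor text so the prompt stays within a size budget.
    anchor = "\n".join(lines)[:max_chars]
    return (
        "Transcribe the attached page image. "
        "Text extracted from the PDF itself is given below as a hint:\n"
        + anchor
    )


blocks = [
    TextBlock(72, 700, "Quarterly Report"),
    TextBlock(72, 650, "Revenue grew 12%."),
]
prompt = build_anchored_prompt(blocks, 612, 792)
print(prompt)
```

The resulting string would be sent together with the rendered page image; scanned PDFs with no embedded text would simply produce an anchor section holding only the page dimensions.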

Resources

RolmOCR-7B follows the same recipe as olmOCR and builds on Qwen2.5-VL
- https://huggingface.co/reducto/RolmOCR
MGP-STR: seems to be better than EasyOCR
stepfun-ai/GOT-OCR2_0 · Hugging Face
- Because it uses opt-125 and a few other components, it is not allowed for commercial use. Otherwise, the Hugging Face model card includes inference code as well.
Show HN: Gogosseract, a Go Lib for CGo-Free Tesseract OCR via Wazero | Hacker News
Qwen2-VL-7B Instruct model gets 100% accuracy extracting text from this handwritten document
Mistral OCR | Hacker News
- Mistral OCR: Revolutionary or Just Hype? - YouTube
https://github.com/facebookresearch/nougat
https://github.com/VikParuchuri/marker
Run a job queue for GOT-OCR | Modal Docs
Benchmarking vision-language models on OCR in dynamic video environments | Hacker News 🌟
- Show HN: Benchmarking VLMs vs. Traditional OCR | Hacker News
OCR4all | Hacker News
Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual) | Hacker News
edge
- we use clip embeddings in our first product for a lot of our search capabilities, but it’s mostly interesting because of its carryover to other parts of the app.
- we also do a lot of OCR work, so in order to filter down candidate images to look at and speed up preprocessing, we trained a small MLP to take in the preprocessed clip embeddings (instead of raw images) and predict whether they contain text.
- the classifier has an f1 score of 0.98. it takes 2-3min to train on a consumer laptop (given our dataset of 30k embeddings + 60k synthetic), it’s 300kb on device, and it runs at 10k fps, so it can absolutely rip through a photo library.
- so now instead of running useless OCR vision requests on images with no legible text, we can just skip them up front. for example on my library, in my last 5k photos nearly 2k have no text we can completely skip over processing. the fastest way to speed up work is to do no work at all!
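A minimal sketch of the gating step described above, assuming a small MLP (one hidden ReLU layer, sigmoid output) that scores an embedding for "contains text" and skips OCR below a threshold. The weights, the 2-D toy embeddings, the file names and the function names are all made up for illustration; real CLIP embeddings have hundreds of dimensions and the weights would come from training.

```python
import math
from typing import Dict, List, Sequence, Tuple

Params = Tuple[List[List[float]], List[float], List[float], float]


def text_probability(embedding: Sequence[float], params: Params) -> float:
    """Forward pass of a one-hidden-layer MLP: ReLU hidden units, sigmoid output."""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def select_for_ocr(
    embeddings: Dict[str, Sequence[float]],
    params: Params,
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the images whose predicted text probability reaches the threshold."""
    return [
        name
        for name, emb in embeddings.items()
        if text_probability(emb, params) >= threshold
    ]


# Toy, hand-set weights (assumption): the first input dimension votes for
# "has text", the second votes against it.
PARAMS: Params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [4.0, -4.0], 0.0)

# Toy stand-ins for precomputed CLIP embeddings.
library = {
    "receipt.jpg": [0.9, 0.1],
    "sunset.jpg": [0.1, 0.9],
    "street_sign.jpg": [0.7, 0.2],
}

to_ocr = select_for_ocr(library, PARAMS)
print(to_ocr)  # only these images would be sent to the OCR step
```

The threshold is the knob to tune: lowering it sends more borderline images to OCR, raising it skips more of them at the risk of missing images with faint text.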
