
Computer Vision

Tags: Machine Learning, NLP (Natural Language Processing), OCR

Meta

Datasets

  • ImageNet
  • COCO

Vision tasks

| Task | Variations |
| --- | --- |
| Object detection | Real-time, non-realtime, rotated, captioning, classification |
| Object tracking (video) | Similar to object detection but with some additional checks |
| Segmentation | SAM2 etc. |
| Contrastive Learning | |
| Distillation | |
| VDU (Visual Document Understanding) | OCR, OCR-free |
| VQA (Visual QA) | |
| Pose estimation | |

Image generation

Object Detection

| Architecture | Type | Name | Use | Other notes |
| --- | --- | --- | --- | --- |
| Transformer | VLM | LLaVA | Visual Q/A, alternative to GPT-4V | |
| | VLM | moondream | Same as LLaVA | Probably can't do OCR |
| | VLM | CogVLM | Same as LLaVA | Better than LLaVA at captioning |
| | ViT | CLIP | Text-guided image generation, classification, captioning | |
| | ViT | BLIP | Same as CLIP, better than CLIP at captioning | Considered faster than CLIP? |
| | ViT | DETIC | | |
| | ViT | GDINO | Better at detection than CLIP | Similar to YOLO but slower |
| CNN | 1-stage | YOLO | Real-time object detection | No NLP involvement, unlike VLMs |
| | 2-stage | Detectron2 | | Apache license, Faster R-CNN |
| | | EfficientNetV2 | Classification | |
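CLIP-style models in the table do zero-shot classification by embedding the image and a set of candidate label prompts into the same space and ranking by cosine similarity. A toy numpy sketch of just that ranking step, with made-up embeddings (a real setup would get them from a CLIP image/text encoder):

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, labels: list) -> str:
    """Pick the label whose text embedding is most similar to the image embedding."""
    # L2-normalise so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = txt @ img
    return labels[int(np.argmax(scores))]

# Toy embeddings standing in for encoder outputs
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],  # "a photo of a cat"
    [0.0, 1.0, 0.0],  # "a photo of a dog"
])
print(zero_shot_classify(image_emb, text_embs, ["cat", "dog"]))  # cat
```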

Theory

CNN based

  • CNN uses pixel arrays
  • YOLO! (?)

    There's a lot of politics around YOLO naming: anyone can release something called YOLO, and a higher version number doesn't mean a newer version of the same project, which is super confusing. The original author stopped working on YOLO long ago for ethical reasons.

    “The YOLOv5, YOLOv6, and YOLOv7 teams all say their versions are faster and more accurate than the rest, and for what it’s worth, the teams for v4 and v7 overlap, and the implementation for v7 is based on v5. At the end of the day, the only benchmarks that really matter to us are ones using our data and hardware”

    | Year | Name | Description |
    | --- | --- | --- |
    | | Darknet/YOLO | Someone claims it's faster than the newer versions, idk tf they talking about. Goes up to YOLOv7 |
    | 2015 | YOLOv1 | Improved on R-CNN/Fast R-CNN by using a single CNN architecture, made things real fast |
    | 2016 | YOLOv2 | Improved on v1 (Darknet-19), anchor boxes |
    | 2018 | YOLOv3 | Improved on v2 (Darknet-53), NMS was added |
    | | YOLOv4 | Added CSPNet, k-means anchors, GHM loss etc. |
    | | YOLOv5 | |
    | | YOLOv6 | |
    | | YOLOv7 | |
    | | YOLOv8 | Variants: YOLOv8-n, YOLOv8-s, YOLOv8-m |
    | | YOLOv10 | |
    | | YOLO-X | Based on YOLOv3 but anchor-free, among other additions |
    | | YOLO-NAS | For detecting small objects, suitable for edge devices |
    | | YOLO-World | |
    • YOLO vs older CNN based models

      From a reddit comment

      • the R-CNN family:
        • Find the interesting regions
        • For every interesting region: What object is in the region?
        • Remove overlapping and low score detections
      • YOLO/SSD:
        • Come up with a fixed grid of regions
        • Predict N objects in every region all at once
        • Remove overlapping and low-score detections (same as above)
  • MMLabs 🌟

  • Doubts

    • ResNet?
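The "remove overlapping and low-score detections" step that both the R-CNN and YOLO/SSD families share is non-max suppression (NMS). A minimal pure-Python sketch of greedy NMS (thresholds are illustrative defaults, not any particular model's):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Greedy NMS: drop low-score boxes, then drop any box that
    overlaps an already-kept, higher-scoring box too much."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — the second box overlaps the first too much
```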

Transformer (Vision Encoder, ViT)

See VLM(Vision Language Models)

  • Main purpose: extract visual features into embeddings
  • ViT splits the input image into visual tokens:
    • divides the image into fixed-size patches
    • linearly embeds each of them
    • adds positional embeddings before feeding the sequence to the transformer encoder
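The steps above can be sketched in numpy. The projection matrix and positional embeddings are random placeholders here (in a real ViT both are learned), and the 224/16 sizes just match the common ViT-Base setup:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # dummy image
patches = patchify(image, patch=16)        # (196, 768): a 14x14 grid of patches
W = rng.random((16 * 16 * 3, 512))         # linear projection (learned in a real ViT)
pos = rng.random((patches.shape[0], 512))  # positional embeddings (learned in a real ViT)
tokens = patches @ W + pos                 # visual tokens for the transformer encoder
print(tokens.shape)  # (196, 512)
```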

Combining Transformer based + CNN based

This is only useful if you need very fast or low-compute inference; otherwise GDINO/CLIP etc. are the go-to for most cases.

  • CNN-based inference is currently faster than transformer-based, so something like YOLO is still preferable for realtime work.
  • But we can use GDINO to generate labels for our training dataset, and then use those labels to train our fast YOLO models.
    • Essentially, use transformer-based detection for labeling & training the CNN model
    • Use the CNN model to do fast inference in production
  • Basically, use foundation models to train fine-tuned models: the foundation model acts as an automatic labeling tool that produces your dataset.
  • https://github.com/autodistill/autodistill lets you do exactly this.
  • see https://www.youtube.com/@Roboflow/videos
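The auto-labeling step ultimately has to write YOLO-format label files: one line per box, `class x_center y_center width height`, all normalized to [0, 1]. A sketch of that conversion, assuming the upstream detector (e.g. GDINO) returns pixel-space `(x1, y1, x2, y2)` boxes (the function name and box format are illustrative, not any library's API):

```python
def to_yolo_label(box, class_id, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box into a YOLO label line:
    'class x_center y_center width height', all normalised to [0, 1]."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# e.g. one detection of class 0 on a 640x480 image
print(to_yolo_label((160, 120, 480, 360), 0, 640, 480))
# 0 0.500000 0.500000 0.500000 0.500000
```

One such file per image, next to the images, is all a YOLO trainer needs as a dataset.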

Segmentation

| Name | Description |
| --- | --- |
| SEEM | |
| SAM | |
  • To improve segmentation we can tune the parameters; we can also use some kind of object detection (e.g. YOLO) to draw bounding boxes before applying segmentation. See this thread for more info.

Visual Document Understanding (VDU)

  • OCR
    • 2-stage pipeline: usually, to understand a document, we run OCR first and then run the result through another process for the actual understanding.
    • Issue: the OCR output is often not what you want, e.g. no spatial understanding (text from different lines gets merged, etc.). Using an OCR-free approach might help.
    • See OCR
  • OCR-free

OpenCV

VLMs

See VLM(Vision Language Models)

3D

https://news.ycombinator.com/item?id=43589989 https://github.com/VAST-AI-Research/TripoSG

Others

Resources