
Computer Vision

Tags: Machine Learning, NLP (Natural Language Processing), OCR

Meta

Datasets

  • ImageNet
  • COCO

Vision tasks

| Task | Variations |
| --- | --- |
| Object detection | Real-time, non-realtime, rotated, captioning, classification |
| Object tracking (video) | Similar to object detection but with some additional checks |
| Segmentation | SAM2 etc. |
| Contrastive Learning | |
| Distillation | |
| VDU (Visual Document Understanding) | OCR, OCR-free |
| VQA (Visual QA) | |
| Pose estimation | |

Image generation

Object Detection

| Architecture | Type | Name | Use | Other notes |
| --- | --- | --- | --- | --- |
| Transformer | VLM | LLaVA | Visual Q/A, alternative to GPT-4V | |
| | VLM | moondream | Same as LLaVA | Probably can't do OCR |
| | VLM | CogVLM | Same as LLaVA | Better than LLaVA at captioning |
| | ViT | CLIP | Text-guided image generation, classification, captioning | |
| | ViT | BLIP | Same as CLIP, better than CLIP at captioning | Considered faster than CLIP? |
| | ViT | DETIC | | |
| | ViT | GDINO | Better at detection than CLIP | Similar to YOLO but slower |
| CNN | 1-stage | YOLO | Real-time object detection | No NLP involvement, unlike VLMs |
| | 2-stage | Detectron2 | | Apache license, Faster R-CNN |
| | | EfficientNetV2 | Classification | |
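CLIP-style models in the table do zero-shot classification by embedding the image and a set of candidate label prompts into the same space and ranking by cosine similarity. A toy numpy sketch of just that ranking step, with made-up embeddings (a real setup would get them from a CLIP image/text encoder):

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray, labels: list) -> str:
    """Pick the label whose text embedding is most similar to the image embedding."""
    # L2-normalise so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = txt @ img
    return labels[int(np.argmax(scores))]

# Toy embeddings standing in for encoder outputs
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],  # "a photo of a cat"
    [0.0, 1.0, 0.0],  # "a photo of a dog"
])
print(zero_shot_classify(image_emb, text_embs, ["cat", "dog"]))  # cat
```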

Theory

CNN based

  • CNN uses pixel arrays
  • YOLO! (?)

    There's a lot of politics around YOLO naming: anyone can release something called YOLO, and a higher version number doesn't mean a newer version of the same project, which is super confusing. The original author stopped working on YOLO long ago for ethical reasons.

    “The YOLOv5, YOLOv6, and YOLOv7 teams all say their versions are faster and more accurate than the rest, and for what it’s worth, the teams for v4 and v7 overlap, and the implementation for v7 is based on v5. At the end of the day, the only benchmarks that really matter to us are ones using our data and hardware”

    | Year | Name | Description |
    | --- | --- | --- |
    | | Darknet/YOLO | Someone claims it's faster than the newer versions, idk tf they talking about. Goes up to YOLOv7 |
    | 2015 | YOLOv1 | Improved on R-CNN/Fast R-CNN by using a single CNN architecture, made things real fast |
    | 2016 | YOLOv2 | Improved on v1 (Darknet-19), anchor boxes |
    | 2018 | YOLOv3 | Improved on v2 (Darknet-53), NMS was added |
    | | YOLOv4 | Added CSPNet, k-means anchors, GHM loss etc. |
    | | YOLOv5 | |
    | | YOLOv6 | |
    | | YOLOv7 | |
    | | YOLOv8 | Variants: YOLOv8-n, YOLOv8-s, YOLOv8-m |
    | | YOLOv10 | |
    | | YOLO-X | Based on YOLOv3 but anchor-free, among other additions |
    | | YOLO-NAS | For detecting small objects, suitable for edge devices |
    | | YOLO-World | |
    • YOLO vs older CNN based models

      From a reddit comment

      • the R-CNN family:
        • Find the interesting regions
        • For every interesting region: What object is in the region?
        • Remove overlapping and low score detections
      • YOLO/SSD:
        • Come up with a fixed grid of regions
        • Predict N objects in every region all at once
        • Remove overlapping and low-score detections (same as above)
  • MMLabs 🌟

  • Doubts

    • ResNet?
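The "remove overlapping and low-score detections" step that both the R-CNN and YOLO/SSD families share is non-max suppression (NMS). A minimal pure-Python sketch of greedy NMS (thresholds are illustrative defaults, not any particular model's):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Greedy NMS: drop low-score boxes, then drop any box that
    overlaps an already-kept, higher-scoring box too much."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — the second box overlaps the first too much
```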

Transformer (Vision Encoder, ViT)

See VLM(Vision Language Models)

  • Main purpose: extract visual features into embeddings
  • ViT splits the input image into visual tokens:
    • divides the image into fixed-size patches
    • linearly embeds each of them
    • adds positional embeddings before feeding the sequence to the transformer encoder
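The steps above can be sketched in numpy. The projection matrix and positional embeddings are random placeholders here (in a real ViT both are learned), and the 224/16 sizes just match the common ViT-Base setup:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # dummy image
patches = patchify(image, patch=16)        # (196, 768): a 14x14 grid of patches
W = rng.random((16 * 16 * 3, 512))         # linear projection (learned in a real ViT)
pos = rng.random((patches.shape[0], 512))  # positional embeddings (learned in a real ViT)
tokens = patches @ W + pos                 # visual tokens for the transformer encoder
print(tokens.shape)  # (196, 512)
```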

Combining Transformer based + CNN based

This is only useful if you need very fast or low-compute inference; otherwise GDINO/CLIP etc. are the go-to for most cases.

  • CNN-based inference is currently faster than transformer-based, so something like YOLO is still preferable for realtime work.
  • But we can use GDINO to generate labels for our training dataset, and then use those labels to train our fast YOLO models.
    • Essentially, use transformer-based detection for labeling & training the CNN model
    • Use the CNN model to do fast inference in production
  • Basically, use foundation models to train fine-tuned models: the foundation model acts as an automatic labeling tool that produces your dataset.
  • https://github.com/autodistill/autodistill lets you do exactly this.
  • see https://www.youtube.com/@Roboflow/videos
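The auto-labeling step ultimately has to write YOLO-format label files: one line per box, `class x_center y_center width height`, all normalized to [0, 1]. A sketch of that conversion, assuming the upstream detector (e.g. GDINO) returns pixel-space `(x1, y1, x2, y2)` boxes (the function name and box format are illustrative, not any library's API):

```python
def to_yolo_label(box, class_id, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box into a YOLO label line:
    'class x_center y_center width height', all normalised to [0, 1]."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# e.g. one detection of class 0 on a 640x480 image
print(to_yolo_label((160, 120, 480, 360), 0, 640, 480))
# 0 0.500000 0.500000 0.500000 0.500000
```

One such file per image, next to the images, is all a YOLO trainer needs as a dataset.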

Segmentation

| Name | Description |
| --- | --- |
| SEEM | |
| SAM | |
  • To improve segmentation we can tune the parameters; we can also use some kind of object detection (e.g. YOLO) to draw bounding boxes before applying segmentation. See this thread for more info.

Visual Document Understanding (VDU)

  • OCR
    • 2-stage pipeline: usually, to understand a document, we run OCR first and then run the result through another process for the actual understanding.
    • Issue: the OCR output is often not what you want, e.g. no spatial understanding (text from different lines gets merged, etc.). Using an OCR-free approach might help.
    • See OCR
  • OCR-free

OpenCV

VLMs

See VLM(Vision Language Models)

3D

https://news.ycombinator.com/item?id=43589989 https://github.com/VAST-AI-Research/TripoSG

Others

Resources