Best vision models

Local GPT-4V replacements

Image-in / text-out — describe screenshots, parse documents, count objects, read receipts. All work fully offline.

  1. LLaVA 1.6 7B (LLaVA)

    Multimodal vision-language model. Understands images and answers questions about them.

    7B parameters · 5 GB
  2. Qwen2-VL 2B (Alibaba)

    Compact vision-language model and the default multimodal pick. Understands images and answers questions about them.

    2.2B parameters · 1.42 GB
  3. MiniCPM-V 2.6 (OpenBMB)

    Efficient multimodal model with strong image understanding. Optimized for edge devices.

    2B parameters · 2.1 GB
  4. Phi-3.5 Vision (Microsoft)

    Vision-language model that understands both images and documents.

    4.2B parameters · 3.2 GB
  5. Moondream 2 (Moondream)

    Ultra-compact vision model with a roughly 1.5 GB footprint. Answers questions about images.

    1.8B parameters · 1.5 GB
  6. PaliGemma 3B (Google)

    Google's vision model. Strong at visual QA, captioning, and OCR.

    3B parameters · 2.5 GB
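All of the models above are typically served through a local runtime such as Ollama, which accepts the image as base64 alongside the prompt. The sketch below builds such a request body; it assumes Ollama's `/api/generate` schema (the field names, the `stream` flag, and the `llava:7b` model tag are assumptions, not prescriptions from this list):

```python
import base64
import json


def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON body for a local image-in / text-out endpoint.

    Assumes an Ollama-style /api/generate schema: the image is sent
    as a base64 string in an "images" array next to the text prompt.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete response, not a token stream
    }
    return json.dumps(payload)


# Example: ask LLaVA to describe a screenshot (placeholder bytes here;
# in practice you would read a real PNG/JPEG file).
body = build_vision_request("llava:7b", "Describe this screenshot.", b"\x89PNG")
print(json.loads(body)["model"])  # llava:7b
```

The same body can then be POSTed to the runtime (for Ollama, `http://localhost:11434/api/generate`); swapping the model tag is all it takes to try Qwen2-VL or Moondream instead.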
