Best vision models

Local GPT-4V replacements

Image-in / text-out — describe screenshots, parse documents, count objects, read receipts. All work fully offline.

  1. LLaVA 1.6 7B (LLaVA)

    Multimodal vision-language model. Understands images and answers questions about them.

    7B parameters · 5 GB
  2. Qwen2-VL 2B (Alibaba)

    Compact vision-language model and the default multimodal pick. Understands images and answers questions about them.

    2.2B parameters · 1.42 GB
  3. MiniCPM-V 2.6 (OpenBMB)

    Efficient multimodal model with strong image understanding. Optimized for edge devices.

    2B parameters · 2.1 GB
  4. Phi-3.5 Vision (Microsoft)

    Vision-language model that understands both images and documents.

    4.2B parameters · 3.2 GB
  5. Moondream 2 (Moondream)

    Ultra-compact vision model with a roughly 1.5 GB footprint. Answers questions about images.

    1.8B parameters · 1.5 GB
  6. PaliGemma 3B (Google)

    Google's vision model. Strong at visual QA, captioning, and OCR.

    3B parameters · 2.5 GB
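All of the models above are typically served through a local runtime such as Ollama, which accepts the image as base64 alongside the prompt. The sketch below builds such a request body; it assumes Ollama's `/api/generate` schema (the field names, the `stream` flag, and the `llava:7b` model tag are assumptions, not prescriptions from this list):

```python
import base64
import json


def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON body for a local image-in / text-out endpoint.

    Assumes an Ollama-style /api/generate schema: the image is sent
    as a base64 string in an "images" array next to the text prompt.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete response, not a token stream
    }
    return json.dumps(payload)


# Example: ask LLaVA to describe a screenshot (placeholder bytes here;
# in practice you would read a real PNG/JPEG file).
body = build_vision_request("llava:7b", "Describe this screenshot.", b"\x89PNG")
print(json.loads(body)["model"])  # llava:7b
```

The same body can then be POSTed to the runtime (for Ollama, `http://localhost:11434/api/generate`); swapping the model tag is all it takes to try Qwen2-VL or Moondream instead.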
