Vision & Multimodal AI Models
Vision and multimodal models understand both text and images, enabling tasks like image captioning, visual question answering, document understanding, and optical character recognition (OCR). These models accept image inputs alongside text prompts and generate text responses that describe or analyze the visual content. They range from lightweight 2B-parameter models to full-featured 7B+ models.
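As a rough illustration of that image-plus-prompt pattern, here is a minimal sketch using the Hugging Face transformers image-text-to-text pipeline with Qwen2-VL 2B (the default multimodal model below); the image URL and question are placeholders, and the pipeline requires a recent transformers release.

```python
from transformers import pipeline

# Assumes a recent transformers release that ships the
# "image-text-to-text" pipeline; URL and prompt are placeholders.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/photo.jpg"},
    {"type": "text", "text": "What is in this picture?"},
]}]
print(pipe(text=messages, max_new_tokens=64, return_full_text=False))
```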
Moondream
Moondream 2
Ultra-compact vision model. Only about 1 GB. Answers questions about images.
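A minimal sketch of querying Moondream 2 via transformers, assuming the vikhyatk/moondream2 checkpoint and the encode_image/answer_question helpers its remote code has exposed; these method names can change between revisions, and the image path is a placeholder.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Moondream 2 ships its own modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")

image = Image.open("photo.jpg")     # placeholder path
embeds = model.encode_image(image)  # encode once, then ask any number of questions
print(model.answer_question(embeds, "What is in this image?", tokenizer))
```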
Alibaba
Qwen2-VL 2B
Compact vision-language model and the default multimodal model. Understands images and answers questions about them.
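For finer control than the pipeline above, Qwen2-VL 2B can be driven through its dedicated transformers classes (added in transformers 4.45); a sketch with a placeholder image path:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style message: an image slot followed by the text prompt.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("photo.jpg")], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
# Trim the prompt tokens so only the model's answer is decoded.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```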
Microsoft
Phi-3.5 Vision
Vision-language model from Microsoft. Understands images and documents.
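Phi-3.5 Vision loads with trust_remote_code and references images through numbered <|image_1|> placeholders in the prompt; a sketch following the pattern on the model card, with the image path and question as placeholders:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Each image gets a numbered placeholder inside the prompt text.
messages = [{"role": "user", "content": "<|image_1|>\nSummarize this page."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [Image.open("page.png")], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```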
LLaVA
LLaVA 1.6 7B
Multimodal vision-language model. Understands images and answers questions about them.
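transformers exposes LLaVA 1.6 through its LlavaNext classes; a minimal sketch assuming the llava-hf/llava-v1.6-mistral-7b-hf checkpoint and a release recent enough for the processor's chat template:

```python
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

# Image slot plus question; "photo.jpg" is a placeholder path.
conversation = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is shown in this image?"},
]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=Image.open("photo.jpg"), text=prompt, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```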
Google
PaliGemma 3B
Google's vision-language model. Strong at visual QA, captioning, and OCR.
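PaliGemma is prompted with short task strings ("caption en", "ocr", "answer en <question>") rather than chat formatting; a sketch assuming the google/paligemma-3b-mix-224 checkpoint, which is gated and requires accepting the license on Hugging Face:

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Task prefix instead of free-form chat; "sign.jpg" is a placeholder.
inputs = processor(text="ocr", images=Image.open("sign.jpg"), return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```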
OpenBMB
MiniCPM-V 2.6
Efficient multimodal model with strong image understanding. Optimized for edge devices.
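MiniCPM-V 2.6 exposes a chat-style helper through its remote code on Hugging Face; a sketch assuming the openbmb/MiniCPM-V-2_6 checkpoint, whose chat() signature may change between revisions:

```python
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# The repo's chat() helper takes interleaved images and text in msgs;
# "photo.jpg" is a placeholder path.
image = Image.open("photo.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What objects are on the table?"]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```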