Vision & Multimodal AI Models
Vision and multimodal models understand both text and images, enabling tasks like image captioning, visual question answering, document understanding, and optical character recognition (OCR). These models accept image inputs alongside text prompts and generate text responses that describe or analyze the visual content. They range from lightweight 2B-parameter models to full-featured 7B+ models.
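As a rough illustration of that image-plus-prompt pattern, here is a minimal sketch using the Hugging Face transformers image-text-to-text pipeline with Qwen2-VL 2B (the default multimodal model below); the image URL and question are placeholders, and the pipeline requires a recent transformers release.

```python
from transformers import pipeline

# Assumes a recent transformers release that ships the
# "image-text-to-text" pipeline; URL and prompt are placeholders.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/photo.jpg"},
    {"type": "text", "text": "What is in this picture?"},
]}]
print(pipe(text=messages, max_new_tokens=64, return_full_text=False))
```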
Moondream
Moondream 2
Ultra-compact vision model. Only about 1 GB. Answers questions about images.
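A minimal sketch of querying Moondream 2 via transformers, assuming the vikhyatk/moondream2 checkpoint and the encode_image/answer_question helpers its remote code has exposed; these method names can change between revisions, and the image path is a placeholder.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Moondream 2 ships its own modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")

image = Image.open("photo.jpg")     # placeholder path
embeds = model.encode_image(image)  # encode once, then ask any number of questions
print(model.answer_question(embeds, "What is in this image?", tokenizer))
```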
Alibaba
Qwen2-VL 2B
Compact vision-language model and the default multimodal model. Understands images and answers questions about them.
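For finer control than the pipeline above, Qwen2-VL 2B can be driven through its dedicated transformers classes (added in transformers 4.45); a sketch with a placeholder image path:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style message: an image slot followed by the text prompt.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("photo.jpg")], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
# Trim the prompt tokens so only the model's answer is decoded.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```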
Microsoft
Phi-3.5 Vision
Vision-language model from Microsoft. Understands images and documents.
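Phi-3.5 Vision loads with trust_remote_code and references images through numbered <|image_1|> placeholders in the prompt; a sketch following the pattern on the model card, with the image path and question as placeholders:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Each image gets a numbered placeholder inside the prompt text.
messages = [{"role": "user", "content": "<|image_1|>\nSummarize this page."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [Image.open("page.png")], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```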
LLaVA
LLaVA 1.6 7B
Multimodal vision-language model. Understands images and answers questions about them.
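transformers exposes LLaVA 1.6 through its LlavaNext classes; a minimal sketch assuming the llava-hf/llava-v1.6-mistral-7b-hf checkpoint and a release recent enough for the processor's chat template:

```python
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

# Image slot plus question; "photo.jpg" is a placeholder path.
conversation = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is shown in this image?"},
]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=Image.open("photo.jpg"), text=prompt, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```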
Google
PaliGemma 3B
Google's vision-language model. Strong at visual QA, captioning, and OCR.
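PaliGemma is prompted with short task strings ("caption en", "ocr", "answer en <question>") rather than chat formatting; a sketch assuming the google/paligemma-3b-mix-224 checkpoint, which is gated and requires accepting the license on Hugging Face:

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Task prefix instead of free-form chat; "sign.jpg" is a placeholder.
inputs = processor(text="ocr", images=Image.open("sign.jpg"), return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```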
OpenBMB
MiniCPM-V 2.6
Efficient multimodal model with strong image understanding. Optimized for edge devices.
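MiniCPM-V 2.6 exposes a chat-style helper through its remote code on Hugging Face; a sketch assuming the openbmb/MiniCPM-V-2_6 checkpoint, whose chat() signature may change between revisions:

```python
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# The repo's chat() helper takes interleaved images and text in msgs;
# "photo.jpg" is a placeholder path.
image = Image.open("photo.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What objects are on the table?"]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```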