Vision & Multimodal AI Models

Vision and multimodal models process both text and images, enabling tasks such as image captioning, visual question answering, document understanding, and optical character recognition (OCR). They accept image inputs alongside text prompts and generate text responses that describe or analyze the visual content. Sizes range from lightweight 2B-parameter models to full-featured models with 7B or more parameters.

6 models available
1.4 GB min VRAM needed
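
As a rough guide to how model size relates to memory, the sketch below estimates the space taken by the weights alone from a parameter count and a bits-per-weight setting. The function name and the chosen bit widths are illustrative assumptions, not values taken from this listing, and the page does not state how its minimum VRAM figure was derived. Real usage is higher than this estimate because the vision encoder, activations, and context cache also need memory.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes.

    Hypothetical helper for illustration: it ignores activations, the
    context cache, and any runtime overhead.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Parameter counts mirror the 2B and 7B sizes mentioned above;
    # the bit widths (4, 8, 16) are assumed common precisions.
    for label, params in (("2B", 2e9), ("7B", 7e9)):
        for bits in (4, 8, 16):
            gb = estimate_weight_memory_gb(params, bits)
            print(f"{label} model at {bits}-bit: ~{gb:.1f} GB for weights")
```

Under these assumptions, a 2B-parameter model at 4 bits per weight needs about 1 GB for weights, so a minimum in the low single-digit gigabytes is plausible for the smallest models once overhead is included.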
