
Can RTX 5090 run Phi-3.5 Vision?

Grade: S

Yes — runs locally

~168 tok/sec · Instant — feels like typing. No noticeable delay.

Your VRAM: 32 GB
Model size: 4.2B parameters
Best quant: Q4_K_M
VRAM needed: 3.2 GB

The verdict

The RTX 5090 (32 GB VRAM) handles Phi-3.5 Vision comfortably using the Q4_K_M quantization, which fits in 3.2 GB. Expected throughput is around 168 tokens/second, which feels instant in interactive use, like typing with no noticeable delay. Phi-3.5 Vision is a vision-language model from Microsoft that can understand images and documents.

Setup tutorial: Phi-3.5 Vision on RTX 5090

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Phi-3.5 Vision runs at Grade S on the NVIDIA GeForce RTX 5090 with Q4_K_M quantization, achieving ~168 tok/sec.

Prerequisites

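Before starting, make sure you have at least 5 GB of free disk space, a compatible OS (Windows 10/11 or Linux), and a current NVIDIA driver. The RTX 5090 is a Blackwell GPU and needs a driver from the R570 branch or later (CUDA 12.8 or newer); the older 525.60.13 driver and CUDA 11.8 do not support it. Ollama bundles the CUDA libraries it needs, so only the driver has to be installed separately.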
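The disk-space and driver-tooling checks above can be scripted. The following is a minimal sketch, assuming the 5 GB figure from this page and that the presence of nvidia-smi on the PATH is a good enough sign that an NVIDIA driver is installed; the function names are illustrative, not part of any tool.

```python
import shutil

def free_disk_gb(path="."):
    """Return free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def meets_prerequisites(path=".", required_gb=5.0):
    """Check the two prerequisites that can be tested without touching the GPU."""
    return {
        # Enough room for the model download (5 GB is the figure from this page).
        "disk_ok": free_disk_gb(path) >= required_gb,
        # nvidia-smi ships with the NVIDIA driver, so finding it suggests a driver is present.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
```

Calling meets_prerequisites() returns a small dictionary; it does not verify the driver version, which still has to be read from the nvidia-smi output.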

Expected performance

With the Q4_K_M quantization, you can expect Phi-3.5 Vision to run at ~168 tok/sec while using approximately 3.2 GB of VRAM. This leaves 28.8 GB of VRAM for the KV cache and image embeddings. The model supports a context window of up to 131,072 tokens; how much of that fits in the remaining VRAM depends on the precision of the KV cache.
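How much context fits in the remaining VRAM can be estimated from the size of the KV cache. The sketch below assumes the language backbone of Phi-3.5 Vision has the same attention shape as Phi-3 Mini (32 layers, 32 key/value heads, head dimension 96). Those figures are not stated on this page, so treat them as assumptions and check them against the model's config.json.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=96, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    Each token stores one key and one value vector (the leading factor of 2)
    per layer. The default shape is an ASSUMPTION (Phi-3 Mini-like backbone).
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def max_tokens(free_bytes, **shape):
    """Return how many tokens of KV cache fit in `free_bytes`."""
    return free_bytes // kv_cache_bytes(1, **shape)

FREE_VRAM = 28_800_000_000  # 28.8 GB left after the 3.2 GB model, per this page

fp16_tokens = max_tokens(FREE_VRAM)                     # 16-bit cache
int8_tokens = max_tokens(FREE_VRAM, bytes_per_value=1)  # 8-bit cache
full_window_bytes = kv_cache_bytes(131_072)             # full context at 16 bits
```

Under these assumptions a 16-bit cache holds roughly 73,000 tokens in 28.8 GB, and the full 131,072-token window would need about 48 GiB. An 8-bit cache roughly halves that and would fit the full window. Image embeddings and runtime buffers take additional memory, so real limits will be somewhat lower.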

1. Install runtime: Ollama

# Linux (on Windows, run the installer from ollama.com instead)
curl -fsSL https://ollama.com/install.sh | sh
ollama --version

2. Download the model

Download the Q4_K_M quantized version of Phi-3.5 Vision (2.5GB file) from Hugging Face.

ollama pull hf.co/abetlen/Phi-3.5-vision-instruct-gguf:Q4_K_M

3. Run it

ollama run hf.co/abetlen/Phi-3.5-vision-instruct-gguf:Q4_K_M
# then, inside the interactive session (lower the value if the model spills out of VRAM):
/set parameter num_ctx 131072
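Since Phi-3.5 Vision takes images as input, it may help to see how an image reaches the model once it is running. This is a minimal sketch against Ollama's local REST API (the /api/generate endpoint on port 11434, with images passed as base64 strings). The model tag and image path you pass in are placeholders; use whatever tag `ollama list` reports on your machine.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt, image_bytes):
    """Build the JSON body for a single non-streaming vision request."""
    return {
        "model": model,
        "prompt": prompt,
        # The API expects each image as a base64-encoded string.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def describe_image(model, prompt, image_path, url=OLLAMA_URL):
    """Send one image plus a prompt to a running Ollama server and return its reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, a call such as describe_image("<your model tag>", "What is in this picture?", "photo.png") returns the model's text answer.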

4. Optimize for RTX 5090

The model needs only 3.2 GB, so Ollama can place every layer on the RTX 5090 without manual tuning; use the num_gpu parameter only if you need to control how many layers are offloaded. Enable flash attention by setting OLLAMA_FLASH_ATTENTION=1 before starting the Ollama server. If you run llama.cpp directly, the equivalent options are --n-gpu-layers and --flash-attn, and --n-gpu-layers 32 is the recommended setting here. Tensor parallelism across multiple GPUs is unnecessary for a model this small.

Troubleshooting

Out of memory errors during inference

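Reduce the context length (num_ctx) or the batch size to shrink the KV cache and compute buffers; increasing the batch size raises memory use rather than lowering it. If that is not enough, offload fewer layers to the GPU.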

Slow inference speed

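Run ollama ps to confirm the model is loaded entirely on the GPU rather than split with the CPU, make sure flash attention is enabled, and check that your NVIDIA driver is up to date.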
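To tell whether inference is actually slow, it helps to measure throughput rather than judge by feel. The sketch below computes tokens per second from the eval_count and eval_duration fields that Ollama includes in a completed /api/generate response (the duration is reported in nanoseconds). The function name is illustrative.

```python
def tokens_per_second(response):
    """Compute generation speed from a completed Ollama /api/generate response dict."""
    eval_count = response["eval_count"]            # number of tokens generated
    eval_duration_ns = response["eval_duration"]   # time spent generating, in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)
```

For a quick check without any code, `ollama run <model> --verbose` prints an eval rate after each reply.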

Model fails to load

Verify the integrity of the downloaded model file and try re-downloading it.
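If you downloaded the GGUF file by hand, one way to verify it is to compare its SHA-256 digest with the checksum shown on the file's Hugging Face page. This is a minimal sketch; the file path and the expected digest are values you supply yourself.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte model file never sits fully in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def file_matches(path, expected_hex):
    """Compare a file's digest with an expected checksum, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch means the download is corrupt or incomplete and should be fetched again.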

Alternative runtimes

Alternative runtimes like LM Studio, llama.cpp, and Jan can be used for more specialized setups. LM Studio offers a graphical interface and is suitable for users who prefer a GUI. llama.cpp provides more fine-grained control over quantization and is ideal for advanced users. Jan is lightweight and efficient, making it a good choice for systems with limited resources. However, Ollama is generally recommended for its ease of use and robust performance on the NVIDIA GeForce RTX 5090.


FAQ

What GPU do I need to run Phi-3.5 Vision?

To run Phi-3.5 Vision, you need a GPU with at least 3.2 GB of VRAM for the Q4_K_M quantization. Additional VRAM leaves room for longer contexts and higher-precision quantizations.

Is Phi-3.5 Vision good for coding?

Phi-3.5 Vision is primarily designed for vision and language tasks, such as understanding images and documents. It may not be as optimized for coding-specific tasks compared to models like Codex or CodeLlama.

Phi-3.5 Vision vs Llama 3.1 8B?

Phi-3.5 Vision has 4.2 billion parameters and is specialized for vision-language tasks, while Llama 3.1 8B is a text-only model with 8 billion parameters, making it more versatile for text generation but less suited for image understanding.

Can I run Phi-3.5 Vision on a Mac?

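Yes, you can run Phi-3.5 Vision on a Mac. Apple Silicon Macs share unified memory between the CPU and GPU, so you need at least 3.2 GB of free memory for the model rather than dedicated VRAM. GPU acceleration goes through Metal, which is built into macOS, so no additional drivers are required.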

How much VRAM does Phi-3.5 Vision need?

Phi-3.5 Vision requires about 3.2 GB of VRAM at the Q4_K_M quantization. The requirement varies with quantization level: higher-precision quantizations need more, lower-precision ones need less. More VRAM also helps with longer contexts and larger batch sizes.
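The way weight size scales with quantization can be approximated as parameters times bits per weight. The sketch below uses the 4.2B parameter count from this page; the bits-per-weight figures are approximate values commonly quoted for GGUF formats and are assumptions here, not measurements of this model.

```python
# Approximate bits per weight for common GGUF formats (ASSUMED, not measured here).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # mixed 4-bit/6-bit blocks; the effective rate varies by model
    "Q8_0": 8.5,     # 8-bit values plus one 16-bit scale per block of 32
    "F16": 16.0,
}

def weights_size_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

# Estimated weight size for each format at 4.2B parameters.
sizes = {name: weights_size_gb(4.2, bits) for name, bits in BITS_PER_WEIGHT.items()}
```

For Q4_K_M this gives about 2.5 GB, in line with the 2.5GB download quoted in the tutorial; the 3.2 GB VRAM figure is higher because the runtime also needs memory beyond the weights themselves.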

Is Phi-3.5 Vision censored?

Phi-3.5 Vision went through safety post-training at Microsoft, so it may decline requests it judges harmful. Users can configure additional safety measures as needed.

Is Phi-3.5 Vision commercial-use allowed?

Yes, Phi-3.5 Vision is licensed under the MIT License, which allows for commercial use. However, always review the specific terms of the license to ensure compliance.

Phi-3.5 Vision context length?

Phi-3.5 Vision has a context length of 131,072 tokens, allowing it to process very long sequences of text and images effectively.
