Can RTX 5090 run Phi-3.5 Mini 3.8B?

Yes — runs locally

~168 tok/sec · Instant — feels like typing. No noticeable delay.

Your VRAM: 32 GB
Model size: 3.8B
Best quant: Q8_0
VRAM needed: 4.3 GB

The verdict

The RTX 5090 (32 GB VRAM) handles Phi-3.5 Mini 3.8B comfortably using the Q8_0 quantization, which fits in 4.3 GB. Expected throughput is around 168 tokens/second, which feels instant in interactive use, like typing with no noticeable delay. Phi-3.5 Mini is a small but capable 3.8B model that runs on almost any hardware, including phones.

Setup tutorial: Phi-3.5 Mini 3.8B on RTX 5090

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

The Phi-3.5 Mini 3.8B model runs at Grade S on the NVIDIA GeForce RTX 5090 with Q8_0 quantization, achieving ~356 tok/sec.

Prerequisites

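Before starting, make sure you have at least 5 GB of free disk space, a 64-bit version of Windows or Linux, and an NVIDIA driver recent enough to support the RTX 5090. Blackwell-generation cards need the R570 driver branch or later, which corresponds to CUDA 12.8. Ollama bundles the CUDA libraries it needs, so a separate CUDA toolkit installation is not normally required.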

Expected performance

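With the recommended settings, you can expect Phi-3.5 Mini 3.8B to reach ~356 tok/sec while using 4.3 GB of VRAM for the Q8_0 weights. That leaves about 27.7 GB of VRAM for the KV cache and runtime buffers. The model supports a context window of up to 131,072 tokens, but KV-cache memory grows linearly with context length, so the full window may not fit in VRAM at 16-bit cache precision. Treat the practical maximum as something to measure on your own system.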
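To see how context length translates into VRAM, the KV cache can be estimated with simple arithmetic. The sketch below is an illustration only: it assumes 32 layers, 32 key/value heads, and a head dimension of 96 (values believed to match the published Phi-3.5-mini configuration, but check the model's config.json), and it ignores runtime buffers and any cache quantization the runtime applies. The function name kv_cache_bytes is hypothetical.

```python
# Rough KV-cache sizing for a decoder-only transformer.
# ASSUMPTION: 32 layers, 32 KV heads, and head dimension 96 are believed to
# match the published Phi-3.5-mini config; verify against config.json.

GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=96,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens of context."""
    # The leading 2 accounts for storing both a key and a value per position.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token

for ctx in (8192, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / GIB:5.1f} GiB at 16-bit")
```

Under these assumptions a 16-bit cache costs 384 KiB per token, so the full 131,072-token window would need far more than the 27.7 GB of headroom, while 32,768 tokens would fit comfortably. A shorter context or a lower-precision cache would be needed to go further.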

1. Install runtime: Ollama

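# Linux install script; on Windows, use the installer from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
# Confirm the install succeeded
ollama --version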

2. Download the model

Download the Q8_0 quantized Phi-3.5 Mini 3.8B model (a file of roughly 4 GB) from Hugging Face.

ollama pull hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0

3. Run it

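ollama run hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0

4. Optimize for RTX 5090

Ollama offloads all model layers to the GPU automatically when they fit in VRAM, which they do on the NVIDIA GeForce RTX 5090 with 32 GB, so no layer-count flag is needed. Note that ollama run does not accept llama.cpp-style flags such as --n-gpu-layers, --flash-attn, or --context-length. To enable flash attention, set the environment variable OLLAMA_FLASH_ATTENTION=1 before starting the Ollama server. To change the context window, type /set parameter num_ctx 32768 inside the interactive session, or pass num_ctx as a request option through the API. With 4.3 GB of VRAM used by the weights, about 27.7 GB of headroom remains for the KV cache, so start with a moderate context length and raise it while watching VRAM usage.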
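Context length and GPU offload can also be set per request through Ollama's local REST API. The following is a minimal sketch, assuming the server is listening on its default port (11434) and the model has already been pulled under the tag shown. The helper names build_payload and generate, and the default num_ctx value, are illustrative choices and not part of Ollama.

```python
import json
import urllib.request

# ASSUMPTIONS: Ollama is serving on its default local port and the model tag
# below has already been pulled; adjust both to match your setup.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0"

def build_payload(prompt, num_ctx=32768, num_gpu=None):
    """Build the JSON body for a non-streaming generate request."""
    options = {"num_ctx": num_ctx}  # context window, in tokens
    if num_gpu is not None:
        options["num_gpu"] = num_gpu  # number of layers to offload to the GPU
    return {"model": MODEL, "prompt": prompt, "stream": False,
            "options": options}

def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Summarize flash attention in one sentence.", num_ctx=8192) would return the model's reply as a string. Leaving num_gpu unset keeps Ollama's automatic offload, which is usually the right choice on a 32 GB card.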

Troubleshooting

Out of memory errors during inference

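Lower the context window (num_ctx) first, since KV-cache memory grows with context length. If that is not enough, reduce the number of GPU-offloaded layers with the num_gpu parameter (for example, to 24 or 16) so that part of the model runs on the CPU.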

Slow token generation speed

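Make sure flash attention is enabled (OLLAMA_FLASH_ATTENTION=1) and that your NVIDIA driver is up to date. Run ollama ps to confirm the model is loaded entirely on the GPU and is not split with the CPU.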

Model fails to load

Verify that the model file is correctly downloaded and not corrupted. Try re-downloading the model.

Alternative runtimes

For users preferring different runtimes, consider LM Studio for a more user-friendly interface, llama.cpp for fine-grained control over quantization and performance tuning, or Jan for a lightweight, easy-to-deploy solution. Ollama is recommended for its ease of use and robust performance on the NVIDIA GeForce RTX 5090.

FAQ

What GPU do I need to run Phi-3.5 Mini 3.8B?

Phi-3.5 Mini 3.8B requires a GPU with at least 2.7 GB of VRAM, but 4.3 GB is recommended for optimal performance.

Is Phi-3.5 Mini 3.8B good for coding?

Phi-3.5 Mini 3.8B can generate code and assist with coding, but as a 3.8B-parameter model it is best suited to simpler coding tasks.

Phi-3.5 Mini 3.8B vs Llama 3.1 8B?

Phi-3.5 Mini 3.8B has 3.8B parameters, making it smaller and more resource-efficient than Llama 3.1 8B, which has 8B parameters and requires more VRAM and computational power.

Can I run Phi-3.5 Mini 3.8B on a Mac?

Yes, Phi-3.5 Mini 3.8B can run on a Mac. Apple silicon Macs use unified memory shared between the CPU and GPU, not dedicated VRAM, so the Mac needs at least 2.7 GB of memory available to the model.

How much VRAM does Phi-3.5 Mini 3.8B need?

Phi-3.5 Mini 3.8B requires a minimum of 2.7 GB of VRAM, but 4.3 GB is recommended for better performance, depending on the quantization level.
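The dependence on quantization level can be approximated as parameters times bits per weight, divided by 8. The sketch below is an estimate only: it uses the 3.8B parameter count quoted on this page, includes Q4_0 purely as an illustrative lower-precision format, and ignores the KV cache, runtime buffers, and any tensors a given GGUF file stores at a different precision. The name weight_gb is hypothetical.

```python
# Approximate size of quantized weights: parameters x bits per weight / 8.
# Q8_0 stores 32 int8 weights plus one 16-bit scale per block (8.5 bits/weight).
# Q4_0 stores 32 4-bit weights plus one 16-bit scale per block (4.5 bits/weight).
# ASSUMPTION: every tensor uses the same format, which real files may not do.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_0": 4.5}

def weight_gb(n_params, quant):
    """Estimated weight size in decimal gigabytes."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: {weight_gb(3.8e9, quant):.2f} GB")
```

By this estimate the Q8_0 weights come to roughly 4.0 GB, which is consistent with the 4.3 GB recommendation once some runtime overhead is added.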

Is Phi-3.5 Mini 3.8B censored?

Phi-3.5 Mini 3.8B is an instruction-tuned model that has undergone safety post-training, so it may refuse requests for harmful or inappropriate content. The model weights do not include a separate content filter.

Is Phi-3.5 Mini 3.8B commercial-use allowed?

Yes, Phi-3.5 Mini 3.8B is licensed under the MIT License, which allows for commercial use.

Phi-3.5 Mini 3.8B context length?

Phi-3.5 Mini 3.8B supports a context length of 131,072 tokens (128K), which is enough for long documents and extended conversations.