Can RTX 5090 run Phi-4?

Yes, runs locally

~78 tok/sec: instant, feels like typing with no noticeable delay.

Your VRAM: 32 GB
Model size: 14B
Best quant: Q8_0
VRAM needed: 15.0 GB

The verdict

The RTX 5090 (32 GB VRAM) handles Phi-4 comfortably using the Q8_0 quantization, which fits in 15.0 GB. Expected throughput is around 78 tokens/second, which feels instant in interactive use, like typing with no noticeable delay. Phi-4 is Microsoft's 14B-parameter model, and it punches well above its weight on reasoning.

Setup tutorial: Phi-4 on RTX 5090

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Phi-4 runs exceptionally well on the NVIDIA GeForce RTX 5090, earning a Grade S performance rating with the Q8_0 quantization. Expect roughly 76 to 78 tok/sec with snappy responsiveness.

Prerequisites

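Before starting, make sure you have at least 15 GB of free disk space, a supported operating system (Windows or Linux), and a recent NVIDIA driver. The RTX 5090 is a Blackwell GPU and needs a driver from the 570 branch or later; the 525-series drivers predate it. Ollama ships its own CUDA runtime libraries, so a separate CUDA toolkit install is not required. If you build llama.cpp yourself instead, use CUDA 12.8 or later, the first release with Blackwell support.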
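The prerequisites can be checked from a terminal before installing anything. The following is a minimal sketch for Linux, not an official check: `version_ge` is a hypothetical helper, `MIN_DRIVER=570` is an assumption based on RTX 50-series support starting with that driver branch, and the disk check assumes models will be stored under your home directory.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_DRIVER="570"   # assumption: first driver branch that supports the RTX 5090
NEED_GB=15         # free disk space the tutorial asks for

if command -v nvidia-smi >/dev/null 2>&1; then
  # Print GPU name and total VRAM, then compare the driver version.
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$driver" "$MIN_DRIVER"; then
    echo "driver $driver: OK"
  else
    echo "driver $driver: older than $MIN_DRIVER, update it"
  fi
else
  echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
fi

# Free space, in whole GiB, on the filesystem that holds the home directory.
free_gb="$(df -Pk "${HOME:-/}" | awk 'NR==2 { print int($4 / 1024 / 1024) }')"
if [ "$free_gb" -ge "$NEED_GB" ]; then
  echo "disk: ${free_gb} GiB free, OK"
else
  echo "disk: only ${free_gb} GiB free, need ${NEED_GB}"
fi
```

If Ollama runs as a system service, its model directory may live elsewhere, so point the `df` check at that path instead.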

Expected performance

With the Q8_0 quantization, you can expect a token generation rate of roughly 76 to 78 tok/sec, using approximately 15.0 GB of VRAM. This leaves 17.0 GB of VRAM headroom, enough for a context window of up to 16,384 tokens without running into memory constraints.
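To see why the headroom is ample, it helps to estimate the KV cache at full context. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes Phi-4's published shape (40 layers, 10 key/value heads, head dimension 128) and fp16 cache entries; none of these figures come from this page, so treat them as assumptions. `kv_cache_mib` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumed Phi-4 architecture values (not stated in this tutorial).
LAYERS=40
KV_HEADS=10
HEAD_DIM=128
BYTES_PER_VALUE=2   # fp16 cache entries

# kv_cache_mib TOKENS: MiB of KV cache needed for a context of TOKENS tokens.
# The leading 2 accounts for storing both keys and values.
kv_cache_mib() {
  echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * $1 / 1024 / 1024 ))
}

echo "KV cache at 16384 tokens: $(kv_cache_mib 16384) MiB"   # prints 3200 MiB
echo "Headroom after weights:   $(( 32 - 15 )) GB"           # prints 17 GB
```

Under these assumptions the cache needs about 3.1 GiB at 16,384 tokens, a small fraction of the 17 GB left after loading the weights. Runtime buffers add some overhead on top.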

1. Install runtime: Ollama

# Linux; on Windows, run the installer from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
ollama --version

2. Download the model

Pull the Q8_0 quantized Phi-4 model (a 14.5 GB file) from the bartowski/phi-4-GGUF repository on Hugging Face.

ollama pull hf.co/bartowski/phi-4-GGUF:Q8_0

3. Run it

ollama run hf.co/bartowski/phi-4-GGUF:Q8_0
# In a second terminal, confirm the model is loaded fully on the GPU:
ollama ps
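To compare your own throughput with the figure quoted above, Ollama can print timing statistics after each reply. The sketch below assumes the model tag `hf.co/bartowski/phi-4-GGUF:Q8_0`; substitute whatever name `ollama list` shows on your machine. `tok_per_sec` is a hypothetical helper for readers using the REST API, where the `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds).

```shell
#!/usr/bin/env bash
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS: generation speed in tokens per second.
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

MODEL="hf.co/bartowski/phi-4-GGUF:Q8_0"   # assumed tag; check `ollama list`

if command -v ollama >/dev/null 2>&1; then
  # --verbose prints timing statistics, including the eval rate, after the reply.
  ollama run "$MODEL" "Explain KV caching in two sentences." --verbose
  # The PROCESSOR column should read "100% GPU" when nothing spilled to the CPU.
  ollama ps
else
  echo "ollama not found on PATH; install it first" >&2
fi

# Example: 780 tokens generated in 10 seconds.
tok_per_sec 780 10000000000   # prints 78.0
```

A short prompt mostly measures generation speed; long prompts add prompt-processing time, which Ollama reports separately.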

4. Optimize for RTX 5090

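Ollama offloads layers to the GPU automatically, and with 32 GB of VRAM every layer of Phi-4 fits, so no manual layer count is needed. If you do override it, note that Ollama's num_gpu parameter (the counterpart of llama.cpp's --n-gpu-layers) is a number of layers, not of parameters. Enable flash attention by setting the OLLAMA_FLASH_ATTENTION=1 environment variable before starting the Ollama server; it can speed up inference and reduce memory use at long contexts. Tensor parallelism (for example, vLLM's --tensor-parallel-size 2) applies only to multi-GPU systems and is not an Ollama option, so it does not help on a single RTX 5090.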
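One way to apply these settings with Ollama is a small Modelfile plus an environment variable. This is a sketch under stated assumptions: the model name `phi4-16k` is hypothetical, the base tag `hf.co/bartowski/phi-4-GGUF:Q8_0` is assumed to be the one you pulled, and the 16,384-token context is taken from the figures above.

```shell
#!/usr/bin/env bash
# Write a Modelfile that pins the context window for a derived model.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/phi-4-GGUF:Q8_0
PARAMETER num_ctx 16384
EOF

if command -v ollama >/dev/null 2>&1; then
  # Register the derived model under a hypothetical name.
  ollama create phi4-16k -f Modelfile
  echo "Created phi4-16k; start it with: ollama run phi4-16k"
fi

# Flash attention is a server-side setting, so it must be in the environment
# of the process that runs the server, for example:
#   OLLAMA_FLASH_ATTENTION=1 ollama serve
```

When Ollama runs as a systemd service, set the variable in the service's environment rather than in your shell, then restart the service.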

Troubleshooting

Out of memory error during inference

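Lower the context length (num_ctx), which shrinks the KV cache, or switch to a smaller quantization such as Q6_K or Q4_K_M. As a last resort, offload fewer layers by lowering num_gpu. The value is a layer count, so it must be below the model's total number of layers to have any effect, and every layer left on the CPU slows generation.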

Slow token generation

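Run ollama ps and check that the model is listed as 100% GPU; any CPU share means some layers have spilled to system memory. Also make sure flash attention is enabled by setting OLLAMA_FLASH_ATTENTION=1 before starting the Ollama server.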

Inconsistent performance across runs

Check for background processes that may be consuming GPU resources (nvidia-smi lists them) and close them. Also make sure the NVIDIA driver and Ollama are up to date.

Alternative runtimes

While Ollama is the recommended runtime for Phi-4 on the NVIDIA GeForce RTX 5090, you can also consider LM Studio for a more user-friendly interface, llama.cpp for more advanced customization options, or Jan for lightweight deployment scenarios. Choose an alternative based on your specific needs for ease of use, customization, or resource efficiency.


FAQ

What GPU do I need to run Phi-4?

To run Phi-4, you need a GPU with at least 8.9 GB of VRAM when using a lower-bit quantization; 15.0 GB is recommended so that the Q8_0 build fits entirely on the GPU.

Is Phi-4 good for coding?

Yes, Phi-4 is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 16,384 tokens.

Phi-4 vs Llama 3.1 8B?

Phi-4 has 14 billion parameters compared to Llama 3.1's 8 billion, making it more powerful for complex tasks but requiring more VRAM.

Can I run Phi-4 on a Mac?

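Yes. Apple Silicon Macs run Phi-4 through Ollama or llama.cpp using Metal, with the GPU sharing the system's unified memory. What matters is having enough total memory for the chosen quantization, not a discrete AMD or NVIDIA card.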

How much VRAM does Phi-4 need?

Phi-4 requires between 8.9 GB and 15.0 GB of VRAM, depending on the quantization level used.
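The dependence on quantization follows from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is an estimate of weight size only, not a measured VRAM figure. It uses the 14B parameter count from this page and the block layouts of two GGUF formats: Q8_0 stores 8.5 bits per weight and Q4_0 stores 4.5. `weights_gb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT: approximate weight size in GB.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

echo "Q8_0: $(weights_gb 14 8.5) GB"   # prints Q8_0: 14.9 GB
echo "Q4_0: $(weights_gb 14 4.5) GB"   # prints Q4_0: 7.9 GB
```

Actual VRAM use is somewhat higher than these numbers because the KV cache and runtime buffers sit on top of the weights.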

Is Phi-4 censored?

Phi-4 is an instruction-tuned model that went through safety-focused post-training, so it declines some requests. Applications can add their own filtering on top through system prompts or moderation layers.

Is Phi-4 commercial-use allowed?

Yes, Phi-4 is released under the MIT License, which permits commercial use. The main condition is that the copyright and license notice be kept with copies of the software.

Phi-4 context length?

Phi-4 has a context length of 16,384 tokens, allowing it to handle longer sequences of text effectively.
