Can RTX 4090 run Phi-4?

Yes — runs locally

~66 tok/sec · Instant — feels like typing. No noticeable delay.

Your VRAM: 24 GB
Model size: 14B
Best quant: Q8_0
VRAM needed: 15.0 GB

The verdict

The RTX 4090 (24 GB VRAM) handles Phi-4 comfortably using the Q8_0 quantization, which fits in 15.0 GB. Expected throughput is around 66 tokens/second, which feels instant in interactive use, like typing with no noticeable delay. Phi-4 is Microsoft's 14B-parameter model, and it punches well above its weight on reasoning.
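
As a rough sanity check on the throughput figure: single-stream token generation is usually limited by memory bandwidth, because producing each token means reading roughly all of the weights once. The sketch below computes that ceiling. The 1008 GB/s value is the RTX 4090's published memory bandwidth, an assumption not stated on this page, and the function name is illustrative. Measured speed should land somewhat below the ceiling.

```python
def decode_ceiling_tok_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/second when decoding is memory-bandwidth bound.

    Each generated token reads about all the weights once, so the ceiling
    is simply bandwidth divided by the size of the weights.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


RTX_4090_BANDWIDTH_GB_S = 1008.0  # assumption: published spec, not from this page
PHI4_Q8_0_GB = 15.0               # VRAM figure quoted on this page

ceiling = decode_ceiling_tok_per_sec(RTX_4090_BANDWIDTH_GB_S, PHI4_Q8_0_GB)
print(f"bandwidth ceiling: {ceiling:.1f} tok/sec")  # prints 67.2
```

A figure in the high 50s to mid 60s is therefore plausible for this card, while anything far below that suggests the model is not fully on the GPU.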

Setup tutorial: Phi-4 on RTX 4090

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Run Phi-4 (14B parameters) on an NVIDIA GeForce RTX 4090 with Q8_0 quantization for Grade S performance at roughly 57 to 66 tok/sec. This requires 15.0GB of VRAM, leaving 9.0GB for context.

Prerequisites

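Before starting, make sure you have at least 15GB of free disk space, a supported operating system (Windows or Linux), and a recent NVIDIA driver (version 525.60.13 or later). Ollama ships with its own CUDA libraries, so a separate CUDA toolkit install is normally not needed; if you build llama.cpp from source instead, install CUDA 11.8 or later.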

Expected performance

With the Q8_0 quantization, expect roughly 57 to 66 tok/sec with about 15.0GB of VRAM in use. The remaining 9.0GB holds the KV cache and runtime buffers, which leaves room for Phi-4's full 16,384-token context window. Generation speed tends to drop as the context fills up.
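
To see why the leftover VRAM is enough for the full context, the KV cache size can be estimated from the model's attention layout. This is a minimal sketch: the architecture values (40 layers, 10 key/value heads, head dimension 128) are assumptions taken from Phi-4's published configuration rather than from this page, and a 16-bit cache is assumed.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # Factor 2: one tensor for keys and one for values in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# Assumed Phi-4 architecture values (not stated on this page).
PHI4 = {"n_layers": 40, "n_kv_heads": 10, "head_dim": 128}

HEADROOM_GB = 9.0   # 24 GB total minus 15.0 GB for weights, from this page
FULL_CONTEXT = 16384

full_ctx_bytes = kv_cache_bytes(FULL_CONTEXT, **PHI4)
print(f"KV cache at {FULL_CONTEXT} tokens: {full_ctx_bytes / 1e9:.2f} GB")  # 3.36 GB
print("fits in headroom:", full_ctx_bytes / 1e9 < HEADROOM_GB)             # True
```

Under these assumptions the cache costs about 200 KiB per token, so the full window uses a little over a third of the 9.0GB headroom.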

1. Install the runtime: Ollama

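On Linux, use the official install script below; on Windows, run the installer from https://ollama.com/download. Note that pip install ollama only installs the Python client library, not the runtime. Ollama detects the NVIDIA GPU automatically, so there is no device setting to configure.

curl -fsSL https://ollama.com/install.sh | sh
ollama --version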

2. Download the model

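Download the Phi-4 Q8_0 quantized model (14.5GB file) from Hugging Face. Ollama can pull GGUF files directly from a Hugging Face repository using the hf.co/ prefix, with the quantization as the tag.

ollama pull hf.co/bartowski/phi-4-GGUF:Q8_0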

3. Run it

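ollama run hf.co/bartowski/phi-4-GGUF:Q8_0
ollama ps

The first command opens an interactive chat session; Ollama has no separate chat subcommand, and ollama run does not accept llama.cpp-style flags such as --n-gpu-layers. Run ollama ps in a second terminal to confirm that the model is loaded on the GPU.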
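
Besides the interactive prompt, a running Ollama server can be queried over its local REST API, which also reports the timing data needed to measure tokens per second. The following is a sketch rather than a definitive client: the model name is an assumption (use whatever name ollama list shows on your machine), and the helper function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "hf.co/bartowski/phi-4-GGUF:Q8_0"  # assumption: adjust to your local model name


def build_payload(prompt: str, model: str = MODEL, num_ctx: int = 16384) -> dict:
    """Request body for a single, non-streaming completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed; Ollama reports eval_duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def generate(prompt: str) -> tuple:
    """Send the prompt and return (response text, measured tok/sec)."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["response"], tokens_per_second(body["eval_count"], body["eval_duration"])
```

With the server running, text, speed = generate("Explain quantization in one sentence.") returns the reply along with the measured generation speed, which can be compared against the estimate on this page.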

4. Optimize for RTX 4090

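Ollama offloads as many layers as fit in VRAM automatically, and at 15.0GB the whole Q8_0 model fits within the RTX 4090's 24GB, so no layer-count flag is needed. To enable flash attention, set the environment variable OLLAMA_FLASH_ATTENTION=1 before starting the Ollama server. The --n-gpu-layers and --flash-attn flags come from llama.cpp, not Ollama; if you run llama.cpp directly, any --n-gpu-layers value at or above the model's layer count offloads everything. Tensor parallelism splits a model across multiple GPUs, so it has nothing to offer on a single RTX 4090. With the model fully on the GPU, expect roughly 57 to 66 tok/sec.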
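
One way to confirm that the model really is fully on the GPU is to ask the Ollama server which models are loaded and how much of each sits in VRAM. This sketch assumes the server's /api/ps endpoint returns a size and a size_vram field per model; the function names are illustrative, and the sample entry uses made-up sizes.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model that resides in VRAM (1.0 means fully on GPU)."""
    size = model_entry.get("size", 0)
    if not size:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])


# Illustrative entry with made-up sizes, shaped like one item of the response.
sample = {"name": "example-model", "size": 16_000_000_000, "size_vram": 16_000_000_000}
print(f"{sample['name']}: {gpu_fraction(sample):.0%} on GPU")
```

Against a live server, [gpu_fraction(m) for m in loaded_models()] should show 1.0 for Phi-4; a lower value means part of the model has spilled to system memory.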

Troubleshooting

Out of memory error during inference

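Close other applications that are using VRAM, then lower the context length (the num_ctx parameter in Ollama) to shrink the KV cache. If the model still does not fit, offload fewer layers with the num_gpu parameter or switch to a smaller quantization such as Q4_K_M.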

Slow token generation speed

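Run ollama ps while the model is loaded and check that the PROCESSOR column shows 100% GPU; a CPU/GPU split means part of the model has spilled to system memory. Also confirm that the NVIDIA driver is working (nvidia-smi should list the RTX 4090) and that flash attention is enabled with OLLAMA_FLASH_ATTENTION=1.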

Model fails to load

Verify the integrity of the downloaded model file and try re-downloading it.

Alternative runtimes

LM Studio, llama.cpp, and Jan are alternatives for specific use cases. LM Studio offers a graphical interface for downloading and running models, llama.cpp gives finer control over quantization and performance tuning, and Jan is an open-source desktop chat application. All of them can use the NVIDIA GeForce RTX 4090; Ollama is recommended here for its ease of use.


FAQ

What GPU do I need to run Phi-4?

To run Phi-4, you need a GPU with at least 8.9 GB of VRAM for a lower-bit quantization. 15.0 GB is recommended, since it allows the Q8_0 quantization, which stays closer to the original model's quality.

Is Phi-4 good for coding?

Yes, Phi-4 is well-suited for coding tasks thanks to its strong reasoning capabilities. Its 16,384-token context length is enough for individual files and functions, though not for an entire large codebase in one prompt.

Phi-4 vs Llama 3.1 8B?

Phi-4 has 14 billion parameters compared to Llama 3.1 8B's 8 billion, making it more capable on complex tasks but requiring more VRAM. Llama 3.1 8B, in turn, supports a much longer context of 128K tokens.

Can I run Phi-4 on a Mac?

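Yes. On Apple Silicon Macs, Phi-4 runs on the built-in GPU through Metal and uses unified memory, so what matters is total system memory rather than a discrete AMD or NVIDIA card. The Mac needs enough memory to hold the 8.9 GB to 15.0 GB model alongside the operating system.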

How much VRAM does Phi-4 need?

Phi-4 requires between 8.9 GB and 15.0 GB of VRAM, depending on the quantization level used.

Is Phi-4 censored?

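Phi-4 went through safety-focused post-training, so it declines some harmful requests. When run locally there is no external content filter; any further filtering depends on the application and configuration you run it with.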

Is Phi-4 commercial-use allowed?

Yes, Phi-4 is released under the MIT License, which permits commercial use. The main condition is that the copyright and license notice are kept with copies of the model.

Phi-4 context length?

Phi-4 has a context length of 16,384 tokens, allowing it to handle longer sequences of text effectively.
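
The VRAM figures quoted in this FAQ follow largely from bits per weight. As a rough sketch: GGUF's Q8_0 format stores blocks of 32 weights as 32 one-byte values plus a 2-byte scale, which is 8.5 bits per weight, and Q4_0 packs the same block into 18 bytes, or 4.5 bits per weight. These block layouts and the function name are assumptions not stated on this page, and the 14B parameter count is the rounded figure used above.

```python
# Effective bits per weight, including the per-block scale factor.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,  # 34 bytes per block of 32 weights
    "Q4_0": 4.5,  # 18 bytes per block of 32 weights
}


def weights_gb(n_params: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


PHI4_PARAMS = 14e9  # rounded parameter count used on this page

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: {weights_gb(PHI4_PARAMS, quant):.2f} GB")
```

The Q8_0 estimate comes out just under 15 GB, in line with the 15.0 GB quoted above. Mixed-precision 4-bit variants use somewhat more bits per weight than plain Q4_0, so their files are larger than the Q4_0 estimate.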
