Can RTX 3080 Ti run Phi-4?

Yes — runs locally

~48 tok/sec · snappy in interactive use

Your VRAM: 12 GB

Model size: 14B

Best quant: Q4_K_M

VRAM needed: 8.9 GB

The verdict

The RTX 3080 Ti (12 GB VRAM) handles Phi-4, Microsoft's 14B-parameter model, comfortably using the Q4_K_M quantization, which fits in 8.9 GB. Expected throughput is around 48 tokens per second, which feels snappy in interactive use. Phi-4 punches well above its weight on reasoning.

Setup tutorial: Phi-4 on RTX 3080 Ti

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Phi-4 runs well on the NVIDIA GeForce RTX 3080 Ti (performance grade: A) using the Q4_K_M quantization. Expect around 48 tokens per second, which feels snappy in interactive use.

Prerequisites

Before starting, make sure you have at least 20GB of free disk space, a 64-bit version of Windows or Linux, and a recent NVIDIA driver (version 515.65 or later). Ollama ships its own CUDA libraries, so a separately installed CUDA toolkit (11.7 or later) is only needed if you build a runtime such as llama.cpp from source.

Expected performance

With the Q4_K_M quantization, expect Phi-4 to run at approximately 48 tokens per second, utilizing around 8.9GB of VRAM. This leaves 3.1GB of VRAM for the context, allowing you to handle large inputs and maintain a high level of responsiveness.
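The headroom figure is simple subtraction, and the context claim can be sanity-checked with a rough KV-cache estimate. The sketch below is an illustration, not a measurement: the helper names are made up for this example, and the Phi-4 shape passed to `kv_cache_gib` (40 layers, 10 key/value heads, head dimension 128, 16-bit cache) is an assumption that should be checked against the model's published config.

```shell
#!/bin/sh
# headroom_gb: VRAM left after loading the weights, in GB.
# Arguments: total VRAM, VRAM used by the weights.
headroom_gb() {
  awk -v total="$1" -v weights="$2" 'BEGIN { printf "%.1f\n", total - weights }'
}

# kv_cache_gib: rough 16-bit KV-cache size in GiB.
# Arguments: layers, kv_heads, head_dim, context_tokens.
# Per token: 2 (keys and values) * layers * kv_heads * head_dim * 2 bytes.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v n="$4" \
    'BEGIN { printf "%.3f\n", 2 * l * h * d * 2 * n / (1024 * 1024 * 1024) }'
}

headroom_gb 12 8.9             # figures from this page
kv_cache_gib 40 10 128 16384   # assumed Phi-4 shape (see the note above)
```

Under those assumptions a 16384-token cache comes to roughly 3.1 GiB, which would consume essentially all of the headroom and leave little for compute buffers. A shorter context leaves more slack.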

1. Install runtime: Ollama

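# Linux (on Windows, download the installer from https://ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh
ollama --version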
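Before pulling a model, it can help to confirm that the driver sees the GPU and that the `ollama` binary is on the PATH. A minimal sketch; `have` is a helper defined here, not part of either tool.

```shell
#!/bin/sh
# have: succeed if the named command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

if have nvidia-smi; then
  # Print GPU name, total VRAM, and driver version as one CSV line.
  nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
else
  echo "nvidia-smi not found: install or repair the NVIDIA driver"
fi

if have ollama; then
  ollama --version
else
  echo "ollama not found: install it before continuing"
fi
```

On this card the first line should name the RTX 3080 Ti and report roughly 12 GB of total memory.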

2. Download the model

Download the Phi-4 model with Q4_K_M quantization (8.4GB file) from the Hugging Face repository.

ollama pull hf.co/bartowski/phi-4-GGUF:Q4_K_M

3. Run it

# Opens an interactive chat; layers are offloaded to the GPU automatically when they fit
ollama run hf.co/bartowski/phi-4-GGUF:Q4_K_M

4. Optimize for RTX 3080 Ti

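Phi-4 at Q4_K_M fits entirely in the RTX 3080 Ti's 12GB of VRAM, so Ollama offloads every layer to the GPU by default and there is no need to cap the layer count. (A setting like --n-gpu-layers 12 is a llama.cpp flag; it would offload only 12 of the model's layers, not use 12GB.) To enable flash attention, set the environment variable OLLAMA_FLASH_ATTENTION=1 before starting the Ollama server. The roughly 3.1GB left after loading the weights goes to the context (KV cache), which makes a context window of up to 16384 tokens practical; set it with the num_ctx parameter.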
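In Ollama, per-model settings such as the context window are usually stored in a Modelfile rather than passed as command-line flags. The sketch below writes one that sets `num_ctx` to 16384 and builds a variant from it. The variant name `phi4-16k` is an arbitrary choice for this example, and the `FROM` line assumes the bartowski/phi-4-GGUF build referenced in step 2.

```shell
#!/bin/sh
# Write a Modelfile that raises the context window to 16384 tokens.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/phi-4-GGUF:Q4_K_M
PARAMETER num_ctx 16384
EOF

# Build the variant only if Ollama is installed.
if command -v ollama >/dev/null 2>&1; then
  # Flash attention is a server-side setting: export
  # OLLAMA_FLASH_ATTENTION=1 in the environment of the Ollama server.
  ollama create phi4-16k -f Modelfile || echo "ollama create failed (is the server running?)"
fi
```

Once the variant exists, `ollama run phi4-16k` starts a chat with the larger context. If memory runs short, lowering `num_ctx` in the Modelfile and re-running `ollama create` is the first thing to try.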

Troubleshooting

Out of memory errors during inference

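Lower the context window first, for example with /set parameter num_ctx 8192 inside an ollama run session. If that is not enough, offload fewer layers to the GPU with the num_gpu parameter; the remaining layers run on the CPU, which is slower.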

Slow token generation

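Run ollama ps and check that the model is loaded fully on the GPU; a CPU/GPU split means some layers did not fit in VRAM. Setting OLLAMA_FLASH_ATTENTION=1 for the Ollama server can also help, particularly at long contexts.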

Model not loading

Check that the model file is correctly downloaded and not corrupted. Try re-downloading the model.
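For out-of-memory errors it helps to see how much VRAM is actually free, since a desktop session or another application may already hold part of the 12 GB. A sketch that parses `nvidia-smi` output; `free_mib` is a helper defined here, and the sample line in the comment is hypothetical.

```shell
#!/bin/sh
# free_mib: read "used, total" lines (MiB) on stdin and print free MiB per line.
free_mib() {
  awk -F', *' '{ print $2 - $1 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # With nounits, each GPU yields a line such as "9100, 12288" (hypothetical values).
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | free_mib
fi
```

While a model is loaded, `ollama ps` complements this by reporting how the model is split between CPU and GPU.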

Alternative runtimes

While Ollama is the recommended runtime for this setup, you can also consider LM Studio for a more user-friendly interface, or llama.cpp for more advanced customization options. Jan is another lightweight option but may not offer the same level of performance optimization for the NVIDIA GeForce RTX 3080 Ti.

FAQ

What GPU do I need to run Phi-4?

To run Phi-4, you need a GPU with at least 8.9 GB of VRAM, but 15.0 GB is recommended for optimal performance.

Is Phi-4 good for coding?

Yes, Phi-4 is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 16,384 tokens.

Phi-4 vs Llama 3.1 8B?

Phi-4 has 14 billion parameters compared to Llama 3.1's 8 billion, making it more powerful for complex tasks but requiring more VRAM.

Can I run Phi-4 on a Mac?

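Yes. On Apple silicon Macs, Phi-4 runs on the integrated GPU through Metal, which both Ollama and llama.cpp support. The model draws on unified memory rather than dedicated VRAM, so choose a Mac with enough memory for the quantization you plan to use (the Q4_K_M build needs about 8.9 GB, plus room for the system).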

How much VRAM does Phi-4 need?

Phi-4 requires between 8.9 GB and 15.0 GB of VRAM, depending on the quantization level used.
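The range follows from how many bits each weight takes after quantization. A back-of-the-envelope sketch: the bits-per-weight values below (about 4.85 for Q4_K_M, 8.5 for Q8_0) are approximate averages assumed for illustration, `weights_gb` is a helper defined here, and real usage adds the KV cache and runtime buffers on top of the weights.

```shell
#!/bin/sh
# weights_gb: approximate weight size in GB from the parameter count
# (in billions) and the average number of bits per weight.
weights_gb() {
  awk -v params_b="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}

weights_gb 14 4.85   # Q4_K_M, assumed ~4.85 bits per weight
weights_gb 14 8.5    # Q8_0, assumed 8.5 bits per weight
```

Both estimates land in the same ballpark as the 8.9 GB and 15.0 GB figures quoted on this page; the gap is a reminder that the estimate covers the weights only.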

Is Phi-4 censored?

Phi-4 went through safety-focused post-training, so it declines some requests out of the box. Applications can add their own filtering on top, depending on how the model is deployed and configured.

Is Phi-4 commercial-use allowed?

Yes. Phi-4 is released under the MIT License, which permits commercial use; the main condition is that the copyright and license notice stay with any copies you distribute.

Phi-4 context length?

Phi-4 has a context length of 16,384 tokens, allowing it to handle longer sequences of text effectively.
