Can RTX 5080 run Phi-3.5 Mini 3.8B?

Yes — runs locally

~114 tok/sec · Instant — feels like typing. No noticeable delay.

Your VRAM: 16 GB
Model size: 3.8B
Best quant: Q8_0
VRAM needed: 4.3 GB

The verdict

The RTX 5080 (16 GB VRAM) handles Phi-3.5 Mini 3.8B comfortably using Q8_0 quantization, which fits in 4.3 GB. Expected throughput is around 114 tokens/second, which feels instant in interactive use, like typing, with no noticeable delay. Phi-3.5 Mini is a small but capable 3.8B model that runs on almost any hardware, including phones.
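
As a rough sanity check on the throughput figure: token generation is usually limited by memory bandwidth, because each new token reads roughly all of the weights once. The sketch below divides an assumed memory bandwidth by the model footprint to get a ceiling, then converts a tokens/second rate into wall-clock time. The 960 GB/s bandwidth is an assumption taken from the card's published specification, and the function names are illustrative only.

```python
# Rough decode-speed ceiling: bandwidth / bytes read per token.
MEM_BANDWIDTH_GBPS = 960.0  # ASSUMPTION: RTX 5080 spec-sheet memory bandwidth in GB/s
MODEL_GB = 4.3              # Q8_0 VRAM footprint quoted on this page


def decode_ceiling_tok_per_sec(bandwidth_gbps: float, model_gb: float) -> float:
    """Upper bound on tokens/second if every token reads all weights once."""
    return bandwidth_gbps / model_gb


def seconds_for(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / tok_per_sec


ceiling = decode_ceiling_tok_per_sec(MEM_BANDWIDTH_GBPS, MODEL_GB)
print(f"bandwidth ceiling: ~{ceiling:.0f} tok/sec")
print(f"500-token reply at 114 tok/sec: ~{seconds_for(500, 114):.1f} s")
```

Real throughput lands below this ceiling because of kernel overhead, KV cache reads, and sampling, so an estimate in the low hundreds of tokens per second is plausible for this pairing.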

Setup tutorial: Phi-3.5 Mini 3.8B on RTX 5080

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

The Phi-3.5 Mini 3.8B model runs at Grade S on the NVIDIA GeForce RTX 5080 with Q8_0 quantization, at an estimated ~114 tok/sec.

Prerequisites

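Before starting, make sure you have at least 5 GB of free disk space, a supported operating system (Windows or Linux), and a recent NVIDIA driver that supports the RTX 50 series (the R570 branch or newer). A separate CUDA toolkit install is not required for Ollama, which ships its own CUDA runtime libraries; CUDA 11.8 predates this GPU's Blackwell architecture and does not support it.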

Expected performance

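With Q8_0 quantization, the model should reach ~114 tok/sec while using 4.3 GB of VRAM, leaving about 11.7 GB for the KV cache and runtime buffers. The model supports a context window of up to 131,072 tokens, but the KV cache grows linearly with context length, so the full window is unlikely to fit in the remaining VRAM with a 16-bit cache. Plan for a practical context in the tens of thousands of tokens.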
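
A back-of-the-envelope way to size the context window against the VRAM left after the weights are loaded: each token stores one key and one value vector per layer. The sketch below assumes the published Phi-3 Mini architecture (32 layers, hidden size 3072, no grouped-query attention) and a 16-bit KV cache. These values are assumptions; check the config.json of the exact build you download.

```python
# ASSUMED Phi-3.5 Mini architecture values (verify against the model's config.json).
N_LAYERS = 32        # transformer blocks
HIDDEN_SIZE = 3072   # equals n_kv_heads * head_dim when there is no GQA
BYTES_FP16 = 2       # bytes per cached value with a 16-bit KV cache


def kv_bytes_per_token(bytes_per_value: float = BYTES_FP16) -> float:
    """KV cache bytes for one token: keys plus values across all layers."""
    return 2 * N_LAYERS * HIDDEN_SIZE * bytes_per_value


def max_context_tokens(free_vram_gb: float, bytes_per_value: float = BYTES_FP16) -> int:
    """Upper bound on cached tokens, ignoring compute buffers and fragmentation."""
    return int(free_vram_gb * 1e9 // kv_bytes_per_token(bytes_per_value))


def kv_cache_gb(tokens: int, bytes_per_value: float = BYTES_FP16) -> float:
    """KV cache size in GB for a given context length."""
    return tokens * kv_bytes_per_token(bytes_per_value) / 1e9


print(kv_bytes_per_token())                 # bytes of cache per token
print(max_context_tokens(11.7))             # tokens that fit in 11.7 GB
print(round(kv_cache_gb(131072), 1))        # GB needed for the full window
```

Under these assumptions the remaining 11.7 GB holds roughly 30,000 tokens of 16-bit cache, while the full 131,072-token window would need several times the card's total VRAM. Passing a smaller bytes_per_value (for example 1 for an 8-bit cache) shows how cache quantization stretches the budget.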

1. Install runtime: Ollama

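On Linux, run the official install script, then confirm the install. On Windows, download the installer from ollama.com instead.

curl -fsSL https://ollama.com/install.sh | sh
ollama --version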

2. Download the model

Download the Q8_0 quantized version of Phi-3.5 Mini 3.8B (a file of about 4 GB).

ollama pull hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0
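
To see where the download size comes from, here is a small sketch of the Q8_0 arithmetic. Q8_0 stores weights in blocks of 32 int8 values plus one 16-bit scale, which works out to 8.5 bits per weight. The calculation assumes every tensor is stored as Q8_0 and uses the 3.8B parameter count quoted on this page, so treat the result as an estimate rather than the exact file size.

```python
# Q8_0 block layout: 32 int8 weights + one fp16 scale per block.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 32 * 1 + 2                            # 34 bytes per block
BITS_PER_WEIGHT = BLOCK_BYTES * 8 / BLOCK_WEIGHTS   # 8.5 bits per weight


def q8_0_size_gb(n_params: float) -> float:
    """Approximate size in GB, ASSUMING all tensors use Q8_0."""
    return n_params * BITS_PER_WEIGHT / 8 / 1e9


print(BITS_PER_WEIGHT)
print(round(q8_0_size_gb(3.8e9), 2))  # estimate for a 3.8B-parameter model
```

The estimate comes out at a little over 4 GB, which is consistent with a 4.3 GB VRAM footprint once runtime buffers are added.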

3. Run it

ollama run hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0

This opens an interactive chat prompt; type /bye to exit.
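
Once the model runs in the terminal, it can also be called from a script through Ollama's local HTTP API. The sketch below builds a non-streaming request to the /api/generate endpoint on the default port 11434 using only the standard library. The model reference is an assumption and must match whatever name you pulled; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
# ASSUMPTION: the model was pulled under this reference; adjust to your local name.
MODEL = "hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request asking for one complete (non-streamed) reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running, calling generate("Explain quantization in one sentence.") returns the model's reply as a string.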

4. Optimize for RTX 5080

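Ollama offloads every layer to the GPU automatically when the model fits in VRAM, which it does here with room to spare, so no layer flags are needed. The --n-gpu-layers and --flash-attn options belong to llama.cpp, not to ollama run. In Ollama, the equivalent controls are the num_gpu parameter (number of layers to offload), the num_ctx parameter (context length), and the OLLAMA_FLASH_ATTENTION=1 environment variable, which is set on the Ollama server process. Tensor parallelism is not needed, since the model fits on a single GPU.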
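
Ollama's runtime parameters can be set per request through the options field of the HTTP API. The sketch below builds such a payload with num_ctx and, optionally, num_gpu. The 16384-token context is a hypothetical starting point chosen to leave headroom on a 16 GB card, not a measured optimum, and the function name is illustrative.

```python
import json
from typing import Optional


def generate_payload(prompt: str, model: str,
                     num_ctx: int = 16384,
                     num_gpu: Optional[int] = None) -> dict:
    """Build an /api/generate payload with per-request runtime options.

    num_ctx: context length in tokens (hypothetical default).
    num_gpu: layers to offload; omit it to let Ollama decide automatically.
    """
    options = {"num_ctx": num_ctx}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


payload = generate_payload(
    "Hello",
    "hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0",  # assumed model reference
)
print(json.dumps(payload, indent=2))
```

Leaving num_gpu unset is usually the right choice for this model, since the whole thing fits in VRAM; set it only when you need to force a partial offload.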

Troubleshooting

Out of memory errors during inference

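Lower the context length first (for example, /set parameter num_ctx 8192 inside an ollama run session), since the KV cache is the most likely cause on a 16 GB card. If errors persist, offload fewer layers with /set parameter num_gpu 28 or lower.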

Slow inference speed

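Run ollama ps while the model is loaded and check that the processor column reports 100% GPU; a CPU share means some layers were not offloaded. Also confirm that nvidia-smi lists the RTX 5080, and try enabling flash attention by starting the server with OLLAMA_FLASH_ATTENTION=1.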
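
To check whether inference is actually slow, measure it rather than guess. A non-streaming Ollama API response reports eval_count (generated tokens) and eval_duration (nanoseconds), from which generation speed follows directly. The sketch below computes tokens per second from such a response; the sample numbers are made up for illustration and are not a benchmark.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's eval_count / eval_duration."""
    duration_s = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / duration_s


# Made-up sample values, shaped like the timing fields of a real response.
sample = {"eval_count": 228, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/sec")
```

For a quick check without any code, ollama run with the --verbose flag prints the same timing statistics after each reply.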

Model fails to load

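Remove the model with 'ollama rm hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0' and pull it again; Ollama checks each downloaded layer against its SHA-256 digest. Also make sure Ollama itself is up to date, since older builds may predate RTX 50-series support.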

Alternative runtimes

For users who prefer a different runtime, consider LM Studio for a graphical interface, llama.cpp for direct control over quantization and flags such as --n-gpu-layers and --flash-attn, or Jan for an open-source desktop chat app. Ollama is recommended here for its ease of use on the NVIDIA GeForce RTX 5080.

FAQ

What GPU do I need to run Phi-3.5 Mini 3.8B?

Phi-3.5 Mini 3.8B needs a GPU with at least 2.7 GB of VRAM for the smallest quantizations; about 4.3 GB is recommended so that the Q8_0 build fits.

Is Phi-3.5 Mini 3.8B good for coding?

Phi-3.5 Mini 3.8B is capable of generating code and providing coding assistance, but its performance is best suited for simpler tasks due to its 3.8B parameters.

Phi-3.5 Mini 3.8B vs Llama 3.1 8B?

Phi-3.5 Mini 3.8B has 3.8B parameters, making it smaller and more resource-efficient than Llama 3.1 8B, which has 8B parameters and requires more VRAM and computational power.

Can I run Phi-3.5 Mini 3.8B on a Mac?

Yes, Phi-3.5 Mini 3.8B can run on a Mac. Apple Silicon Macs share one pool of unified memory between the CPU and GPU, so make sure at least 2.7 GB of it is free for the model.

How much VRAM does Phi-3.5 Mini 3.8B need?

Phi-3.5 Mini 3.8B requires a minimum of 2.7 GB of VRAM, but 4.3 GB is recommended for better performance, depending on the quantization level.

Is Phi-3.5 Mini 3.8B censored?

Phi-3.5 Mini 3.8B is an instruction-tuned model that went through safety post-training, so it tends to refuse harmful or inappropriate requests. Running it locally adds no separate content filter on top of the model.

Is Phi-3.5 Mini 3.8B commercial-use allowed?

Yes, Phi-3.5 Mini 3.8B is licensed under the MIT License, which allows for commercial use.

Phi-3.5 Mini 3.8B context length?

Phi-3.5 Mini 3.8B supports a context length of 131,072 tokens (128K). Using the full window takes far more memory than the weights alone, because the KV cache grows with context length.
