~/runthismodel

Can RTX 5090 run Phi-3.5 MoE?

Grade: A

Yes — runs locally

~42 tok/sec · Fast — smooth conversation. Responses feel real-time.

Your VRAM: 32 GB
Model size: 41.9B parameters
Best quant: Q4_K_M
VRAM needed: 24.1 GB

The verdict

The RTX 5090 (32 GB VRAM) handles Phi-3.5 MoE comfortably using the Q4_K_M quantization, which fits in 24.1 GB. Expected throughput is around 42 tokens/second, which is fast enough for smooth, real-time conversation in interactive use. Phi-3.5 MoE is Microsoft's mixture-of-experts model: 16 experts of 3.8B parameters each, with 6.6B parameters active per token. It offers strong reasoning at modest compute cost.

Setup tutorial: Phi-3.5 MoE on RTX 5090

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Phi-3.5 MoE runs well on the NVIDIA GeForce RTX 5090 (Grade A) using the Q4_K_M quantization, at roughly 42 tokens per second, the estimate given in the verdict above.

Prerequisites

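Before starting, ensure you have at least 25 GB of free disk space, a compatible operating system (Windows or Linux), and a recent NVIDIA driver. The RTX 5090 is a Blackwell-generation card and needs a driver from the R570 branch or newer; the older 525-series drivers paired with CUDA 11.8 do not support it. Ollama ships its own CUDA runtime libraries, so a separate CUDA toolkit install is not normally required.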
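The prerequisites above can be checked from a terminal before installing anything. The following is a minimal sketch, assuming a Linux shell; the helper names `free_gb` and `has_free_gb` are made up for this example, and checking the home directory assumes that is where the model files will be stored (Ollama's storage location varies by install method).

```shell
# free_gb PATH -> prints whole GiB free on the filesystem that holds PATH
free_gb() {
  # -P forces one line per filesystem, -k reports 1024-byte blocks;
  # column 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_free_gb PATH MIN_GB -> succeeds if at least MIN_GB are free
has_free_gb() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Assumption: models are stored under the home directory.
if has_free_gb "${HOME:-/}" 25; then
  echo "disk: OK (25 GB or more free)"
else
  echo "disk: less than 25 GB free"
fi

# Report GPU name, driver version, and total VRAM if the driver is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```

The driver version printed by `nvidia-smi` is the number to compare against the minimum branch for your card.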

Expected performance

With the Q4_K_M quantization, expect around 42 tokens per second (the estimate given in the verdict above) and about 24.1 GB of VRAM use. That leaves about 7.9 GB for the KV cache and runtime overhead. The model supports a context window of up to 131,072 tokens, but the window that actually fits depends on how much memory the KV cache needs per token, so the full window may not fit in the remaining VRAM.
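To see how far the 7.9 GB of headroom goes, the KV cache can be estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below is a rough estimate, not a measurement. The architecture values (32 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions to check against the model's config.json, and the function names are made up for this example.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE TOKENS
# Prints the KV cache size in MiB (keys plus values for every layer).
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# tokens_in_budget LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE BUDGET_MIB
# Prints how many tokens of KV cache fit in BUDGET_MIB.
tokens_in_budget() {
  echo $(( $5 * 1048576 / (2 * $1 * $2 * $3 * $4) ))
}

# Assumed architecture: 32 layers, 8 KV heads, head dim 128, 2-byte values.
kv_cache_mib 32 8 128 2 131072     # full 131,072-token window -> 16384
tokens_in_budget 32 8 128 2 7534   # 7.9 GB is about 7534 MiB   -> 60272
```

Under these assumptions the full window would need about 16 GiB of cache, roughly twice the headroom, so a context of around 60,000 tokens is a more realistic ceiling before other runtime overhead is counted.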

1. Install runtime: Ollama

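# Linux: official install script (on Windows, use the installer from https://ollama.com/download).
# Note that "pip install ollama" installs only the Python client library, not the runtime.
curl -fsSL https://ollama.com/install.sh | sh
ollama --version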

2. Download the model

Download the Phi-3.5 MoE model with Q4_K_M quantization (a 23.6 GB file) from Hugging Face. Ollama pulls GGUF repositories through the hf.co/ prefix, with the quantization name as the tag.

ollama pull hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M

3. Run it

# Starts an interactive chat session. "ollama run" does not accept llama.cpp-style
# flags such as --n-gpu-layers or --flash-attn, and there is no "ollama chat" command.
ollama run hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
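Besides the interactive session, a running Ollama server also answers HTTP requests on port 11434, which is handy for scripting. The sketch below builds a request body for the `/api/generate` endpoint and sends it only if a local server responds. The helper name `build_request` is made up for this example, the model reference is assumed to match the one pulled in step 2, and the helper does no JSON escaping, so it only suits simple prompts.

```shell
# build_request MODEL PROMPT -> prints a JSON body for POST /api/generate
# Limitation: MODEL and PROMPT must not contain double quotes or backslashes,
# because this sketch does not escape them.
build_request() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

# Assumption: this is the reference the model was pulled under.
MODEL="hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M"
BODY=$(build_request "$MODEL" "Summarize mixture-of-experts in one sentence.")
echo "$BODY"

# Send the request only if a local Ollama server answers on the default port.
if curl -fsS http://localhost:11434/api/version >/dev/null 2>&1; then
  curl -fsS http://localhost:11434/api/generate -d "$BODY"
fi
```

With `"stream": false` the server returns a single JSON object instead of a stream of partial responses.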

4. Optimize for RTX 5090

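On the NVIDIA GeForce RTX 5090 with 32 GB of VRAM, Ollama detects the GPU and offloads as many layers as fit automatically, so no layer count needs to be passed on the command line. To override it, set the num_gpu parameter, for example with /set parameter num_gpu inside the chat session or in a Modelfile. Flash attention is controlled by the OLLAMA_FLASH_ATTENTION=1 environment variable on the Ollama server rather than by a --flash-attn flag. Tensor parallelism splits a model across multiple GPUs, so it does not apply to a single RTX 5090.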

Troubleshooting

Out of memory errors during inference

Lower the context length (the num_ctx parameter), or offload fewer layers to the GPU by lowering num_gpu. Tensor parallelism is not a factor on a single GPU.
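One way to make a smaller context window stick is to register a model variant through a Modelfile. The sketch below writes a two-line Modelfile and, if the `ollama` CLI is present, creates the variant. The helper name `write_modelfile`, the variant name `phi35-moe-8k`, and the 8192-token window are illustrative choices, and the base model reference is assumed to match the one pulled earlier.

```shell
# write_modelfile BASE_MODEL NUM_CTX OUT_FILE
# Writes a Modelfile that derives from BASE_MODEL with a fixed context length.
write_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Assumption: this is the reference the model was pulled under.
write_modelfile "hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M" 8192 Modelfile
cat Modelfile

# Register the variant only when the ollama CLI is available.
if command -v ollama >/dev/null 2>&1; then
  ollama create phi35-moe-8k -f Modelfile || echo "ollama create failed"
fi
```

After that, `ollama run phi35-moe-8k` starts the model with the reduced window. A `PARAMETER num_gpu` line can be added in the same way to cap the number of offloaded layers.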

Slow inference speed

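Ensure that flash attention is enabled (OLLAMA_FLASH_ATTENTION=1 set for the Ollama server), and run ollama ps to confirm the model is loaded entirely on the GPU rather than split with the CPU. Also check that the NVIDIA driver is current.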
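Before tuning, it helps to measure the actual speed and compare it with the estimate. `ollama run <model> --verbose` prints an eval rate after each reply, and responses from the `/api/generate` endpoint carry `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below converts those two fields into tokens per second; the helper name and the sample numbers are made up for illustration.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS -> tokens per second, one decimal
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Illustrative values: 420 tokens generated in 10 seconds.
tokens_per_sec 420 10000000000   # -> 42.0
```

A result far below the estimate usually means part of the model is running on the CPU.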

Model fails to load

Verify the integrity of the downloaded model file and try re-downloading it.

Alternative runtimes

Alternative runtimes like LM Studio, llama.cpp, and Jan can be used for specific use cases. LM Studio offers a more user-friendly interface and is suitable for those who prefer a GUI. llama.cpp is the inference engine that Ollama builds on; running it directly exposes low-level flags such as --n-gpu-layers and --flash-attn at the cost of more manual setup. Jan is a desktop chat application that can be useful for quick prototyping and testing.

FAQ

What GPU do I need to run Phi-3.5 MoE?

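To run Phi-3.5 MoE at the Q4_K_M quantization, you need a GPU with at least 24.1 GB of VRAM. A 24 GB card such as the NVIDIA RTX 3090 falls just short, so it needs a smaller quantization or partial CPU offload. Cards with more memory, such as the RTX 5090 (32 GB) or RTX A6000 (48 GB), fit the model fully.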

Is Phi-3.5 MoE good for coding?

Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.

Phi-3.5 MoE vs Llama 3.1 8B?

Phi-3.5 MoE has 41.9 billion total parameters (6.6 billion active per token) compared to Llama 3.1 8B's 8 billion, offering more sophisticated reasoning but requiring significantly more VRAM. Both models support long context windows.

Can I run Phi-3.5 MoE on a Mac?

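Yes, on an Apple Silicon Mac with enough unified memory to hold the roughly 24.1 GB that the Q4_K_M quantization needs, plus room for the operating system. Apple Silicon Macs share memory between CPU and GPU and do not support eGPUs, so an eGPU setup is not an option.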

How much VRAM does Phi-3.5 MoE need?

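Phi-3.5 MoE requires about 24.1 GB of VRAM at the Q4_K_M quantization. The requirement changes with the quantization level: lower-bit quantizations need less memory, higher-bit ones need more, and longer contexts add KV cache on top.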
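A rough rule for the weights alone is parameters × bits per weight ÷ 8. The sketch below applies it to the 41.9B parameter count. The 4.5 bits per weight for Q4_K_M is back-derived from the 23.6 GB file size quoted in the download step, not an official figure, and the helper name is made up for this example.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT -> approximate weight size in GB
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 41.9 4.5   # Q4_K_M at an assumed ~4.5 bits/weight -> 23.6
weights_gb 41.9 8.5   # Q8_0 at 8.5 bits/weight               -> 44.5
weights_gb 41.9 16    # 16-bit weights                         -> 83.8
```

These figures cover the weights only; the KV cache and runtime buffers come on top, which is why the VRAM needed for Q4_K_M is listed as 24.1 GB rather than 23.6 GB.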

Is Phi-3.5 MoE censored?

Phi-3.5 MoE is not inherently censored, but its responses may be influenced by the training data and any filters applied during deployment.

Is Phi-3.5 MoE commercial-use allowed?

Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.

Phi-3.5 MoE context length?

Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.