Can RTX 4080 SUPER run Phi-3.5 MoE?
Only with partial CPU offload
Q4_K_M needs 24.1 GB · the card has 16 GB of VRAM
The verdict
The RTX 4080 SUPER (16 GB VRAM) cannot hold Phi-3.5 MoE entirely in VRAM: the Q4_K_M quantization needs about 24.1 GB, roughly 8.1 GB more than the card provides. Running it locally therefore means offloading part of the model to system RAM, which lowers throughput noticeably in interactive use. Phi-3.5 MoE is Microsoft's mixture-of-experts model: 16 experts of 3.8 B parameters each, with 6.6 B parameters active per token. It offers strong reasoning at modest compute cost.
Setup tutorial: Phi-3.5 MoE on RTX 4080 SUPER
AI-generated, GPU-specific. Verified commands for your exact hardware.
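Phi-3.5 MoE earns a grade D on the NVIDIA GeForce RTX 4080 SUPER with the Q4_K_M quantization, because the model is larger than the card's 16 GB of VRAM and part of it has to run from system RAM. The estimated throughput is ~17 tok/sec.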
Prerequisites
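Before starting, make sure you have at least 25 GB of free disk space, a supported operating system (Windows or Linux), a recent NVIDIA driver (version 525.60.13 or later) and, if your runtime requires it, CUDA 11.8. You also need enough free system RAM to hold the roughly 8 GB of the model that does not fit in VRAM.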
Expected performance
With the Q4_K_M quantization, the estimate is approximately 17 tokens per second. The model needs about 24.1 GB of memory, which is 8.1 GB more than the card's 16 GB of VRAM, so that shortfall is served from system RAM. Because there is a deficit rather than headroom, the KV cache competes for the same memory, and the practical context window will be well below the model's 131,072-token maximum.
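The headroom figure is plain subtraction, and a short sketch makes the sign explicit. The 16 GB and 24.1 GB values come from this page; the function name is only illustrative.

```python
def vram_headroom_gb(vram_gb: float, required_gb: float) -> float:
    """Return spare VRAM in GB; a negative value means the model does not fit."""
    return round(vram_gb - required_gb, 1)


headroom = vram_headroom_gb(vram_gb=16.0, required_gb=24.1)
print(f"Headroom: {headroom} GB")
if headroom < 0:
    # The deficit has to be covered by system RAM (CPU offload).
    print(f"About {-headroom} GB must be served from system RAM.")
```

A negative result is the signal to plan for partial offload rather than a full-GPU run.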
1. Install runtime: Ollama
On Linux, run the official install script. On Windows, download the installer from ollama.com instead.
curl -fsSL https://ollama.com/install.sh | sh
2. Download the model
Download the Q4_K_M quantized model (23.6 GB) from the Hugging Face repository.
ollama pull hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
3. Run it
ollama run hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
4. Optimize for RTX 4080 SUPER
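On the NVIDIA GeForce RTX 4080 SUPER with 16 GB of VRAM, the goal is to keep as many layers on the GPU as will fit and let the rest run on the CPU. Ollama chooses the split automatically; to override it, set the num_gpu parameter (for example, /set parameter num_gpu 16 inside an ollama run session), with 16 layers as a starting point. Flash attention is enabled by setting OLLAMA_FLASH_ATTENTION=1 in the Ollama server's environment. The --n-gpu-layers and --flash-attn flags are llama.cpp options and are not accepted by the ollama command. Since the model needs 24.1 GB and the card has 16 GB, about 8.1 GB is offloaded to system RAM, so keep the context window short to leave VRAM for the KV cache.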
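To see where a layer count such as 16 might come from, here is a rough sketch that estimates how many layers fit in VRAM. The 24.1 GB and 16 GB figures come from this page. The layer count of 32, the 2 GB reserve, and the assumption that all layers are the same size are illustrative guesses, not measured values.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumptions: 32 equally sized layers, 2 GB reserved for the KV cache,
# the CUDA context, and the desktop.
print(layers_that_fit(model_gb=24.1, n_layers=32, vram_gb=16.0, reserve_gb=2.0))
```

Under these assumptions the estimate is 18 layers, in the same range as the starting value of 16. Treat it as a first guess and adjust based on actual memory use.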
Troubleshooting
Out of memory error during inference
Lower the number of GPU layers to 12 or 8 to reduce VRAM usage (num_gpu in Ollama, --n-gpu-layers in llama.cpp). Shortening the context window also frees VRAM.
Slow inference speed
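Some slowdown is expected because part of the model runs on the CPU; ollama ps shows the CPU/GPU split. Make sure flash attention is enabled (OLLAMA_FLASH_ATTENTION=1 for Ollama, --flash-attn for llama.cpp) and update your NVIDIA drivers to the latest version.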
Model fails to load
Verify that the model file is fully downloaded and not corrupted. Re-run the download command if necessary.
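The out-of-memory fix above can also be applied per request through Ollama's local REST API, which accepts num_gpu and num_ctx in the options field. The sketch below is a minimal example under some assumptions: the model tag must match whatever you actually pulled, the server is assumed to listen on the default port 11434, and the 4096-token context is an arbitrary small value chosen for illustration.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, num_gpu: int, num_ctx: int) -> dict:
    """Build the JSON body for Ollama's /api/generate with reduced GPU use."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu, "num_ctx": num_ctx},
    }


def send(body: dict, url: str = "http://localhost:11434/api/generate") -> str:
    """POST the request to a locally running Ollama server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


body = build_generate_request(
    model="hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M",  # assumed model tag
    prompt="Say hello.",
    num_gpu=12,    # fewer GPU layers, as suggested for out-of-memory errors
    num_ctx=4096,  # assumed small context to limit KV cache size
)
print(json.dumps(body, indent=2))
# send(body)  # uncomment when an Ollama server is running locally
```

If the request still runs out of memory, lower num_gpu again (for example to 8) before reducing anything else.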
Alternative runtimes
If you prefer a different runtime, consider LM Studio for a more user-friendly interface, llama.cpp for fine-grained control over GPU offload and other optimizations, or Jan for an offline, local-first desktop app. Each runtime has its own strengths, but Ollama provides a good balance of ease of use and performance for this GPU.
FAQ
What GPU do I need to run Phi-3.5 MoE?
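To run the Q4_K_M quantization of Phi-3.5 MoE entirely on the GPU, you need about 24.1 GB of VRAM. A 24 GB card such as the NVIDIA RTX 3090 is marginal at that size, while a 48 GB card such as the RTX A6000 fits it comfortably. Cards with less VRAM have to offload part of the model to system RAM.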
Is Phi-3.5 MoE good for coding?
Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.
Phi-3.5 MoE vs Llama 3.1 8B?
Phi-3.5 MoE has 41.9 billion parameters compared to Llama 3.1 8B's 8 billion, offering more sophisticated reasoning and context handling but requiring significantly more VRAM.
Can I run Phi-3.5 MoE on a Mac?
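Yes, on an Apple Silicon Mac with enough unified memory. The Q4_K_M quantization needs about 24.1 GB plus room for macOS itself, so choose a configuration with comfortably more unified memory than that. An eGPU is not an option, because Apple Silicon Macs do not support eGPUs.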
How much VRAM does Phi-3.5 MoE need?
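Phi-3.5 MoE requires about 24.1 GB of VRAM with the Q4_K_M quantization. The requirement changes with the quantization level: lower-bit quantizations need less memory and higher-bit ones need more.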
Is Phi-3.5 MoE censored?
The instruct release of Phi-3.5 MoE was post-trained with safety measures, so it may decline some requests. Its responses are also influenced by the training data and by any filters applied during deployment.
Is Phi-3.5 MoE commercial-use allowed?
Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.
Phi-3.5 MoE context length?
Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.
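The VRAM answers above depend on quantization, and the relationship can be sketched with simple arithmetic. The 23.6 GB download size and the 41.9 billion parameter count are taken from this page. The other bit widths are hypothetical values chosen to show the trend; real GGUF quantizations will differ, and runtime memory also includes the KV cache.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, from file size and parameter count."""
    return file_gb * 8 / params_billion


def estimated_size_gb(params_billion: float, bpw: float) -> float:
    """Approximate weight size in GB for a given average bits per weight."""
    return params_billion * bpw / 8


# Figures from this page: 23.6 GB Q4_K_M download, 41.9 B parameters.
q4_bpw = bits_per_weight(file_gb=23.6, params_billion=41.9)
print(f"Q4_K_M average: {q4_bpw:.2f} bits per weight")

# Hypothetical bit widths to show how size scales with quantization.
for bpw in (3.0, q4_bpw, 8.0):
    print(f"{bpw:.2f} bits/weight -> ~{estimated_size_gb(41.9, bpw):.1f} GB")
```

The estimate covers weights only, so actual memory use during inference will be somewhat higher.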