~/runthismodel

Can RTX 5060 Ti run Phi-3.5 MoE?

Grade: D

Not fully: runs locally only with CPU offload

~0 tok/sec on the GPU alone · insufficient VRAM (24.1 GB needed, 16 GB available)

Your VRAM: 16 GB
Model size: 41.9B parameters
Best quant: Q4_K_M
VRAM needed: 24.1 GB

The verdict

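The RTX 5060 Ti (16 GB VRAM) cannot hold Phi-3.5 MoE entirely on the GPU: the best available quantization, Q4_K_M, needs 24.1 GB, about 8 GB more than the card provides. A GPU-only run is rated at ~0 tokens/second (cannot run, insufficient VRAM), so any local run depends on offloading part of the model to system RAM and the CPU. Phi-3.5 MoE is Microsoft's mixture-of-experts model: 16 experts of 3.8 B parameters each, with 6.6 B parameters active per token, giving strong reasoning at modest compute cost.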

Setup tutorial: Phi-3.5 MoE on RTX 5060 Ti

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Run Phi-3.5 MoE on an NVIDIA GeForce RTX 5060 Ti with Q4_K_M quantization and partial CPU offload. Estimated throughput is about 17 tok/sec; because the model needs 24.1 GB and the card has 16 GB, the pairing is rated Grade D.

Prerequisites

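Before starting, make sure you have at least 23.6 GB of free disk space, a supported operating system (Windows or Linux), enough system RAM to hold the layers that do not fit in VRAM, and a current NVIDIA driver that supports the RTX 50 series (Blackwell). Older combinations such as driver 526.95 with CUDA 11.8 predate this GPU and do not support it. Ollama ships its own CUDA runtime libraries, so a separate CUDA toolkit install is normally not needed.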
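The checks above can be scripted. The following is a minimal sketch, assuming a POSIX shell and that the model will be stored on the filesystem of the current directory; the variable names (need_gb, free_gb, disk_status) are illustrative and not part of any tool.

```shell
# Free space required by the text above (23.6 GB), rounded up to whole GB.
need_gb=24

# df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is available space.
free_kb=$(df -Pk . | awk 'NR==2 {print $4}')
free_gb=$((free_kb / 1024 / 1024))

if [ "$free_gb" -ge "$need_gb" ]; then
  disk_status="ok"
else
  disk_status="low"
fi
echo "free disk: ${free_gb} GB (need ${need_gb} GB): ${disk_status}"

# Report GPU name, driver version and total VRAM if the NVIDIA tools are present.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader \
    || echo "nvidia-smi is installed but could not talk to the driver"
else
  echo "nvidia-smi not found: install or update the NVIDIA driver"
fi
```

If the driver line is missing or the reported GPU is not the RTX 5060 Ti, fix the driver installation before continuing.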

Expected performance

With the recommended settings, expect roughly 17 tok/sec. The model needs about 24.1 GB in total, which exceeds the card's 16 GB of VRAM, so the layers that do not fit on the GPU stay in system RAM and run on the CPU (this is layer offload, not disk swap). A context window of around 8192 tokens remains practical, provided part of the VRAM is left free for the KV cache.
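To get a feel for how much of the model can sit on the GPU, here is a rough back-of-the-envelope calculation. The 16 GB and 24.1 GB figures come from this page; the 1.5 GB reserve for KV cache and buffers is an assumption chosen for illustration, not a measured value.

```shell
vram_gb=16        # card VRAM, from the summary above
need_gb=24.1      # Q4_K_M requirement, from the summary above
reserve_gb=1.5    # ASSUMPTION: headroom for KV cache and buffers

# Share of the model that fits on the GPU, clamped to the range 0..1.
gpu_share=$(awk -v v="$vram_gb" -v n="$need_gb" -v r="$reserve_gb" \
  'BEGIN { s = (v - r) / n; if (s > 1) s = 1; if (s < 0) s = 0; printf "%.2f", s }')

echo "share of the model that fits on the GPU: ${gpu_share}"
```

Under these assumptions about 60% of the model fits in VRAM and the rest runs from system RAM, which is why throughput is well below what a GPU with enough VRAM would deliver.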

1. Install the runtime: Ollama

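On Linux, use the official install script; on Windows, run the installer from ollama.com. (pip install ollama only installs the Python client library, not the runtime.)

curl -fsSL https://ollama.com/install.sh | sh
ollama --version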

2. Download the model

Pull the Q4_K_M quantized build of Phi-3.5 MoE, a 23.6 GB file, directly from Hugging Face. Ollama addresses Hugging Face GGUF repositories as hf.co/USER/REPO:QUANT.

ollama pull hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M

3. Run it

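ollama run hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M

4. Optimize for RTX 5060 Ti

ollama run does not accept llama.cpp-style flags such as --n-gpu-layers or --flash-attn; Ollama chooses how many layers to place on the GPU automatically. To override that choice on the RTX 5060 Ti with 16 GB VRAM, set the num_gpu parameter (the number of layers offloaded to the GPU) to a value that fills most of the VRAM while leaving headroom for the KV cache, for example with /set parameter num_gpu N inside an interactive session. Flash attention is enabled by setting OLLAMA_FLASH_ATTENTION=1 in the server's environment. No tensor-parallelism setting is needed on a single GPU.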
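In Ollama, the number of GPU layers is passed as the num_gpu option rather than as a command-line flag. The sketch below builds a request for Ollama's local HTTP API with that option set. The model name assumes the Hugging Face pull syntax, num_gpu=19 is a hypothetical starting value (about 60% of an assumed 32 layers; check the real layer count in the model metadata), and build_payload is an illustrative helper, not part of Ollama.

```shell
model="hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M"  # ASSUMPTION: name as pulled; see `ollama list`
num_gpu=19     # ASSUMPTION: starting value; lower it on out-of-memory errors
num_ctx=8192   # context window from the performance section

# Build the JSON body. The prompt must not contain quotes or backslashes,
# because this simple helper does no JSON escaping.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_gpu":%d,"num_ctx":%d}}' \
    "$model" "$1" "$num_gpu" "$num_ctx"
}

payload=$(build_payload "Say hello in one sentence.")
echo "$payload"

# Send the request only if a local Ollama server answers on its default port.
if curl -fsS --max-time 2 http://localhost:11434/api/version >/dev/null 2>&1; then
  curl -fsS http://localhost:11434/api/generate -d "$payload" \
    || echo "request failed: is the model pulled?"
fi
```

Raising num_gpu moves more layers onto the GPU and speeds up generation until VRAM runs out, so it is worth stepping the value up or down by a few layers and watching memory use.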

Troubleshooting

Out of memory errors during inference.

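Lower the number of layers placed on the GPU (num_gpu in Ollama, --n-gpu-layers in llama.cpp) and, if errors persist, reduce the context length or the batch size. Increasing the batch size raises memory use and makes the problem worse.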

Slow inference speed below 10 tok/sec.

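Check that flash attention is enabled (OLLAMA_FLASH_ATTENTION=1 in Ollama's server environment, --flash-attn in llama.cpp), that the GPU driver is up to date, and that as many layers as fit are on the GPU. Running ollama ps shows the CPU/GPU split for a loaded model.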
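With Ollama, flash attention is controlled by an environment variable read by the server rather than by a run-time flag. A minimal sketch, assuming the server is started by hand from the same shell (a server installed as a system service needs the variable set in the service configuration instead):

```shell
# Must be set before the server starts; an already running server is not affected.
export OLLAMA_FLASH_ATTENTION=1
echo "OLLAMA_FLASH_ATTENTION=${OLLAMA_FLASH_ATTENTION}"

# Show loaded models and how each is split between CPU and GPU.
if command -v ollama >/dev/null 2>&1; then
  ollama ps || echo "ollama server not reachable"
else
  echo "ollama not installed"
fi
```

A large CPU share in the ollama ps output is expected for this model on a 16 GB card and is the main reason for low throughput.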

Inference fails to start.

Verify that the model file is correctly downloaded and not corrupted by re-running the 'ollama pull' command.

Alternative runtimes

Consider LM Studio for a graphical interface, llama.cpp for direct control over offload settings such as --n-gpu-layers, or Jan for an open-source desktop app. Choose based on your needs and hardware constraints.


FAQ

What GPU do I need to run Phi-3.5 MoE?

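To run the Q4_K_M build of Phi-3.5 MoE entirely on the GPU, you need about 24.1 GB of VRAM. A 24 GB card such as the NVIDIA RTX 3090 falls just short of that figure, while a 48 GB card such as the RTX A6000 fits it with room for context. Cards with less VRAM can still run the model by offloading some layers to the CPU, at reduced speed.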

Is Phi-3.5 MoE good for coding?

Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.

Phi-3.5 MoE vs Llama 3.1 8B?

Phi-3.5 MoE has 41.9 billion total parameters, of which 6.6 billion are active per token, compared with Llama 3.1 8B's 8 billion. It offers stronger reasoning, but all 41.9 billion parameters must be held in memory, so it requires significantly more VRAM.

Can I run Phi-3.5 MoE on a Mac?

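Yes, on an Apple Silicon Mac with enough unified memory. The GPU shares memory with the CPU, so the machine needs comfortably more than the 24.1 GB the Q4_K_M build requires. External GPUs are not supported on Apple Silicon, so an eGPU setup is not an option there.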

How much VRAM does Phi-3.5 MoE need?

Phi-3.5 MoE requires about 24.1 GB of VRAM at Q4_K_M. The requirement changes with the quantization level: lower-bit quantizations need less memory and higher-bit ones need more.
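Weight size scales with the number of bits stored per parameter, which can be sketched as parameters × bits per weight ÷ 8. In the example below, 41.9 billion parameters comes from this page; the bits-per-weight values are rough illustrative assumptions (4.5 is the value implied by the 23.6 GB Q4_K_M file), and estimate_gb is a hypothetical helper. The result covers weights only, not KV cache or runtime buffers.

```shell
# estimate_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT -> weight size in GB
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b / 8 }'
}

params_b=41.9   # total parameters in billions, from the summary above

# ASSUMPTION: approximate effective bits per weight for a few quantization levels.
for bpw in 3.5 4.5 5.5 8.5; do
  echo "${bpw} bits/weight: $(estimate_gb "$params_b" "$bpw") GB"
done
```

At 4.5 bits per weight the estimate matches the 23.6 GB download, and the gap to the 24.1 GB VRAM figure is the extra memory needed beyond the weights.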

Is Phi-3.5 MoE censored?

Phi-3.5 MoE is an instruction-tuned model that went through safety-focused post-training, so it may decline some requests. Its responses are also shaped by its training data and by any filters applied during deployment.

Is Phi-3.5 MoE commercial-use allowed?

Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.

Phi-3.5 MoE context length?

Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.
