Can RTX 3090 Ti run Phi-3.5 MoE?

Yes — runs locally

~18 tok/sec · Good — slight pause, then text streams smoothly.

Your VRAM

24 GB

Model size

41.9B

Best quant

Q4_K_M

VRAM needed

24.1 GB

The verdict

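The RTX 3090 Ti (24 GB VRAM) can run Phi-3.5 MoE with the Q4_K_M quantization, but it is a tight fit: the estimated 24.1 GB footprint slightly exceeds the card's 24 GB, so a few layers may need to stay on the CPU. Expected throughput is around 18 tokens/second, which feels good in interactive use: a slight pause, then text streams smoothly. Phi-3.5 MoE is Microsoft's mixture-of-experts model with 16 experts of 3.8B parameters each and 6.6B parameters active per token, offering strong reasoning at modest compute cost.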

Setup tutorial: Phi-3.5 MoE on RTX 3090 Ti

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

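The Phi-3.5 MoE model runs on an NVIDIA GeForce RTX 3090 Ti with grade C performance using the Q4_K_M quantization. This tutorial estimates approximately 26 tokens per second, while the summary above lists ~18 tok/sec, so treat both figures as rough estimates and measure on your own system.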

Prerequisites

Before starting, ensure you have at least 50GB of free disk space, a 64-bit version of Windows or Linux, NVIDIA driver version 470.82.01 or later, and CUDA 11.4 or later installed.
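The driver requirement above can be checked from a terminal. The following is a minimal sketch, not an official check: version_ge is a hypothetical helper name, and it relies on sort -V, which is a GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is at least B.
# Assumes `sort -V` (version sort) is available on this system.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed driver against the minimum named in the prerequisites.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$driver" 470.82.01; then
        echo "driver $driver meets the 470.82.01 minimum"
    else
        echo "driver $driver is older than 470.82.01"
    fi
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```

If nvidia-smi is missing, the script only reports that fact; it does not attempt to install anything.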

Expected performance

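With the specified configuration, you can expect the model to run at approximately 26 tokens per second, using around 24.1GB of VRAM. That estimate exceeds the card's 24GB by about 0.1GB, so no VRAM is left over for additional context when every layer is on the GPU. The model supports a context window of up to 131,072 tokens, but the practical window on this card will be far smaller unless some layers are offloaded to the CPU or a smaller quantization is used.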
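The headroom figure is simply the card's VRAM minus the estimated footprint. As a small sketch using the two numbers quoted on this page (vram_headroom_gb is an illustrative helper name, not part of any tool):

```shell
#!/bin/sh
# Headroom in GB = card VRAM minus estimated model footprint.
# $1 = card VRAM in GB, $2 = estimated footprint in GB
vram_headroom_gb() {
    awk -v have="$1" -v need="$2" 'BEGIN { printf "%.1f\n", have - need }'
}

vram_headroom_gb 24 24.1   # prints -0.1
```

A negative result means the model does not fit entirely in VRAM at that quantization, before any KV cache for context is counted.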

1. Install runtime: Ollama

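# Linux (on Windows, download the installer from https://ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh
ollama --version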

2. Download the model

Download the Phi-3.5 MoE model with Q4_K_M quantization (23.6GB file) from the Hugging Face repository.

ollama pull hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M

3. Run it

ollama run hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M

4. Optimize for RTX 3090 Ti

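For optimal performance on the NVIDIA GeForce RTX 3090 Ti with 24GB VRAM, offload 28 layers to the GPU to use most of the GPU memory while leaving some headroom. Note that ollama run does not accept --n-gpu-layers, --flash-attn, or --tensor-parallelism. In Ollama the layer count is the num_gpu parameter (for example, /set parameter num_gpu 28 inside an interactive session), and flash attention is enabled by setting the OLLAMA_FLASH_ATTENTION=1 environment variable for the Ollama server. No parallelism setting is needed for single-GPU operation. This configuration targets ~26 tok/sec.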
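One way to make the 28-layer setting persistent in Ollama is a Modelfile, where layer offload is the num_gpu parameter. The sketch below is an assumption-laden example: the model name phi35-moe-28l and the num_ctx value of 4096 are illustrative choices, not values from this page, and the FROM line assumes the Hugging Face reference has been pulled or can be pulled by Ollama.

```shell
#!/bin/sh
# Write a Modelfile that pins the GPU layer count (num_gpu) for this card.
# num_ctx 4096 is an assumed, conservative context size to limit KV-cache memory.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
PARAMETER num_gpu 28
PARAMETER num_ctx 4096
EOF

# Register the variant under a hypothetical local name, if Ollama is installed.
if command -v ollama >/dev/null 2>&1; then
    ollama create phi35-moe-28l -f Modelfile
fi
```

After this, ollama run phi35-moe-28l would start the model with those parameters applied.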

Troubleshooting

Out of memory errors during inference

Reduce the number of layers offloaded to the GPU (Ollama's num_gpu parameter) to 24 or lower to free up more VRAM.
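To see how close the card is to its limit while the model is loaded, nvidia-smi can report used and total memory. A minimal sketch follows; free_vram_mib is an illustrative helper, and the sample line at the end is made-up data for demonstration only.

```shell
#!/bin/sh
# Read "used, total" CSV lines (MiB) on stdin and print free MiB per GPU.
free_vram_mib() {
    awk -F', *' '{ print $2 - $1 }'
}

# Live reading, if an NVIDIA driver is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total \
               --format=csv,noheader,nounits | free_vram_mib
fi

# Hypothetical sample line, for illustration only.
echo "23890, 24564" | free_vram_mib   # prints 674
```

If the free figure is near zero when the out-of-memory error appears, lowering the GPU layer count or the context size is the likely fix.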

Slow inference speed

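Ensure flash attention is enabled (OLLAMA_FLASH_ATTENTION=1 for the Ollama server) and run ollama ps to confirm the model is mostly on the GPU rather than split heavily onto the CPU. The --tensor-parallelism flag is not an Ollama option.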
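To check whether a change actually helps, measure throughput rather than relying on the estimates on this page. Ollama's generate API reports eval_count (tokens generated) and eval_duration (nanoseconds), and ollama run with --verbose prints an eval rate directly. The helper below is a sketch of the arithmetic only; tok_per_sec is an illustrative name and the sample numbers are hypothetical.

```shell
#!/bin/sh
# tokens/second = eval_count / (eval_duration in ns / 1e9)
# $1 = eval_count, $2 = eval_duration in nanoseconds
tok_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Hypothetical run: 260 tokens generated in 10 seconds.
tok_per_sec 260 10000000000   # prints 26.0
```

Comparing this figure before and after a settings change shows whether the change was worthwhile on your system.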

Model fails to load

Check if the model file is corrupted and re-download it using the 'ollama pull' command.

Alternative runtimes

Alternatively, you can use LM Studio for a more user-friendly interface, llama.cpp for advanced customization options, or Jan for better performance on smaller models. Choose Ollama for its ease of use and compatibility with large models like Phi-3.5 MoE.


FAQ

What GPU do I need to run Phi-3.5 MoE?

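For the Q4_K_M quantization, Phi-3.5 MoE needs about 24.1 GB of VRAM. A 24 GB card such as the NVIDIA RTX 3090 or RTX 3090 Ti is borderline and may need a few layers offloaded to the CPU; a 48 GB card such as the RTX A6000 holds the model entirely.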

Is Phi-3.5 MoE good for coding?

Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.

Phi-3.5 MoE vs Llama 3.1 8B?

Phi-3.5 MoE has 41.9 billion parameters compared to Llama 3.1 8B's 8 billion, offering more sophisticated reasoning and context handling but requiring significantly more VRAM.

Can I run Phi-3.5 MoE on a Mac?

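Yes, on an Apple Silicon Mac with enough unified memory. The Q4_K_M quantization needs about 24.1 GB, so a machine with 32 GB or more is a realistic starting point. Apple Silicon Macs do not support eGPUs, so an external GPU is not an option there.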

How much VRAM does Phi-3.5 MoE need?

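Phi-3.5 MoE requires about 24.1 GB of VRAM at the Q4_K_M quantization. The requirement changes with quantization level: lower-bit quants need less memory, higher-bit quants need more, and longer contexts add KV-cache memory on top.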
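As a rough illustration of why memory scales with quantization, weight size is approximately parameters times bits per weight divided by 8. The sketch below derives the effective bits per weight from this page's own figures (41.9B parameters, 23.6GB Q4_K_M file) and then applies the same formula to an assumed 8.5 bits per weight, which is the nominal rate of an 8-bit Q8_0 quant. Both helper names are illustrative, and real files differ somewhat because not every tensor uses the same quantization.

```shell
#!/bin/sh
# Effective bits per weight from a file size (GB) and parameter count (billions).
bits_per_weight() {
    awk -v gb="$1" -v params="$2" 'BEGIN { printf "%.2f\n", gb * 8 / params }'
}

# Approximate weight size in GB for a parameter count (billions) and bits per weight.
weights_gb() {
    awk -v params="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params * bpw / 8 }'
}

bits_per_weight 23.6 41.9   # prints 4.51
weights_gb 41.9 8.5         # prints 44.5 (assumed 8.5 bits/weight)
```

These are weight-only estimates; runtime buffers and the KV cache add to the total.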

Is Phi-3.5 MoE censored?

The instruct release of Phi-3.5 MoE is safety-tuned by Microsoft, so it will decline some requests. How restrictive it feels in practice also depends on the system prompt and any filters applied during deployment.

Is Phi-3.5 MoE commercial-use allowed?

Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.

Phi-3.5 MoE context length?

Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.
