Phi-3.5 MoE from Microsoft is a 16-expert Mixture-of-Experts variant of Phi-3.5. Hits MMLU 78.9 with only 6.6 B parameters firing per token, which is remarkable. The 26 GB VRAM bar at Q4 puts it just above the consumer 24 GB sweet spot — comfortable on RTX A6000 or 32 GB+ Apple Silicon.
| Quantization | Bits | File Size | VRAM Needed | RAM Needed | Quality |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 23.605 GB | 24.11 GB | 24.61 GB | 85% |
Context window & KV cache
Adds 2.50 GB to VRAMLong chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.
Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
How to run Phi-3.5 MoE
Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.
GUI. Browse → download → chat. MLX on Apple Silicon.
LM Studio home →- 1
Open LM Studio
Go to the 🔍 Search tab.
- 2
Search for
bartowski/Phi-3.5-MoE-instruct-GGUF - 3
Download
Pick the Q4_K_M quant — best balance of size vs. quality.
- 4
Chat
Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.
Community benchmarks
Real tokens/sec reports from people running Phi-3.5 MoE on actual hardware.
No community runs yet for this model. Be the first to submit your numbers.
Self-host serving plan
Want to host Phi-3.5 MoEfor many users? Or run it on a card that’s technically too small? Slide the knobs.
VRAM needed
26.2 GB
24.1 GB weights + 1.6 GB KV
Aggregate tok/s
28
across 1 user
Per-user tok/s
28
MoE active params
⚠ Will spill 2.2 GB of weights to system RAM (~5× slower per offloaded layer). Use llama.cpp’s --cpu-offload-gb or vLLM’s --swap-space.
Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
See It In Action
Real model outputs generated via RunThisModel.com — watch responses stream in real time.
Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.
how much VRAM do I need to run Phi-3.5 MoE?
Phi-3.5 MoE requires 24.11 GB VRAM minimum with Q4_K_M quantization. For full precision you need 24.11 GB.
which quant should I pick?
Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.
What GPU do I need to run Phi-3.5 MoE?
To run Phi-3.5 MoE, you need a GPU with at least 24.1 GB of VRAM, such as an NVIDIA RTX 3090 or A6000.
Is Phi-3.5 MoE good for coding?
Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.
Phi-3.5 MoE vs Llama 3.1 8B?
Phi-3.5 MoE has 41.9 billion parameters compared to Llama 3.1 8B's 8 billion, offering more sophisticated reasoning and context handling but requiring significantly more VRAM.
Can I run Phi-3.5 MoE on a Mac?
Yes, you can run Phi-3.5 MoE on a Mac with a compatible GPU that has at least 24.1 GB of VRAM, such as an eGPU setup.
How much VRAM does Phi-3.5 MoE need?
Phi-3.5 MoE requires 24.1 GB of VRAM, which is consistent across different quantization levels.
Is Phi-3.5 MoE censored?
Phi-3.5 MoE is not inherently censored, but its responses may be influenced by the training data and any filters applied during deployment.
Is Phi-3.5 MoE commercial-use allowed?
Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.
Phi-3.5 MoE context length?
Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.
Does Phi-3.5 MoE support function calling?
Phi-3.5 MoE supports function calling, allowing it to interact with external systems and APIs for enhanced functionality.
Phi-3.5 MoE quantization options?
Phi-3.5 MoE supports various quantization options, including 8-bit and 4-bit, to reduce memory usage while maintaining performance.
Can Phi-3.5 MoE run on CPU?
While Phi-3.5 MoE can technically run on a CPU, it is highly inefficient and not recommended due to the model's size and computational requirements.
Phi-3.5 MoE fine-tuning?
Phi-3.5 MoE can be fine-tuned on specific datasets to improve performance in particular domains or tasks, though this requires significant computational resources.
Phi-3.5 MoE system requirements?
Phi-3.5 MoE requires a powerful GPU with at least 24.1 GB of VRAM, 64 GB of RAM, and a multi-core CPU to run efficiently.
Phi-3.5 MoE performance benchmark?
Performance benchmarks for Phi-3.5 MoE show it can process around 10-20 tokens per second on a high-end GPU like the NVIDIA A100, depending on the specific task and quantization level.
Phi-3.5 MoE for RAG?
Phi-3.5 MoE is suitable for Retrieval-Augmented Generation (RAG) tasks due to its large context length and strong reasoning capabilities, making it effective for integrating external information.
Phi-3.5 MoE for agents?
Phi-3.5 MoE can be used to create intelligent agents that require advanced natural language understanding and reasoning, thanks to its large model size and context length.
Phi-3.5 MoE for coding vs general?
Phi-3.5 MoE excels in both coding and general tasks, but its large context length and strong reasoning make it particularly well-suited for complex coding scenarios.
Phi-3.5 MoE vs ChatGPT?
Phi-3.5 MoE has a larger context length (131,072 tokens) and more parameters (41.9B) compared to ChatGPT, potentially offering better performance in tasks requiring extensive context and reasoning.
Phi-3.5 MoE download size?
The download size for Phi-3.5 MoE varies depending on the quantization level, but it typically ranges from 15 GB to 30 GB.
Best quant for Phi-3.5 MoE?
The best quantization for Phi-3.5 MoE depends on your specific needs, but 8-bit quantization is often a good balance between performance and memory efficiency.