~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/phi-3.5-moe-instruct
Microsoft · llm
Phi-3.5 MoE
Microsoft MoE — 16 experts of 3.8 B, 6.6 B active per token. Strong reasoning at modest cost.
41.9b paramsphimoemit128K ctx24.1124.11 GB vramMoE
about·model card

Phi-3.5 MoE from Microsoft is a 16-expert Mixture-of-Experts variant of Phi-3.5. Hits MMLU 78.9 with only 6.6 B parameters firing per token, which is remarkable. The 26 GB VRAM bar at Q4 puts it just above the consumer 24 GB sweet spot — comfortable on RTX A6000 or 32 GB+ Apple Silicon.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·1 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.523.605 GB24.11 GB24.61 GB
85%

Context window & KV cache

Adds 2.50 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Phi-3.5 MoE

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/Phi-3.5-MoE-instruct-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Phi-3.5 MoE on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Phi-3.5 MoEfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

26.2 GB

24.1 GB weights + 1.6 GB KV

Aggregate tok/s

28

across 1 user

Per-user tok/s

28

MoE active params

⚠ Will spill 2.2 GB of weights to system RAM (~5× slower per offloaded layer). Use llama.cpp’s --cpu-offload-gb or vLLM’s --swap-space.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Phi-3.5 MoE?

Phi-3.5 MoE requires 24.11 GB VRAM minimum with Q4_K_M quantization. For full precision you need 24.11 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Phi-3.5 MoE?

To run Phi-3.5 MoE, you need a GPU with at least 24.1 GB of VRAM, such as an NVIDIA RTX 3090 or A6000.

Is Phi-3.5 MoE good for coding?

Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.

Phi-3.5 MoE vs Llama 3.1 8B?

Phi-3.5 MoE has 41.9 billion parameters compared to Llama 3.1 8B's 8 billion, offering more sophisticated reasoning and context handling but requiring significantly more VRAM.

Can I run Phi-3.5 MoE on a Mac?

Yes, you can run Phi-3.5 MoE on a Mac with a compatible GPU that has at least 24.1 GB of VRAM, such as an eGPU setup.

How much VRAM does Phi-3.5 MoE need?

Phi-3.5 MoE requires 24.1 GB of VRAM, which is consistent across different quantization levels.

Is Phi-3.5 MoE censored?

Phi-3.5 MoE is not inherently censored, but its responses may be influenced by the training data and any filters applied during deployment.

Is Phi-3.5 MoE commercial-use allowed?

Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.

Phi-3.5 MoE context length?

Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.

Does Phi-3.5 MoE support function calling?

Phi-3.5 MoE supports function calling, allowing it to interact with external systems and APIs for enhanced functionality.

Phi-3.5 MoE quantization options?

Phi-3.5 MoE supports various quantization options, including 8-bit and 4-bit, to reduce memory usage while maintaining performance.

Can Phi-3.5 MoE run on CPU?

While Phi-3.5 MoE can technically run on a CPU, it is highly inefficient and not recommended due to the model's size and computational requirements.

Phi-3.5 MoE fine-tuning?

Phi-3.5 MoE can be fine-tuned on specific datasets to improve performance in particular domains or tasks, though this requires significant computational resources.

Phi-3.5 MoE system requirements?

Phi-3.5 MoE requires a powerful GPU with at least 24.1 GB of VRAM, 64 GB of RAM, and a multi-core CPU to run efficiently.

Phi-3.5 MoE performance benchmark?

Performance benchmarks for Phi-3.5 MoE show it can process around 10-20 tokens per second on a high-end GPU like the NVIDIA A100, depending on the specific task and quantization level.

Phi-3.5 MoE for RAG?

Phi-3.5 MoE is suitable for Retrieval-Augmented Generation (RAG) tasks due to its large context length and strong reasoning capabilities, making it effective for integrating external information.

Phi-3.5 MoE for agents?

Phi-3.5 MoE can be used to create intelligent agents that require advanced natural language understanding and reasoning, thanks to its large model size and context length.

Phi-3.5 MoE for coding vs general?

Phi-3.5 MoE excels in both coding and general tasks, but its large context length and strong reasoning make it particularly well-suited for complex coding scenarios.

Phi-3.5 MoE vs ChatGPT?

Phi-3.5 MoE has a larger context length (131,072 tokens) and more parameters (41.9B) compared to ChatGPT, potentially offering better performance in tasks requiring extensive context and reasoning.

Phi-3.5 MoE download size?

The download size for Phi-3.5 MoE varies depending on the quantization level, but it typically ranges from 15 GB to 30 GB.

Best quant for Phi-3.5 MoE?

The best quantization for Phi-3.5 MoE depends on your specific needs, but 8-bit quantization is often a good balance between performance and memory efficiency.