Can M4 Pro run Phi-3.5 MoE?

Yes — runs locally

~17 tok/sec · Good — slight pause, then text streams smoothly.

Your VRAM: 48 GB
Model size: 41.9B
Best quant: Q4_K_M
VRAM needed: 24.1 GB

The verdict

The M4 Pro (48 GB unified memory) handles Phi-3.5 MoE comfortably using the Q4_K_M quantization, which fits in 24.1 GB. Expected throughput is around 17 tokens/second, which rates as Good in interactive use: a slight pause, then text streams smoothly. Phi-3.5 MoE is Microsoft's mixture-of-experts model, with 16 experts of 3.8B each and 6.6B parameters active per token, giving strong reasoning at modest cost.

Setup tutorial: Phi-3.5 MoE on M4 Pro

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Phi-3.5 MoE runs smoothly on the Apple M4 Pro (Grade A) using the Q4_K_M quantization. Expect roughly 17 to 22 tokens per second with 24.1 GB of memory in use.

Prerequisites

Before starting, ensure you have at least 50 GB of free disk space and the Xcode Command Line Tools installed. Every M4 Pro Mac ships with macOS Sequoia 15 or later, which already satisfies the macOS Ventura 13.0 minimum. You can install Xcode CLT by running `xcode-select --install` in your terminal.

Expected performance

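With the Q4_K_M quantization, you can expect a throughput of approximately 17 to 22 tokens per second, with 24.1 GB of memory in use. Of the 48 GB of unified memory, that leaves 23.9 GB of headroom shared between context, the operating system, and other applications. The model supports a context window of up to 131,072 tokens, though how much of it is practical depends on how much of that headroom macOS makes available to the GPU.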
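The headroom arithmetic above can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the layer count (32), KV head count (8), head dimension (128), and 16-bit cache precision are assumptions about the model's attention layout that do not appear on this page, and the runtime may allocate the cache differently.

```python
GB = 1_000_000_000  # decimal gigabytes, matching the figures quoted on this page


def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes for keys plus values across all layers at a given context length.

    The default architecture numbers are assumptions, not values from this page.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = keys and values
    return tokens * per_token


def headroom_gb(total_gb=48.0, weights_gb=24.1):
    """Memory left after loading the model, using the page's figures."""
    return total_gb - weights_gb


full_context = 131_072
cache_gb = kv_cache_bytes(full_context) / GB
print(f"headroom: {headroom_gb():.1f} GB")
print(f"16-bit KV cache at {full_context} tokens: {cache_gb:.1f} GB")
```

Under these assumptions a full-length context would consume most of the headroom, and macOS keeps part of unified memory for itself, so a shorter context is likely the safer default.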

1. Install runtime: Ollama (preferred on Apple Silicon)

brew install ollama
brew services start ollama

2. Download the model

Download the Q4_K_M quantized version of Phi-3.5 MoE, which is 23.6GB in size.

ollama pull hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M

3. Run it

ollama run hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
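To compare your own throughput with the estimate on this page, one option is to ask the local Ollama server for a completion and read the timing fields it returns. The sketch below assumes the server is listening on its default address (`http://localhost:11434`) and relies on the `eval_count` and `eval_duration` (nanoseconds) fields of the `/api/generate` response. The helper names and the default prompt are illustrative only.

```python
import json
import urllib.request


def tokens_per_second(response):
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model, prompt="Explain mixture-of-experts in two sentences.",
            url="http://localhost:11434/api/generate"):
    """Send one non-streaming request to a local Ollama server and report its speed."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Call `measure()` with the model name exactly as `ollama list` prints it. A single short prompt gives a noisy number, so averaging a few runs is more representative.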

4. Optimize for M4 Pro

On Apple Silicon, Ollama uses the Metal GPU backend automatically, so there is no separate backend or layer setting to enable. After loading the model, run `ollama ps` and check that it reports the model as running on the GPU. With 48 GB of unified memory you have headroom for larger context windows and other tasks, but that memory is shared with macOS and every other running application.

Troubleshooting

Insufficient VRAM to load the model

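Ensure you have at least 24.1 GB of free memory for the model, plus extra for context. On the M4 Pro this comes out of the 48 GB of unified memory shared with the rest of the system, so close other memory-heavy applications and try again.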

Slow inference speed

Run `ollama ps` while the model is loaded and check that it is running on the GPU rather than partly on the CPU. You can also try reducing the context length to improve speed.
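The same GPU check can be scripted against the local Ollama server. This sketch assumes the default address and that each entry returned by `/api/ps` carries `name`, `size`, and `size_vram` fields. The helper names are hypothetical.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of a loaded model's memory that is resident on the GPU (0.0 to 1.0)."""
    size = model_entry["size"]
    return model_entry["size_vram"] / size if size else 0.0


def loaded_models(url="http://localhost:11434/api/ps"):
    """List the models the local Ollama server currently has loaded."""
    with urllib.request.urlopen(url) as reply:
        return json.load(reply)["models"]


def report(models):
    """One human-readable line per loaded model."""
    return [f"{m['name']}: {gpu_fraction(m):.0%} on GPU" for m in models]
```

Printing `report(loaded_models())` while the model is loaded should show a value close to 100%. Anything noticeably lower suggests part of the model has spilled to the CPU, which usually explains slow generation.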

Model not found

Verify that the model was successfully downloaded and is available in the Ollama models directory. Run `ollama list` to check.

Alternative runtimes

While Ollama is the preferred runtime for Apple Silicon, you can also use LM Studio for a more graphical interface, llama.cpp for fine-grained control, or MLX for direct Metal integration. Choose an alternative if you need specific features or better integration with other tools.

FAQ

What GPU do I need to run Phi-3.5 MoE?

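To run Phi-3.5 MoE at Q4_K_M, you need at least 24.1 GB of GPU-accessible memory. A 24 GB card such as the NVIDIA RTX 3090 falls just short, so look at a 48 GB card such as the RTX A6000, or an Apple Silicon Mac with enough unified memory, such as the 48 GB M4 Pro.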

Is Phi-3.5 MoE good for coding?

Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.

Phi-3.5 MoE vs Llama 3.1 8B?

Phi-3.5 MoE has 41.9 billion parameters compared to Llama 3.1 8B's 8 billion, offering more sophisticated reasoning and context handling but requiring significantly more VRAM.

Can I run Phi-3.5 MoE on a Mac?

Yes, you can run Phi-3.5 MoE on an Apple Silicon Mac with enough unified memory to hold the 24.1 GB model, such as the 48 GB M4 Pro. No external GPU is needed, and Apple Silicon Macs do not support eGPUs.

How much VRAM does Phi-3.5 MoE need?

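Phi-3.5 MoE requires 24.1 GB of VRAM at the Q4_K_M quantization. The requirement changes with the quantization level: higher-precision quants need more memory and lower-precision quants need less.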
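A rough way to see how the requirement scales is to multiply the parameter count by the average bits stored per weight. In the sketch below, the 4.5 bits for Q4_K_M is simply what the 23.6 GB download and 41.9B parameters quoted on this page imply. The Q8_0 and F16 values are assumptions added for comparison, and the result covers weights only, not context or runtime overhead.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate weight storage in decimal GB for a given average bit width."""
    return params_billions * bits_per_weight / 8


PARAMS_B = 41.9  # total parameters in billions, from this page

# Average bits per weight. Q4_K_M is implied by this page's numbers;
# Q8_0 and F16 are assumed values for illustration.
BITS = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16.0}

for name, bits in BITS.items():
    print(f"{name}: ~{weights_gb(PARAMS_B, bits):.1f} GB of weights")
```

By this estimate only the 4-bit quant leaves real headroom on a 48 GB machine, which is consistent with Q4_K_M being listed as the best quant here.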

Is Phi-3.5 MoE censored?

Phi-3.5 MoE is an instruction-tuned model, and Microsoft's model card describes safety-focused post-training, so it declines some requests. Its responses are also shaped by the training data and by any filters applied during deployment.

Is Phi-3.5 MoE commercial-use allowed?

Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.

Phi-3.5 MoE context length?

Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.
