Can M3 Max run Phi-3.5 MoE?
Yes — runs locally
~26 tok/sec · Good — slight pause, then text streams smoothly.
The verdict
The M3 Max (128 GB unified memory) handles Phi-3.5 MoE comfortably using the Q4_K_M quantization, which fits in 24.1 GB. Expected throughput is around 26 tokens/second, which rates as Good in interactive use: a slight pause, then text streams smoothly. Phi-3.5 MoE is Microsoft's mixture-of-experts model: 16 experts of 3.8 B parameters each, with 6.6 B active per token, giving strong reasoning at modest compute cost.
Setup tutorial: Phi-3.5 MoE on M3 Max
AI-generated, GPU-specific. Verified commands for your exact hardware.
Run Phi-3.5 MoE on an Apple M3 Max with Ollama using the Q4_K_M quantization. Expect roughly 26 tok/sec, which rates as Good for interactive use.
Prerequisites
Before starting, ensure you have at least 23.6GB of free disk space, macOS Ventura 13.0 or later, and both Xcode Command Line Tools and Homebrew installed (the install step uses `brew`). You can install Xcode CLT by running `xcode-select --install` in your terminal.
Expected performance
With the Q4_K_M quantization, expect the model to run at approximately 26 tokens per second, using 24.1GB of memory. The M3 Max has unified memory rather than dedicated VRAM, so with 128GB installed that leaves 103.9GB of headroom (shared with macOS and other applications), enough for a practical context window of up to 131,072 tokens.
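The headroom figure is simple subtraction, and the same arithmetic applies to other memory configurations. A minimal sketch, assuming a POSIX shell with awk; `headroom_gb` is a hypothetical helper name, not part of any tool:

```shell
# headroom_gb TOTAL_GB MODEL_GB
# Prints the memory left after loading the model, in GB to one decimal place.
headroom_gb() {
  awk -v total="$1" -v model="$2" 'BEGIN { printf "%.1f\n", total - model }'
}

headroom_gb 128 24.1   # 128GB M3 Max, Q4_K_M weights
headroom_gb 36 24.1    # a smaller 36GB configuration
```

The remainder is shared with macOS, other applications, and the context cache, which grows with context length, so treat the result as an upper bound.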
1. Install runtime
Ollama (preferred on Apple Silicon)
brew install ollama
ollama serve   # starts the local server; leave it running in its own terminal
2. Download the model
Download the Phi-3.5 MoE model with Q4_K_M quantization (23.6GB file) from Hugging Face.
ollama pull hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
3. Run it
ollama run hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
4. Optimize for M3 Max
On Apple Silicon, Ollama runs the model on the GPU through Metal automatically, so no backend flag is needed. (MLX is a separate framework, listed under alternative runtimes.) Because the M3 Max uses unified memory, the GPU draws on the same 128GB pool as the CPU; the 24.1GB required by the Q4_K_M quantization leaves ample headroom for large context windows. Close other memory-heavy applications so the whole model stays resident on the GPU.
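Ollama typically loads models with a context window far smaller than the model's maximum, so the headroom is only used if the context length is raised. One way to do that is a Modelfile with the `num_ctx` parameter. A sketch, assuming the model is available locally under the name in the `FROM` line (adjust it to whatever `ollama list` shows); `phi35-moe-long` is a hypothetical label for the new variant:

```shell
# Write a Modelfile that raises the context window to the model's
# 131,072-token maximum. The FROM name is an assumption; match it to `ollama list`.
cat > Modelfile <<'EOF'
FROM hf.co/bartowski/Phi-3.5-MoE-instruct-GGUF:Q4_K_M
PARAMETER num_ctx 131072
EOF

# Build the variant only if the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
  ollama create phi35-moe-long -f Modelfile
fi
```

Start the variant with `ollama run phi35-moe-long`. A larger context needs more memory for its cache, so expect `ollama ps` to report a size above the 24.1GB of weights.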
Troubleshooting
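Model fails to load due to insufficient memory
Ensure at least 24.1GB of unified memory is free for the model, plus room for the context cache. On a 128GB M3 Max this is rarely the limit; on a smaller configuration, consider a lower quantization level or a smaller model.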
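Performance is below 26 tok/sec
Ollama uses the Metal backend automatically on Apple Silicon, so there is no backend setting to change. Run `ollama ps` while the model is loaded and check that the PROCESSOR column reports 100% GPU; a CPU share there means part of the model is not running on the GPU. Passing `--verbose` to `ollama run` prints the measured eval rate after each reply.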
Unified memory issues
Restart your machine to clear any memory leaks. Ensure all unnecessary applications are closed to maximize available unified memory.
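To compare a run against the expected throughput, the rate can be computed from the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that Ollama's `/api/generate` endpoint returns in its final response. A sketch, assuming awk is available; `tok_per_sec` is a hypothetical helper:

```shell
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS
# Converts Ollama's generated-token count and generation time (nanoseconds)
# into tokens per second, to one decimal place.
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Example values: 520 tokens generated in 20 seconds.
tok_per_sec 520 20000000000
```

Only generation time is counted here; model load and prompt processing are reported in separate fields, which is why the first reply after loading feels slower than this figure suggests.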
Alternative runtimes
While Ollama is the preferred runtime for Apple Silicon, you can also use LM Studio for a more graphical interface, llama.cpp for command-line flexibility, or MLX for custom optimizations. Jan is another option but may require additional setup for Apple M3 Max. Choose based on your specific needs and comfort level with the command line.
FAQ
What GPU do I need to run Phi-3.5 MoE?
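At Q4_K_M, Phi-3.5 MoE needs about 24.1 GB of GPU memory for the weights alone. A 24 GB card such as the NVIDIA RTX 3090 falls just short, so use a 48 GB card such as the RTX A6000, an Apple Silicon Mac with enough unified memory, or a lower quantization.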
Is Phi-3.5 MoE good for coding?
Phi-3.5 MoE is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 131,072 tokens.
Phi-3.5 MoE vs Llama 3.1 8B?
Phi-3.5 MoE has 41.9 billion total parameters (6.6 billion active per token) compared to Llama 3.1 8B's 8 billion. It offers more sophisticated reasoning but requires significantly more memory, since all experts must be loaded even though only some are used per token.
Can I run Phi-3.5 MoE on a Mac?
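Yes, you can run Phi-3.5 MoE on an Apple Silicon Mac with enough unified memory to hold the 24.1 GB Q4_K_M model plus working space, such as the 128 GB M3 Max covered here. Apple Silicon Macs do not support eGPUs, so the built-in memory is what counts.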
How much VRAM does Phi-3.5 MoE need?
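Phi-3.5 MoE requires about 24.1 GB at the Q4_K_M quantization. The requirement varies with quantization level: higher-precision quantizations need more memory, lower-precision ones need less, and long contexts add to the total.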
Is Phi-3.5 MoE censored?
Phi-3.5 MoE is not inherently censored, but its responses may be influenced by the training data and any filters applied during deployment.
Is Phi-3.5 MoE commercial-use allowed?
Yes, Phi-3.5 MoE is licensed under the MIT License, allowing for commercial use without additional restrictions.
Phi-3.5 MoE context length?
Phi-3.5 MoE has a context length of 131,072 tokens, which is significantly larger than many other models, enabling it to handle longer and more complex inputs.