~/runthismodel

Can M4 Pro run Phi-3.5 Mini 3.8B?

Grade: S

Yes — runs locally

~62 tok/sec · Instant — feels like typing. No noticeable delay.

Your VRAM: 48 GB
Model size: 3.8B
Best quant: Q8_0
VRAM needed: 4.3 GB

The verdict

The M4 Pro (48 GB unified memory) handles Phi-3.5 Mini 3.8B comfortably using the Q8_0 quantization, which fits in 4.3 GB. Expected throughput is around 62 tokens/second, which feels instant in interactive use: like typing, with no noticeable delay. Phi-3.5 Mini is a tiny but capable 3.8B model that runs on almost any hardware, including phones.

Setup tutorial: Phi-3.5 Mini 3.8B on M4 Pro

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Run Phi-3.5 Mini 3.8B on an Apple M4 Pro with Q8_0 quantization for Grade S performance at ~62 tok/sec.

Prerequisites

Before starting, ensure you have at least 10GB of free disk space, macOS 12.3 or later, and Xcode Command Line Tools installed. You can install Xcode CLT by running `xcode-select --install` in your terminal.

Expected performance

With the Q8_0 quantization, expect roughly 62 tokens per second and about 4.3 GB of memory use. With 48 GB of unified memory, that leaves 43.7 GB of headroom for context. The model supports a context window of up to 131,072 tokens; how much of that is practical depends on the KV cache, which grows with context length.
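
As a rough sanity check on these figures, the following sketch computes the memory headroom and a bandwidth-bound estimate of decode speed. It rests on two assumptions that are not stated on this page: a memory bandwidth of 273 GB/s for the M4 Pro, and the simplification that generating each token reads every model weight once. Treat the result as a ceiling, not a measurement.

```python
def headroom_gb(total_gb: float, model_gb: float) -> float:
    """Memory left over after loading the model weights."""
    return round(total_gb - model_gb, 1)


def bandwidth_bound_tok_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper estimate for decode speed: one full pass over the weights per token."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # 48 GB and 4.3 GB come from this page; 273 GB/s is an assumed hardware figure.
    print(headroom_gb(48, 4.3))
    print(round(bandwidth_bound_tok_per_sec(273, 4.3)))
```

Under those assumptions the ceiling lands close to the ~62 tok/sec quoted in the verdict, which suggests decoding on this machine is limited by memory bandwidth more than by compute.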

1. Install runtime: Ollama (preferred on Apple Silicon)

brew install ollama
brew services start ollama

2. Download the model

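Download the Q8_0 quantized Phi-3.5 Mini 3.8B model (a file of about 4 GB) from Hugging Face. Ollama can pull GGUF files directly from a Hugging Face repository using the `hf.co/<user>/<repo>:<quant>` form.

ollama pull hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0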

3. Run it

ollama run hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0

This opens an interactive chat prompt. Type `/bye` to exit.
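
To check the throughput on your own machine, one option is to call Ollama's local HTTP API and derive tokens per second from the timing fields in the response. The sketch below assumes the server is listening on its default address and that the model name matches what `ollama list` reports; the `MODEL` value and both function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODEL = "hf.co/bartowski/Phi-3.5-mini-instruct-GGUF:Q8_0"  # assumption: use the name from `ollama list`


def tokens_per_second(result: dict) -> float:
    """Decode speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    seconds = result["eval_duration"] / 1e9
    return result["eval_count"] / seconds


def generate(prompt: str, model: str = MODEL) -> dict:
    """Send one non-streaming generation request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, pass the result of `generate("Explain unified memory in two sentences.")` to `tokens_per_second` to get the measured rate. The first request also loads the model, so a second run gives a more representative number.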

4. Optimize for M4 Pro

On Apple Silicon, Ollama uses the Metal GPU backend automatically, so no backend switch is needed to make use of the 48 GB of unified memory. To confirm that the model is running entirely on the GPU, run `ollama ps` while it is loaded and check that the processor column reports 100% GPU. The Q8_0 quantization suits the M4 Pro well: the model is small enough that 8-bit weights, which stay close to the quality of the 16-bit original, fit with ample memory to spare.

Troubleshooting

Low performance or high latency

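Ollama selects the Metal backend on its own on Apple Silicon; there is no backend setting to change. Run `ollama ps` while the model is loaded and check that it reports 100% GPU rather than a CPU/GPU split. If throughput is still low, close other memory-heavy applications and try a shorter context length.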

Out of memory errors

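Reduce the context length or the batch size. Inside an `ollama run` session, set the context length with `/set parameter num_ctx <value>`, for example `/set parameter num_ctx 65536`. To make the setting persistent, add `PARAMETER num_ctx <value>` to a Modelfile and build it with `ollama create`. The batch size can be lowered through the `num_batch` option in API requests.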
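
To see why context length dominates memory use, the sketch below estimates KV-cache size for a given context. The architecture values (32 layers, 32 key/value heads, head dimension 96) and the 16-bit cache are assumptions to verify against the model's published configuration; runtimes that quantize the cache will use less.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, layers: int = 32, kv_heads: int = 32,
                head_dim: int = 96, bytes_per_value: int = 2) -> float:
    """Total cache size in GB (10^9 bytes) for a full context window.

    The default architecture values are assumptions, not figures from this page.
    """
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / 1e9


if __name__ == "__main__":
    for ctx in (8_192, 65_536, 131_072):
        print(ctx, round(kv_cache_gb(ctx), 1))
```

Under those assumptions the cache alone comes to roughly 25.8 GB at 65,536 tokens and 51.5 GB at 131,072 tokens, so the full window may not fit in the available headroom with a 16-bit cache, while halving the context would.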

Model not found

Verify that the model was successfully downloaded and is available in the Ollama models directory. Run `ollama list` to check the available models.
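
The same check can be scripted against Ollama's local HTTP API, which lists installed models at `/api/tags`. This is a minimal sketch assuming the default server address; the helper names are illustrative.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # assumed default local endpoint


def installed_models(tags: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [entry["name"] for entry in tags.get("models", [])]


def find_model(tags: dict, fragment: str) -> list[str]:
    """Return installed model names containing the fragment, ignoring case."""
    return [name for name in installed_models(tags) if fragment.lower() in name.lower()]


def fetch_tags(url: str = TAGS_URL) -> dict:
    """Fetch the list of installed models from a running Ollama server."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

With the server running, `find_model(fetch_tags(), "phi-3.5")` returns the exact names to pass to `ollama run`. An empty list means the pull did not complete.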

Alternative runtimes

While Ollama is the preferred runtime for Apple Silicon, you can also use alternatives like LM Studio for a graphical interface, llama.cpp for more control over quantization, or MLX for direct Metal integration. Use these alternatives if you need specific features or better integration with other tools.

FAQ

What GPU do I need to run Phi-3.5 Mini 3.8B?

Phi-3.5 Mini 3.8B requires a GPU with at least 2.7 GB of VRAM, but 4.3 GB is recommended for optimal performance.

Is Phi-3.5 Mini 3.8B good for coding?

Phi-3.5 Mini 3.8B can generate code and assist with coding, but at 3.8B parameters it is best suited to simpler tasks.

Phi-3.5 Mini 3.8B vs Llama 3.1 8B?

Phi-3.5 Mini 3.8B has 3.8B parameters, making it smaller and more resource-efficient than Llama 3.1 8B, which has 8B parameters and requires more VRAM and computational power.

Can I run Phi-3.5 Mini 3.8B on a Mac?

Yes. Phi-3.5 Mini 3.8B runs on Apple Silicon Macs, which share unified memory between the CPU and GPU; at least 2.7 GB of that memory must be available for the model.

How much VRAM does Phi-3.5 Mini 3.8B need?

Phi-3.5 Mini 3.8B requires a minimum of 2.7 GB of VRAM with a smaller quantization; the recommended Q8_0 quantization needs 4.3 GB.
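
The dependence on quantization can be estimated from the bits stored per weight. The sketch below uses the GGUF block layouts for Q4_0 (18 bytes per 32 weights) and Q8_0 (34 bytes per 32 weights); it covers the weights only, so the figures quoted on this page, which presumably include runtime overhead and some context, come out higher.

```python
# Bits per weight implied by each block layout (32 weights per block).
BITS_PER_WEIGHT = {
    "Q4_0": 18 * 8 / 32,  # 4.5 bits
    "Q8_0": 34 * 8 / 32,  # 8.5 bits
    "F16": 16.0,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(quant, round(weights_gb(3.8, quant), 2))
```

For a 3.8B model this gives roughly 2.1 GB at Q4_0, 4.0 GB at Q8_0, and 7.6 GB at F16.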

Is Phi-3.5 Mini 3.8B censored?

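Phi-3.5 Mini 3.8B is an instruction-tuned model that went through safety-focused post-training, so it may decline requests it judges harmful or inappropriate. When run locally, no separate content filter is applied on top of the model.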

Is Phi-3.5 Mini 3.8B commercial-use allowed?

Yes, Phi-3.5 Mini 3.8B is licensed under the MIT License, which allows for commercial use.

Phi-3.5 Mini 3.8B context length?

Phi-3.5 Mini 3.8B supports a context length of 131,072 tokens, which is quite large and allows for extensive context in conversations and tasks.
