
Can M3 Max run Mixtral 8x7B Instruct?

Grade: S

Yes — runs locally

~26 tok/sec · Good — slight pause, then text streams smoothly.

Your VRAM: 128 GB
Model size: 46.7B
Best quant: Q5_K_M
VRAM needed: 30.5 GB

The verdict

The M3 Max (128 GB of unified memory, which serves as VRAM) runs Mixtral 8x7B Instruct comfortably with the Q5_K_M quantization, which fits in 30.5 GB. Expected throughput is around 26 tokens/second, which rates as Good in interactive use: a slight pause, then text streams smoothly. Mixtral is the original widely available open mixture-of-experts (MoE) model: 8 experts with 2 active per token, about 47B total parameters and about 13B active per token. It is licensed under Apache-2.0.

Setup tutorial: Mixtral 8x7B Instruct on M3 Max

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Run Mixtral 8x7B Instruct on an Apple M3 Max with Grade S performance, using the Q5_K_M quantization for ~26 tok/sec.

Prerequisites

Before starting, make sure you have enough free disk space for the roughly 30 GB model file, macOS Ventura 13.0 or later, Homebrew, and the Xcode Command Line Tools. You can install the Command Line Tools by running `xcode-select --install` in your terminal.

Expected performance

With the Q5_K_M quantization, expect roughly 26 tokens per second, using around 30.5 GB of the M3 Max's unified memory (macOS shares it between CPU and GPU, so it serves as VRAM). With 128 GB in total, that leaves 97.5 GB of headroom, enough for the model's full 32,768-token context window and for long-form text generation.
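The headroom and context figures above can be sanity-checked with a little arithmetic. This is a rough sketch, assuming Mixtral 8x7B's published architecture (32 layers, 8 key-value heads, head dimension 128) and a 16-bit KV cache. Those values are not stated on this page, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Headroom: total unified memory minus the model's footprint, in GB.
headroom_gb() {
  awk -v total="$1" -v used="$2" 'BEGIN { printf "%.1f\n", total - used }'
}

# KV-cache size in MiB for a given context length.
# Assumed (not from this page): 32 layers, 8 KV heads, head dim 128, 2 bytes per value.
kv_cache_mib() {
  local ctx="$1"
  local layers=32 kv_heads=8 head_dim=128 bytes=2
  # The leading factor 2 covers both keys and values.
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
}

headroom_gb 128 30.5     # prints 97.5
kv_cache_mib 32768       # prints 4096
```

Under these assumptions a full 32,768-token context costs about 4 GiB of cache, a small slice of the 97.5 GB headroom.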

1. Install runtime: Ollama (preferred on Apple Silicon)

brew install ollama
brew services start ollama
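Before downloading anything, it can help to confirm that the CLI is installed and the local server is answering. A minimal sketch, assuming Ollama's default port 11434 and its `/api/version` endpoint; the function name `ollama_status` is illustrative.

```shell
#!/usr/bin/env bash
# Report whether the ollama CLI is on PATH and whether the local server answers.
ollama_status() {
  if ! command -v ollama >/dev/null 2>&1; then
    echo "missing"
    return 0
  fi
  # 11434 is Ollama's default port; adjust if OLLAMA_HOST points elsewhere.
  if curl -fsS --max-time 2 http://localhost:11434/api/version >/dev/null 2>&1; then
    echo "running"
  else
    echo "installed, server not running"
  fi
}

ollama_status
```

If this prints "installed, server not running", start the server before pulling the model.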

2. Download the model

Download the Q5_K_M quantized Mixtral 8x7B Instruct model (30.0GB file) from Hugging Face. Ollama can pull GGUF repositories directly when the name starts with `hf.co/` and the tag names the quantization.

ollama pull hf.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF:Q5_K_M

3. Run it

ollama run hf.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF:Q5_K_M

This opens an interactive chat prompt. Type /bye to exit.

4. Optimize for M3 Max

Ollama uses Metal GPU acceleration on Apple Silicon by default, so no extra backend setup is needed; MLX is a separate runtime (see Alternative runtimes). Because the M3 Max's 128 GB of unified memory is shared between CPU and GPU, the whole model stays in GPU-accessible memory with no copying between separate pools. The remaining headroom is best spent on a larger context window.
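One concrete way to use that headroom is to raise Ollama's context window, whose default is typically smaller than the model's 32,768-token maximum. The sketch below writes a Modelfile that sets `num_ctx`. It assumes the model was pulled under the `hf.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF:Q5_K_M` reference, and `mixtral-32k` is a hypothetical local name.

```shell
#!/usr/bin/env bash
# Write a Modelfile that reuses the pulled model and raises the context window.
write_modelfile() {
  local path="$1"
  cat > "$path" <<'EOF'
FROM hf.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF:Q5_K_M
PARAMETER num_ctx 32768
EOF
}

write_modelfile ./Modelfile

# "mixtral-32k" is a hypothetical name; the create step is skipped if ollama is absent.
if command -v ollama >/dev/null 2>&1; then
  ollama create mixtral-32k -f ./Modelfile
fi
```

After that, `ollama run mixtral-32k` should start the model with the larger window. For a single session, typing `/set parameter num_ctx 32768` at the interactive prompt has a similar effect.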

Troubleshooting

Model fails to load due to insufficient VRAM

Make sure about 30.5 GB of unified memory is free for the Q5_K_M quantization. If it is not, close other applications to free up memory.

Slow inference speed

Check that the model is running on the GPU. While the model is loaded, run `ollama ps` and confirm that the PROCESSOR column reports GPU rather than CPU.
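To put a number on inference speed, you can compute tokens per second from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` response reports. A small sketch; the function name is illustrative and the sample values are made up for the example, not measurements.

```shell
#!/usr/bin/env bash
# Tokens per second from a generated-token count and a duration in nanoseconds.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Illustrative values: 260 tokens generated in 10 seconds.
tokens_per_sec 260 10000000000   # prints 26.0
```

Alternatively, adding `--verbose` to `ollama run` prints an eval rate after each response, which can be compared directly with the expected throughput.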

Inference crashes or hangs

Restart the Ollama service with `brew services restart ollama` (this assumes Ollama was installed and started through Homebrew). Make sure all dependencies are up to date with `brew update && brew upgrade`.

Alternative runtimes

While Ollama is the preferred runtime for Apple Silicon, you can also consider alternatives like LM Studio, llama.cpp, and MLX. LM Studio offers a graphical interface and is useful for users who prefer a GUI. llama.cpp is a lightweight option for command-line users, while MLX provides additional flexibility for custom model deployments. Choose the runtime based on your specific needs and preferences.


FAQ

What GPU do I need to run Mixtral 8x7B Instruct?

To run Mixtral 8x7B Instruct, you need a GPU with at least 25.1 GB of VRAM (or that much free unified memory on a Mac); 30.5 GB is enough for the Q5_K_M quantization.

Is Mixtral 8x7B Instruct good for coding?

Mixtral 8x7B Instruct is well-suited for coding tasks due to its large context length of 32,768 tokens and strong language understanding capabilities.

Mixtral 8x7B Instruct vs Llama 3.1 8B?

Mixtral 8x7B Instruct has far more parameters (46.7B vs 8B), which helps on complex tasks but requires much more VRAM. Llama 3.1 8B has the longer context window: it supports 128K tokens, compared with Mixtral's 32,768.

Can I run Mixtral 8x7B Instruct on a Mac?

Yes, you can run Mixtral 8x7B Instruct on a Mac, but you will need an Apple Silicon Mac (M1 or later) with enough unified memory to hold the model.

How much VRAM does Mixtral 8x7B Instruct need?

Mixtral 8x7B Instruct requires between 25.1 GB and 30.5 GB of VRAM, depending on the quantization level used.
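The two figures can be related to quantization level with a back-of-the-envelope calculation: memory in GB times 8, divided by parameters in billions, gives the implied bits per weight. This sketch treats the whole footprint as weights, which is an assumption and slightly overstates the result if the figures include runtime overhead; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Implied bits per weight from a memory footprint (GB) and a parameter count (billions).
implied_bpw() {
  awk -v gb="$1" -v params_b="$2" 'BEGIN { printf "%.2f\n", gb * 8 / params_b }'
}

implied_bpw 30.5 46.7   # prints 5.22
implied_bpw 25.1 46.7   # prints 4.30
```

So the 30.5 GB figure corresponds to a little over 5 bits per weight and the 25.1 GB figure to a little over 4, which is the trade-off a lower quantization level makes.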

Is Mixtral 8x7B Instruct censored?

Largely no. Mistral AI's model card states that Mixtral 8x7B Instruct has no built-in moderation mechanisms, so any content filtering has to be added by the application that uses it.

Is Mixtral 8x7B Instruct commercial-use allowed?

Yes, Mixtral 8x7B Instruct is licensed under the Apache-2.0 license, which allows for commercial use.

Mixtral 8x7B Instruct context length?

The context length of Mixtral 8x7B Instruct is 32,768 tokens, allowing it to handle very long inputs and maintain context over extended conversations.
