Can M4 Max run Qwen3 8B Base?

Yes — runs locally

~48 tok/sec · Fast — smooth conversation. Responses feel real-time.

Your VRAM: 128 GB

Model size: 8B parameters (16.0 GB on disk at BF16)

Best quant: BF16

VRAM needed: 16.5 GB

The verdict

The M4 Max (128 GB VRAM) handles Qwen3 8B Base comfortably at BF16 precision, which fits in 16.5 GB. Expected throughput is around 48 tokens/second, fast enough for smooth, real-time conversation in interactive use. This is the official Qwen3 8B foundation model: pretrained only, with no RLHF or refusal training. It is the 'naturally uncensored' option, needing no abliteration because alignment was never applied. Licensed under Apache 2.0.

Setup tutorial: Qwen3 8B Base on M4 Max

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Run Qwen3 8B Base on an Apple M4 Max with Grade S performance at ~48 tok/sec using BF16 weights. Requires 16.5GB VRAM and 16.0GB disk space.

Prerequisites

Before starting, ensure you have at least 16.0GB of free disk space, an up-to-date macOS (M4 Max machines ship with macOS 15, which is recent enough), Homebrew, and Xcode Command Line Tools installed. You can install Xcode CLT by running `xcode-select --install` in your terminal.
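A quick preflight check can confirm these prerequisites before anything is downloaded. The script below is a minimal sketch, not official tooling: the `free_gb` helper name and the 16 GB threshold are illustrative assumptions, and the macOS-specific checks are skipped on other systems.

```shell
#!/bin/sh
# Preflight sketch: free disk space, macOS version, chip, Xcode CLT.
# free_gb is a hypothetical helper; REQUIRED_GB mirrors the 16.0GB figure above.
REQUIRED_GB=16

free_gb() {
  # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is available space.
  df -Pk "${1:-.}" | awk 'NR==2 { print int($4 / 1048576) }'
}

avail=$(free_gb "${HOME:-.}")
if [ "$avail" -ge "$REQUIRED_GB" ]; then
  echo "disk: ${avail} GB free (ok)"
else
  echo "disk: only ${avail} GB free, need ${REQUIRED_GB} GB"
fi

# The remaining checks only make sense on macOS.
if [ "$(uname -s)" = "Darwin" ]; then
  echo "macOS: $(sw_vers -productVersion)"
  echo "chip: $(sysctl -n machdep.cpu.brand_string)"
  if xcode-select -p >/dev/null 2>&1; then
    echo "Xcode CLT: installed"
  else
    echo "Xcode CLT: missing (run xcode-select --install)"
  fi
fi
```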

Expected performance

You can expect ~48 tok/sec with 16.5GB VRAM in use. The remaining 111.5GB leaves ample room for the KV cache, so the full 32768-token context window is practical.
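To see why the full context fits easily, the KV-cache size can be estimated with simple arithmetic. The architecture values below (36 layers, 8 key/value heads, head dimension 128) are assumptions based on the published Qwen3-8B configuration and should be checked against the model's config.json; the cache is assumed to use 16-bit entries.

```shell
#!/bin/sh
# KV-cache size estimate for a full 32768-token context.
# Architecture values are assumptions; verify them against config.json.
LAYERS=36        # transformer layers (assumed)
KV_HEADS=8       # key/value heads under grouped-query attention (assumed)
HEAD_DIM=128     # dimension per head (assumed)
BYTES=2          # bytes per 16-bit cache entry
CTX=32768        # context length from the text above

# The factor 2 covers keys plus values.
per_token=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
total_mib=$((per_token * CTX / 1048576))

echo "KV cache per token: ${per_token} bytes"
echo "KV cache at ${CTX} tokens: ${total_mib} MiB"
```

Under these assumptions the cache comes to 4608 MiB (4.5 GiB) at full context, a small fraction of the 111.5GB of headroom.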

1. Install runtime: Ollama (preferred on Apple Silicon)

brew install ollama
brew services start ollama

2. Download the model

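Download the BF16 model (16.0GB file) from Hugging Face. Ollama pulls GGUF files from Hugging Face using the `hf.co/{username}/{repository}:{quant}` form, so point it at a GGUF build of Qwen3 8B Base. The `{username}/{repository}` part below is a placeholder, not a real repository name.

ollama pull hf.co/{username}/{repository}:BF16

3. Run it

ollama run hf.co/{username}/{repository}:BF16

Ollama uses the GPU through Metal automatically on Apple Silicon, so no device or quantization flags are needed. `ollama run` opens an interactive prompt directly.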
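Because this is a pretrained-only model, it tends to work better with completion-style prompts than with chat turns. The sketch below sends such a prompt to Ollama's local HTTP API. It assumes the server listens on the default port 11434, and `MODEL` is a placeholder to replace with the name shown by `ollama list`.

```shell
#!/bin/sh
# Completion-style request against a local Ollama server.
# MODEL is a placeholder: substitute the name shown by `ollama list`.
MODEL="${MODEL:-your-qwen3-8b-base-model}"
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# "raw": true skips the prompt template, which suits a pretrained-only model.
payload=$(printf '{"model": "%s", "prompt": "%s", "raw": true, "stream": false, "options": {"num_predict": 64}}' \
  "$MODEL" "The three laws of thermodynamics are")

# Falls back to a message when the server is not running.
curl -s --max-time 120 "$OLLAMA_URL/api/generate" -d "$payload" \
  || echo "Ollama server not reachable at $OLLAMA_URL"
```

The response is a JSON object whose `response` field holds the generated continuation.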

4. Optimize for M4 Max

On the Apple M4 Max with 128GB VRAM, Ollama runs the model on the GPU through Metal automatically, so there is no separate MPS setting to enable. Thanks to the unified memory architecture, the CPU and GPU share one memory pool and weights do not have to be copied between them. With 16.5GB VRAM used by the model, you have 111.5GB of headroom for context and other tasks.
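One practical way to use that headroom is a larger context window, since Ollama typically loads models with a context well below the model maximum unless told otherwise. The sketch below writes a Modelfile that sets `num_ctx` and builds a variant from it. `BASE_MODEL` and the variant name `qwen3-8b-base-32k` are placeholders chosen for illustration.

```shell
#!/bin/sh
# Build a model variant with a 32768-token context window.
# BASE_MODEL and VARIANT are placeholders, not real registry names.
BASE_MODEL="${BASE_MODEL:-your-qwen3-8b-base-model}"
VARIANT="qwen3-8b-base-32k"

# num_ctx sets the context window in tokens.
cat > Modelfile <<EOF
FROM ${BASE_MODEL}
PARAMETER num_ctx 32768
EOF

if command -v ollama >/dev/null 2>&1; then
  ollama create "$VARIANT" -f Modelfile \
    || echo "ollama create failed; is the base model pulled?"
else
  echo "ollama not found; Modelfile written for later use"
fi
```

After a successful build, `ollama run qwen3-8b-base-32k` would start the variant with the larger window.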

Troubleshooting

Model fails to load due to insufficient VRAM

The model needs about 16.5GB, well within the M4 Max's 128GB. If loading still fails, close other memory-heavy applications and try reducing the context window.

Slow performance or high CPU usage

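Run `ollama ps` while the model is loaded and check that the PROCESSOR column reports the GPU. Ollama uses Metal automatically on Apple Silicon, so a CPU-heavy split usually points to memory pressure from other applications rather than a missing setting.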
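A small wrapper makes this check easy to repeat. It is a sketch only: the `gpu_check` function name is hypothetical, and the exact layout of the `ollama ps` output may vary between Ollama versions.

```shell
#!/bin/sh
# Show loaded models and where they are running.
# gpu_check is a hypothetical helper name used for illustration.
gpu_check() {
  if command -v ollama >/dev/null 2>&1; then
    # The PROCESSOR column reports the CPU/GPU split for each loaded model.
    ollama ps || echo "could not query the Ollama server"
  else
    echo "ollama not found on PATH"
  fi
}

gpu_check
```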

Ollama initialization fails

Restart the server with `brew services restart ollama`, or run `ollama serve` in a terminal to see its log output, and ensure Xcode Command Line Tools are installed.

Alternative runtimes

While Ollama is the preferred runtime for Apple Silicon, you can also use LM Studio for a GUI-based interface, llama.cpp for more control over quantization, or MLX for direct Metal integration. Jan is another lightweight option but may not offer the same performance benefits as Ollama on the Apple M4 Max.

FAQ

What GPU do I need to run Qwen3 8B Base?

To run Qwen3 8B Base, you need a GPU with at least 5.3 GB of VRAM for the smallest quantization, up to 16.5 GB for BF16. An NVIDIA GPU such as the 12 GB RTX 3060 or better covers the quantized variants; BF16 needs a card with more than 16.5 GB of VRAM.

Is Qwen3 8B Base good for coding?

Qwen3 8B Base can handle coding tasks such as code completion and few-shot code generation. As a pretrained-only model it continues text rather than following chat-style instructions, and it is not as specialized as models trained specifically for coding.

Qwen3 8B Base vs Llama 3.1 8B?

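Qwen3 8B Base has a context length of 32,768 tokens, while Llama 3.1 8B supports a longer context of up to 128K tokens. On licensing, Qwen3 8B Base uses Apache 2.0, whereas Llama 3.1 8B is released under Meta's Llama 3.1 Community License, which carries additional usage terms, so Qwen3 8B Base is the more permissive choice for commercial use.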

Can I run Qwen3 8B Base on a Mac?

Yes, you can run Qwen3 8B Base on a Mac with an M1 or later chip and enough unified memory for the chosen quantization (5.3 GB to 16.5 GB). No Docker or separate GPU driver is needed: runtimes such as Ollama, LM Studio, llama.cpp, and MLX use the GPU through Metal, which ships with macOS.

How much VRAM does Qwen3 8B Base need?

The VRAM requirement for Qwen3 8B Base ranges from 5.3 GB to 16.5 GB, depending on the quantization level used. Lower-bit quantizations need less VRAM but can slightly reduce output quality.

Is Qwen3 8B Base censored?

No, Qwen3 8B Base is not censored. It is a foundation model with no alignment or refusal training, so it continues text without built-in refusals. For the same reason, it is not tuned to follow chat-style instructions.

Is Qwen3 8B Base commercial-use allowed?

Yes, Qwen3 8B Base is licensed under Apache 2.0, which allows commercial use, modification, and distribution, provided you retain the license text and copyright notices.

Qwen3 8B Base context length?

Qwen3 8B Base has a context length of 32,768 tokens (32K), enough for long documents or extended multi-turn prompts.
