
Can M3 Max run Qwen3 8B Base?

Grade: S

Yes — runs locally

~48 tok/sec · Fast — smooth conversation. Responses feel real-time.

Your VRAM: 128 GB
Model size: 8B
Best quant: BF16
VRAM needed: 16.5 GB

The verdict

The M3 Max (128 GB of unified memory, shown here as VRAM) handles Qwen3 8B Base comfortably at BF16 precision, which fits in 16.5 GB. Expected throughput is around 48 tokens/second, which feels fast in interactive use: conversation is smooth and responses feel real-time. This is the official Qwen3 8B foundation model: pretrained only, with no RLHF or refusal training. It is the 'naturally uncensored' option, since no abliteration is needed when alignment was never applied. Licensed under Apache 2.0.

Setup tutorial: Qwen3 8B Base on M3 Max

AI-generated, GPU-specific. Verified commands for your exact hardware.

TL;DR

Run Qwen3 8B Base on an Apple M3 Max with Grade S performance, using the BF16 quantization for ~48 tok/sec.

Prerequisites

Before starting, ensure you have at least 20GB of free disk space, macOS Ventura 13.0 or later, and Xcode Command Line Tools installed. You can install Xcode CLT by running `xcode-select --install` in your terminal.
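
The prerequisites above can be checked from the terminal. The following is a minimal shell sketch rather than part of the original tutorial: the `free_gb` helper is a hypothetical name, and the 20 GB threshold is simply the figure quoted above.

```shell
#!/usr/bin/env bash
# free_gb PATH -> whole GiB free on the filesystem holding PATH.
# Hypothetical helper; relies on the POSIX `df -P` column layout.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

required_gb=20   # disk-space figure from the prerequisites
available_gb="$(free_gb "${HOME:-/}")"
if [ "$available_gb" -ge "$required_gb" ]; then
  echo "disk: ok (${available_gb} GB free)"
else
  echo "disk: only ${available_gb} GB free, need ${required_gb} GB"
fi

# xcode-select -p prints the developer directory when the tools are installed.
if command -v xcode-select >/dev/null 2>&1 && xcode-select -p >/dev/null 2>&1; then
  echo "xcode clt: installed"
else
  echo "xcode clt: missing (run: xcode-select --install)"
fi
```

On macOS, `sw_vers -productVersion` prints the OS version if you also want to confirm you are on Ventura 13.0 or later.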

Expected performance

With the BF16 quantization, expect ~48 tok/sec while using 16.5GB of VRAM. That leaves 111.5GB (128GB minus 16.5GB) for the KV cache and other applications, so the full 32768-token context window is practical, which suits long-form text generation and complex tasks.
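
The memory figures can be sanity-checked with a little arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: the parameter count (8.2 billion) and the attention shape (36 layers, 8 key/value heads, head dimension 128) are assumed to match the published Qwen3 8B configuration and do not come from this page, and the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Rough memory budget for a dense transformer.
# All model-shape numbers passed in below are assumptions.

# weights_gb PARAMS_BILLION BYTES_PER_PARAM -> weight size in GB (decimal)
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

# kv_cache_gb LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS BYTES_PER_VALUE
# Keys and values are both cached, hence the leading factor of 2.
kv_cache_gb() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * c * b / 1e9 }'
}

# BF16 stores 2 bytes per parameter.
echo "weights:  $(weights_gb 8.2 2) GB"
# 16-bit KV cache at the full 32768-token context.
echo "KV cache: $(kv_cache_gb 36 8 128 32768 2) GB at 32768 tokens"
```

Under these assumptions the weights plus a full-length KV cache come to roughly 21 GB, well inside the 128GB pool.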

1. Install runtime: Ollama (preferred on Apple Silicon)

brew install ollama
# Start the local server; leave it running and use a second terminal for the next steps.
ollama serve
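
Before pulling a model, it can help to confirm that the Ollama server is answering. This is a small sketch that assumes the default address `http://localhost:11434`; `check_ollama` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_ollama [BASE_URL] prints "reachable" or "not reachable".
# Assumes the server listens on the default port 11434 unless a URL is given.
check_ollama() {
  local base_url="${1:-http://localhost:11434}"
  if curl --silent --fail --max-time 2 "$base_url/api/version" >/dev/null; then
    echo "reachable"
  else
    echo "not reachable"
  fi
}

check_ollama
```

If this prints "not reachable", the server is probably not running yet.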

2. Download the model

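Download a BF16 GGUF build of the model (a 16.0GB file) from Hugging Face. Ollama reaches Hugging Face through the `hf.co/` prefix and expects a GGUF repository, so the bare name `Qwen/Qwen3-8B-Base` will not resolve to the Hugging Face repo. Substitute a GGUF repository you trust for the placeholder below.

# <user>/<gguf-repo> is a placeholder, not a real repository name.
ollama pull hf.co/<user>/<gguf-repo>:BF16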

3. Run it

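# Ollama uses the Metal GPU backend automatically; `ollama run` takes no device or quantization flag.
# <user>/<gguf-repo> is a placeholder for the repository you pulled in step 2.
ollama run hf.co/<user>/<gguf-repo>:BF16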

4. Optimize for M3 Max

Ollama uses Apple's Metal GPU backend automatically on Apple Silicon, so no device flag is needed. Because the M3 Max has unified memory, the GPU shares the same 128GB pool as the CPU: the 16.5GB the model requires at BF16 fits easily, leaving 111.5GB for context and other tasks.

Troubleshooting

Error: MPS device not found

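Ollama selects the Metal GPU backend automatically on Apple Silicon. If the GPU is not detected, update macOS to the latest version and make sure Xcode Command Line Tools are installed: run `xcode-select --install`, then restart your terminal.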

Low token generation speed

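Confirm the model is running on the GPU with `ollama ps`, and check which quantization is loaded with `ollama show <model>`. BF16 is the largest variant, so a smaller quantization will usually generate tokens faster. If the download may be incomplete, re-run the `ollama pull` command.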

Out of memory errors

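Reduce the context length (or batch size) to fit within the available memory. `ollama run` has no context-length flag; inside an interactive session, run `/set parameter num_ctx 8192` (or another value), or set `PARAMETER num_ctx` in a Modelfile.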
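
One way to make a smaller context window persistent is a Modelfile that sets `num_ctx`. The sketch below only writes the file; the model names `my-qwen3-8b-base` and `qwen3-base-8k` are placeholders, `write_modelfile` is a hypothetical helper, and 8192 is an arbitrary example value.

```shell
#!/usr/bin/env bash
# write_modelfile BASE_MODEL NUM_CTX [PATH] writes a minimal Modelfile.
write_modelfile() {
  local base_model="$1" num_ctx="$2" path="${3:-Modelfile}"
  cat > "$path" <<EOF
FROM $base_model
PARAMETER num_ctx $num_ctx
EOF
}

# "my-qwen3-8b-base" stands in for whatever name the model was pulled under.
write_modelfile "my-qwen3-8b-base" 8192 ./Modelfile
cat ./Modelfile

# Then, with Ollama installed:
#   ollama create qwen3-base-8k -f ./Modelfile
#   ollama run qwen3-base-8k
```

A smaller `num_ctx` shrinks the KV cache, which is usually the first thing to cut when memory runs short.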

Alternative runtimes

While Ollama is the preferred runtime for Apple Silicon, you can also consider LM Studio for a more graphical interface, llama.cpp for lightweight deployment, or MLX for custom optimizations. Jan is another option for advanced users who need fine-grained control over the inference process.

FAQ

What GPU do I need to run Qwen3 8B Base?

To run Qwen3 8B Base, you need a GPU with at least 5.3 GB of VRAM for the lowest quantization level and 16.5 GB for BF16. NVIDIA GPUs like the RTX 3060 or higher are recommended for the quantized variants; BF16 needs a card with at least 16.5 GB of VRAM.

Is Qwen3 8B Base good for coding?

Qwen3 8B Base is suitable for coding tasks, offering strong natural language understanding and code generation capabilities, though it may not be as specialized as models trained specifically for coding.

Qwen3 8B Base vs Llama 3.1 8B?

Llama 3.1 8B supports a longer context window (128K tokens) than Qwen3 8B Base (32,768 tokens). Qwen3 8B Base is released under Apache 2.0, which is more permissive for commercial use than the Llama 3.1 Community License.

Can I run Qwen3 8B Base on a Mac?

Yes. You need a Mac with an M1 or later chip and enough unified memory for the quantization you choose. No Docker or separate GPU driver is required: runtimes such as Ollama use the Metal support built into macOS.

How much VRAM does Qwen3 8B Base need?

The VRAM requirement for Qwen3 8B Base ranges from 5.3 GB to 16.5 GB, depending on the quantization level used. Lower-bit quantizations need less VRAM but can slightly reduce output quality.

Is Qwen3 8B Base censored?

No, Qwen3 8B Base is not censored. It is a foundation model without alignment or refusal training, allowing for more natural and uncensored responses.

Is Qwen3 8B Base commercial-use allowed?

Yes. Qwen3 8B Base is licensed under Apache 2.0, which allows commercial use, modification, and distribution, provided you keep the license and attribution notices.

Qwen3 8B Base context length?

Qwen3 8B Base has a context length of 32,768 tokens, enough for long documents and extended multi-turn conversations.
