~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/gemma-3-moe-9b
Google · llm
Gemma 3 MoE 9B
Gemma 3 MoE variant. 9 B total, 2.5 B active. Strong fit for 12 GB cards.
9b paramsgemma3-moegemma8K ctx77 GB vramMoE
about·model card

Gemma 3 MoE 9B is Google take on the open MoE recipe. 9 B total / 2.5 B active makes it the natural step-up from Gemma 3 4B for users with 12 GB cards. Same Gemma license terms apply, so commercial use is permitted with attribution but not unrestricted.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·1 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.55.5 GB7 GB10 GB
85%

Context window & KV cache

Adds 1.00 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 8K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Gemma 3 MoE 9B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/gemma-3-moe-9b-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Gemma 3 MoE 9B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Gemma 3 MoE 9Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

8.3 GB

7.0 GB weights + 0.8 GB KV

Aggregate tok/s

100

across 1 user

Per-user tok/s

100

MoE active params

✅ Fits in 24 GB VRAM with 15.8 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Gemma 3 MoE 9B?

Gemma 3 MoE 9B requires 7 GB VRAM minimum with Q4_K_M quantization. For full precision you need 7 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Gemma 3 MoE 9B?

To run Gemma 3 MoE 9B, you need a GPU with at least 12 GB of VRAM. The model requires 7.0 GB of VRAM, but a 12 GB card is recommended for optimal performance.

Is Gemma 3 MoE 9B good for coding?

Gemma 3 MoE 9B is well-suited for coding tasks due to its strong contextual understanding and ability to generate coherent code snippets. However, specialized models like Codex may offer more tailored performance for coding-specific tasks.

Gemma 3 MoE 9B vs Llama 3.1 8B?

Gemma 3 MoE 9B has 9 billion parameters and a context length of 8192 tokens, while Llama 3.1 8B has 8 billion parameters and a context length of 2048 tokens. Gemma 3 MoE 9B generally offers better performance in tasks requiring longer context and more parameters.

Can I run Gemma 3 MoE 9B on a Mac?

Yes, you can run Gemma 3 MoE 9B on a Mac with an M1 or M2 chip, but you will need to ensure you have the necessary dependencies and libraries installed. A GPU with at least 12 GB of VRAM is still recommended for optimal performance.

How much VRAM does Gemma 3 MoE 9B need?

Gemma 3 MoE 9B requires 7.0 GB of VRAM, but a GPU with at least 12 GB of VRAM is recommended to handle the model efficiently.

Is Gemma 3 MoE 9B censored?

Gemma 3 MoE 9B is not inherently censored, but it adheres to ethical guidelines and may filter out harmful or inappropriate content during inference.

Is Gemma 3 MoE 9B commercial-use allowed?

Gemma 3 MoE 9B is licensed under the 'gemma' license, which allows for commercial use. However, you should review the specific terms of the license for any restrictions or requirements.

Gemma 3 MoE 9B context length?

Gemma 3 MoE 9B has a context length of 8192 tokens, allowing it to process and generate text with a longer context compared to many other models.

Does Gemma 3 MoE 9B support function calling?

Gemma 3 MoE 9B supports function calling, enabling it to interact with external systems and APIs, enhancing its capabilities for complex tasks.

Gemma 3 MoE 9B quantization options?

Gemma 3 MoE 9B supports various quantization options, including 8-bit and 4-bit quantization, which can reduce the model's memory footprint and improve inference speed without significant loss in performance.

Can Gemma 3 MoE 9B run on CPU?

While Gemma 3 MoE 9B can technically run on a CPU, it is highly inefficient and slow. A GPU with at least 12 GB of VRAM is strongly recommended for practical use.

Gemma 3 MoE 9B fine-tuning?

Gemma 3 MoE 9B can be fine-tuned on specific datasets to improve performance on particular tasks. Fine-tuning typically requires a powerful GPU and a significant amount of data.

Gemma 3 MoE 9B system requirements?

To run Gemma 3 MoE 9B, you need a system with at least 12 GB of GPU VRAM, 32 GB of RAM, and a modern CPU. Additionally, ensure you have the necessary software dependencies installed.

Gemma 3 MoE 9B performance benchmark?

Gemma 3 MoE 9B can process around 100-150 tokens per second on a high-end GPU like the RTX 3090. Performance can vary based on the specific hardware and quantization used.

Gemma 3 MoE 9B for RAG?

Gemma 3 MoE 9B can be used for Retrieval-Augmented Generation (RAG) tasks, leveraging its strong contextual understanding and ability to generate coherent text based on retrieved information.

Gemma 3 MoE 9B for agents?

Gemma 3 MoE 9B is suitable for creating conversational agents due to its large context length and ability to maintain coherent dialogue over extended interactions.

Gemma 3 MoE 9B for coding vs general?

Gemma 3 MoE 9B performs well in both coding and general text generation tasks. However, for specialized coding tasks, models like Codex might offer more tailored performance.

Gemma 3 MoE 9B vs ChatGPT?

Gemma 3 MoE 9B has a larger context length (8192 tokens) and is designed for local deployment, while ChatGPT is a cloud-based service with a smaller context length (2048 tokens). Gemma 3 MoE 9B is better suited for tasks requiring longer context and local execution.

Gemma 3 MoE 9B download size?

The download size for Gemma 3 MoE 9B is approximately 18 GB for the full model, but this can vary depending on the quantization level used.

Best quant for Gemma 3 MoE 9B?

The best quantization for Gemma 3 MoE 9B depends on your specific needs. 8-bit quantization offers a good balance between performance and memory efficiency, while 4-bit quantization further reduces memory usage with a slight trade-off in performance.