~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/gemma-3-12b-it
Google · llm
Gemma 3 12B
High quality 12B model. Excellent for iPad Pro and Mac.
12b paramsgemma3gemma32K ctx7.312.15 GB vram
about·model card

Gemma 3 12B is a large language model developed by Google, featuring 12 billion parameters and an impressive context length of 32,768 tokens. This model excels in generating high-quality text across a wide range of tasks, including but not limited to, creative writing, summarization, and question-answering. Its extensive context window allows it to maintain coherence over longer passages, making it particularly suitable for tasks that require deep understanding and long-term memory.

In its size class, Gemma 3 12B holds its own, offering a balance between performance and resource efficiency. While it may not outperform the largest models in terms of raw capabilities, it provides a compelling trade-off between computational demands and output quality. The model supports quantization options like Q4_K_M and Q8_0, which help reduce the VRAM requirements to a range of 7.3 to 12.2 GB, making it feasible for users with mid-range GPUs. Ideal for researchers, developers, and enthusiasts who need a powerful yet manageable LLM, Gemma 3 12B is a solid choice for those looking to deploy advanced text generation capabilities on local hardware without the need for top-tier GPUs.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.56.799 GB7.3 GB7.8 GB
85%
Q8_0811.651 GB12.15 GB12.65 GB
98%

Context window & KV cache

Adds 1.25 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Gemma 3 12B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Easiest. Single command. OpenAI-compatible API on :11434.

Ollama home →
  1. 1

    Pull the model

    ollama pull gemma3:12b
  2. 2

    Chat

    ollama run gemma3:12b
  3. 3

    Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"gemma3:12b","messages":[{"role":"user","content":"Hi"}]}'

Community benchmarks

Real tokens/sec reports from people running Gemma 3 12B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Gemma 3 12Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

8.7 GB

7.3 GB weights + 0.9 GB KV

Aggregate tok/s

21

across 1 user

Per-user tok/s

21

12 B dense

✅ Fits in 24 GB VRAM with 15.3 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Gemma 3 12B?

Gemma 3 12B requires 7.3 GB VRAM minimum with Q4_K_M quantization. For full precision you need 12.15 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Gemma 3 12B?

To run Gemma 3 12B, you need a GPU with at least 7.3 GB of VRAM, but 12.2 GB is recommended for better performance, especially with higher quantization levels.

Is Gemma 3 12B good for coding?

Gemma 3 12B is well-suited for coding tasks due to its large context length of 32,768 tokens and high-quality training data, making it effective for code generation and completion.

Gemma 3 12B vs Llama 3.1 8B?

Gemma 3 12B has more parameters (12B vs 8B) and a longer context length (32,768 vs 2,048 tokens), which generally results in better performance for complex tasks, but requires more VRAM and computational resources.

Can I run Gemma 3 12B on a Mac?

Yes, Gemma 3 12B can run on Macs, especially those with M1 or M2 chips, which provide sufficient VRAM and computational power to handle the model efficiently.

How much VRAM does Gemma 3 12B need?

Gemma 3 12B requires between 7.3 GB and 12.2 GB of VRAM, depending on the quantization level used. Higher quantization levels reduce VRAM usage but may slightly impact performance.

Is Gemma 3 12B censored?

Gemma 3 12B is not inherently censored, but its responses are guided by the training data and any filters applied during inference. Users can implement additional content moderation as needed.

Is Gemma 3 12B commercial-use allowed?

Yes, Gemma 3 12B is licensed under the 'gemma' license, which allows for commercial use, provided you comply with the terms of the license.

Gemma 3 12B context length?

Gemma 3 12B has a context length of 32,768 tokens, which is significantly longer than many other models, allowing it to handle longer and more complex inputs.

Does Gemma 3 12B support function calling?

Gemma 3 12B supports function calling, enabling it to interact with external systems and APIs, enhancing its capabilities for various applications.

Gemma 3 12B quantization options?

Gemma 3 12B supports multiple quantization options, including INT8 and INT4, which reduce VRAM usage and improve inference speed while maintaining acceptable accuracy.

Can Gemma 3 12B run on CPU?

While Gemma 3 12B can technically run on a CPU, it is highly inefficient and slow. Using a GPU with sufficient VRAM is strongly recommended for practical performance.

Gemma 3 12B fine-tuning?

Gemma 3 12B can be fine-tuned on custom datasets to improve performance on specific tasks. Fine-tuning typically requires a powerful GPU and a significant amount of data.

Gemma 3 12B system requirements?

To run Gemma 3 12B, you need a system with at least 7.3 GB of VRAM, 32 GB of RAM, and a multi-core CPU. For optimal performance, a GPU with 12.2 GB of VRAM and an SSD are recommended.

Gemma 3 12B performance benchmark?

Gemma 3 12B can process around 50-100 tokens per second on a high-end GPU like the RTX 3090, depending on the quantization level and batch size.

Gemma 3 12B for RAG?

Gemma 3 12B is suitable for Retrieval-Augmented Generation (RAG) tasks due to its large context length and ability to handle complex queries, making it effective for integrating external knowledge sources.

Gemma 3 12B for agents?

Gemma 3 12B can be used to create intelligent agents due to its strong natural language understanding and generation capabilities, making it suitable for chatbots, virtual assistants, and other conversational applications.

Gemma 3 12B for coding vs general?

Gemma 3 12B performs well in both coding and general tasks, but its large context length and specialized training data make it particularly strong for coding-related tasks such as code generation and documentation.

Gemma 3 12B vs ChatGPT?

Gemma 3 12B has a larger context length (32,768 vs 2,048 tokens) and is specifically optimized for local deployment, while ChatGPT is a cloud-based service with a different set of capabilities and use cases.

Gemma 3 12B download size?

The download size of Gemma 3 12B varies depending on the quantization level. The full model is approximately 24 GB, but quantized versions can be as small as 6 GB.

Best quant for Gemma 3 12B?

The best quantization for Gemma 3 12B depends on your hardware. INT8 provides a good balance between performance and VRAM usage, while INT4 is more efficient but may have a slight drop in accuracy.