~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/gemma-3-4b-it
Google · llm
Gemma 3 4B
Balanced 4B model with strong reasoning. Great for iPhones.
4b paramsgemma3gemma32K ctx2.824.35 GB vram
about·model card

Gemma 3 4B is a large language model developed by Google, designed for advanced text generation tasks. With 4 billion parameters, it strikes a balance between performance and resource requirements, making it suitable for a wide range of applications such as content creation, chatbot development, and natural language understanding. The model's context length of 32,768 tokens allows it to handle longer sequences of text, which is particularly useful for generating coherent and contextually rich outputs. It is licensed under the gemma license, which is generally permissive for both research and commercial use.

In its size class, Gemma 3 4B punches above its weight, offering performance that rivals larger models while requiring significantly less computational resources. This makes it an efficient choice for users who need high-quality text generation without the need for top-tier hardware. The available quantizations, Q4_K_M and Q8_0, further enhance its efficiency, reducing the VRAM requirement to a range of 2.8–4.3 GB, which is manageable even on mid-range GPUs. Ideal users include developers, researchers, and hobbyists who are looking for a powerful yet accessible model for local deployment. Realistic hardware for running Gemma 3 4B includes modern GPUs with at least 4 GB of VRAM, making it a practical choice for a broad audience.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.52.319 GB2.82 GB3.32 GB
85%
Q8_083.847 GB4.35 GB4.85 GB
98%

Context window & KV cache

Adds 0.66 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Gemma 3 4B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Easiest. Single command. OpenAI-compatible API on :11434.

Ollama home →
  1. 1

    Pull the model

    ollama pull gemma3:4b
  2. 2

    Chat

    ollama run gemma3:4b
  3. 3

    Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"gemma3:4b","messages":[{"role":"user","content":"Hi"}]}'

Community benchmarks

Real tokens/sec reports from people running Gemma 3 4B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Gemma 3 4Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

3.8 GB

2.8 GB weights + 0.5 GB KV

Aggregate tok/s

63

across 1 user

Per-user tok/s

63

4 B dense

✅ Fits in 24 GB VRAM with 20.2 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Gemma 3 4B?

Gemma 3 4B requires 2.82 GB VRAM minimum with Q4_K_M quantization. For full precision you need 4.35 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Gemma 3 4B?

To run Gemma 3 4B, you need a GPU with at least 2.8 GB of VRAM for the lowest quantization level, up to 4.3 GB for higher quantizations.

Is Gemma 3 4B good for coding?

Gemma 3 4B is well-suited for coding tasks due to its strong reasoning capabilities and large context length of 32,768 tokens.

Gemma 3 4B vs Llama 3.1 8B?

Gemma 3 4B has fewer parameters (4B vs 8B) but offers a larger context length (32,768 tokens) and better performance on mobile devices like iPhones.

Can I run Gemma 3 4B on a Mac?

Yes, you can run Gemma 3 4B on a Mac, especially if your Mac has a compatible GPU with at least 2.8 GB of VRAM.

How much VRAM does Gemma 3 4B need?

Gemma 3 4B requires between 2.8 GB and 4.3 GB of VRAM, depending on the quantization level used.

Is Gemma 3 4B censored?

Gemma 3 4B is not inherently censored, but its responses may be filtered based on the implementation and settings used.

Is Gemma 3 4B commercial-use allowed?

Gemma 3 4B is licensed under the 'gemma' license, which allows for commercial use, provided you comply with the terms of the license.

Gemma 3 4B context length?

Gemma 3 4B has a context length of 32,768 tokens, allowing it to handle very long sequences of text.

Does Gemma 3 4B support function calling?

Gemma 3 4B supports function calling, enabling it to interact with external systems and APIs effectively.

Gemma 3 4B quantization options?

Gemma 3 4B supports various quantization options, including 4-bit, 8-bit, and 16-bit, to optimize performance and memory usage.

Can Gemma 3 4B run on CPU?

While Gemma 3 4B can run on a CPU, it will be significantly slower compared to running on a GPU with sufficient VRAM.

Gemma 3 4B fine-tuning?

Gemma 3 4B can be fine-tuned for specific tasks using frameworks like Hugging Face Transformers, but it requires a powerful GPU and sufficient VRAM.

Gemma 3 4B system requirements?

Gemma 3 4B requires a GPU with at least 2.8 GB of VRAM, 16 GB of RAM, and a modern CPU for optimal performance.

Gemma 3 4B performance benchmark?

Gemma 3 4B processes around 50-100 tokens per second on a high-end GPU, with performance varying based on the quantization level and hardware.

Gemma 3 4B for RAG?

Gemma 3 4B is suitable for Retrieval-Augmented Generation (RAG) tasks due to its strong reasoning and large context length.

Gemma 3 4B for agents?

Gemma 3 4B can be used to power conversational agents and chatbots, leveraging its strong reasoning and context handling capabilities.

Gemma 3 4B for coding vs general?

Gemma 3 4B performs well in both coding and general tasks, but its large context length makes it particularly effective for coding and technical content.

Gemma 3 4B vs ChatGPT?

Gemma 3 4B has a larger context length (32,768 tokens) and is more lightweight (4B parameters), making it better suited for mobile and resource-constrained environments compared to ChatGPT.

Gemma 3 4B download size?

The download size of Gemma 3 4B varies based on the quantization level, ranging from approximately 1.5 GB for 4-bit quantization to 8 GB for full precision.

Best quant for Gemma 3 4B?

The best quantization for Gemma 3 4B depends on your use case, but 8-bit quantization offers a good balance between performance and memory efficiency.