~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/gemma-2-2b-it
Google · llm
Gemma 2 2B
Google's compact 2.6B model. Efficient and capable for mobile use.
2.6b paramsgemma2gemma8K ctx2.093.09 GB vram
about·model card

Gemma 2 2B is a large language model developed by Google, boasting 2.6 billion parameters and designed for efficient local deployment. This model excels in text generation tasks, including but not limited to, creative writing, summarization, and conversational responses. With a context length of 8192 tokens, it can handle longer sequences of text, making it suitable for applications that require understanding and generating coherent, context-rich content. The model is licensed under the gemma license, ensuring accessibility while maintaining certain usage guidelines.

Compared to other models in its size class, Gemma 2 2B punches well above its weight. It offers a balance between performance and resource efficiency, making it a strong contender for those who need robust text generation capabilities without the need for high-end hardware. The available quantizations (Q4_K_M, Q8_0) further enhance its efficiency, allowing it to run smoothly on systems with as little as 2.1 GB of VRAM. This makes it an ideal choice for developers, hobbyists, and small teams looking to deploy a powerful yet manageable language model on mid-range GPUs or even some CPUs. Realistically, anyone with a modern laptop or desktop equipped with at least 4 GB of RAM and a decent GPU can leverage Gemma 2 2B for a wide range of text-based projects.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.51.591 GB2.09 GB2.59 GB
85%
Q8_082.593 GB3.09 GB3.59 GB
98%

Context window & KV cache

Adds 0.66 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 8K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Gemma 2 2B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Easiest. Single command. OpenAI-compatible API on :11434.

Ollama home →
  1. 1

    Pull the model

    ollama pull gemma2:2b
  2. 2

    Chat

    ollama run gemma2:2b
  3. 3

    Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"gemma2:2b","messages":[{"role":"user","content":"Hi"}]}'

Community benchmarks

Real tokens/sec reports from people running Gemma 2 2B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Gemma 2 2Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

3.0 GB

2.1 GB weights + 0.4 GB KV

Aggregate tok/s

96

across 1 user

Per-user tok/s

96

2.6 B dense

✅ Fits in 24 GB VRAM with 21.0 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Gemma 2 2B?

Gemma 2 2B requires 2.09 GB VRAM minimum with Q4_K_M quantization. For full precision you need 3.09 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Gemma 2 2B?

To run Gemma 2 2B, you need a GPU with at least 2.1 GB of VRAM, but 3.1 GB is recommended for better performance, especially with higher quantization levels.

Is Gemma 2 2B good for coding?

Gemma 2 2B is suitable for coding tasks due to its efficient architecture and 8192 context length, which allows it to understand and generate longer code snippets effectively.

Gemma 2 2B vs Llama 3.1 8B?

Gemma 2 2B has fewer parameters (2.6B vs 8B) and requires less VRAM, making it more suitable for mobile and resource-constrained environments, while Llama 3.1 8B offers better performance in more complex tasks.

Can I run Gemma 2 2B on a Mac?

Yes, you can run Gemma 2 2B on a Mac, provided your Mac has a compatible GPU with at least 2.1 GB of VRAM and the necessary drivers installed.

How much VRAM does Gemma 2 2B need?

Gemma 2 2B requires between 2.1 GB and 3.1 GB of VRAM, depending on the quantization level used.

Is Gemma 2 2B censored?

Gemma 2 2B is not inherently censored, but its responses can be influenced by the training data and any filters or guidelines applied during deployment.

Is Gemma 2 2B commercial-use allowed?

Yes, Gemma 2 2B can be used commercially, but you should review the specific terms of the 'gemma' license to ensure compliance.

Gemma 2 2B context length?

Gemma 2 2B has a context length of 8192 tokens, allowing it to handle longer sequences of text effectively.

Does Gemma 2 2B support function calling?

Gemma 2 2B supports function calling, enabling it to interact with external systems and APIs, enhancing its utility in various applications.

Gemma 2 2B quantization options?

Gemma 2 2B supports multiple quantization options, including 4-bit, 8-bit, and 16-bit, which can reduce VRAM usage and improve inference speed.

Can Gemma 2 2B run on CPU?

Yes, Gemma 2 2B can run on a CPU, but it will be significantly slower compared to running on a GPU with sufficient VRAM.

Gemma 2 2B fine-tuning?

Gemma 2 2B can be fine-tuned for specific tasks using frameworks like Hugging Face Transformers, which provide tools and libraries for custom training.

Gemma 2 2B system requirements?

To run Gemma 2 2B, you need a system with at least 8 GB of RAM, a compatible GPU with 2.1 GB to 3.1 GB of VRAM, and a modern CPU. Additional storage is required for the model files.

Gemma 2 2B performance benchmark?

Gemma 2 2B can process around 50-100 tokens per second on a mid-range GPU, with performance varying based on quantization and system configuration.

Gemma 2 2B for RAG?

Gemma 2 2B can be used for Retrieval-Augmented Generation (RAG) tasks, leveraging its 8192 context length to incorporate retrieved information effectively.

Gemma 2 2B for agents?

Gemma 2 2B is well-suited for creating conversational agents due to its efficient size and ability to handle long contexts, making it ideal for chatbots and virtual assistants.

Gemma 2 2B for coding vs general?

Gemma 2 2B performs well in both coding and general tasks, but its 8192 context length and efficient architecture make it particularly strong for coding, where understanding longer sequences is crucial.

Gemma 2 2B vs ChatGPT?

Gemma 2 2B is smaller (2.6B parameters) and more resource-efficient compared to ChatGPT, which has more parameters and requires more VRAM, but may offer superior performance in complex tasks.

Gemma 2 2B download size?

The download size of Gemma 2 2B varies depending on the quantization level, ranging from approximately 1.3 GB (4-bit) to 5.2 GB (16-bit).

Best quant for Gemma 2 2B?

The best quantization for Gemma 2 2B depends on your hardware and performance needs. 8-bit quantization offers a good balance between VRAM efficiency and inference speed, while 4-bit is optimal for very low VRAM systems.