Granite 3.3 2B is a large language model developed by IBM with 2 billion parameters and a listed context length of 8192 tokens. It targets text generation tasks such as summarization, translation, and creative writing. The architecture trades some capability for computational efficiency, so it suits users who need a capable model without the resource demands of larger ones, and its results are generally competitive with other models of similar parameter count. With the quantizations listed below it needs only 1.94–3.01 GB of VRAM, which puts it within reach of a wide range of hardware, including mid-range GPUs.
Granite 3.3 2B is a good fit for developers, researchers, and hobbyists who need a versatile text generation tool but have limited computational resources. Quantized versions (Q4_K_M, Q8_0) reduce the footprint further, making deployment on lower-end hardware practical. For running a manageable model locally, it offers a good balance between performance and resource consumption.
| Quantization | Bits | File Size | VRAM Needed | RAM Needed | Quality |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 1.439 GB | 1.94 GB | 2.44 GB | 85% |
| Q8_0 | 8 | 2.509 GB | 3.01 GB | 3.51 GB | 98% |
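The VRAM and RAM columns above are consistent with the file size plus a fixed overhead of about 0.5 GB and 1.0 GB respectively. That pattern is inferred from the two rows in the table, not from a documented formula, so the Python sketch below is only a rough rule of thumb. The function and constant names are made up for illustration.

```python
from typing import Optional, Tuple

# File sizes come from the table above; the overheads are inferred from it
# (an assumption, not a documented formula).
QUANT_FILE_SIZE_GB = {"Q4_K_M": 1.439, "Q8_0": 2.509}
VRAM_OVERHEAD_GB = 0.5
RAM_OVERHEAD_GB = 1.0


def estimate_memory_gb(quant: str) -> Tuple[float, float]:
    """Return (vram_gb, ram_gb) estimates for a listed quantization."""
    size = QUANT_FILE_SIZE_GB[quant]
    return round(size + VRAM_OVERHEAD_GB, 2), round(size + RAM_OVERHEAD_GB, 2)


def best_quant_for(vram_budget_gb: float) -> Optional[str]:
    """Pick the largest listed quantization whose VRAM estimate fits the budget."""
    fitting = [
        quant for quant in QUANT_FILE_SIZE_GB
        if estimate_memory_gb(quant)[0] <= vram_budget_gb
    ]
    return max(fitting, key=QUANT_FILE_SIZE_GB.get) if fitting else None
```

For example, `best_quant_for(2.0)` selects Q4_K_M, while a budget of 4 GB leaves room for Q8_0.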
Context window & KV cache
Adds 0.17 GB to VRAM. Long chats and RAG inputs cost real memory.
Model native max: 8K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
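To illustrate why longer contexts cost memory, the KV cache grows linearly with the number of tokens. The Python sketch below uses the usual per-token formula (keys plus values, for every layer and KV head). The layer count, KV-head count, and head dimension are illustrative assumptions, not values stated on this page, and real runtimes may allocate differently.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size.

    2 (keys and values) x layers x kv_heads x head_dim x bytes per value x tokens.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical architecture values, chosen only for illustration.
ASSUMED = dict(layers=40, kv_heads=8, head_dim=64)

for ctx in (2048, 8192):
    gb = kv_cache_bytes(ctx, **ASSUMED) / 1e9
    print(f"{ctx} tokens -> ~{gb:.2f} GB")
```

With these assumed values and 16-bit cache entries, the cache comes to about 0.17 GB at 2048 tokens and about 0.67 GB at 8192 tokens.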
How to run Granite 3.3 2B
Ollama is the easiest runtime: a single command, with an OpenAI-compatible API on :11434.
1. Pull the model
   ollama pull granite3.3:2b
2. Chat
   ollama run granite3.3:2b
3. Use as API
   curl http://localhost:11434/api/chat \
     -d '{"model":"granite3.3:2b","messages":[{"role":"user","content":"Hi"}]}'
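The same API call can be made from a script. The following is a minimal Python sketch using only the standard library, assuming an Ollama server is already running locally on port 11434 with the model pulled. The helper names `build_chat_payload` and `chat` are illustrative, and `"stream": False` is set so the server returns one complete JSON response instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl call


def build_chat_payload(prompt: str, model: str = "granite3.3:2b") -> dict:
    """Build the JSON body; stream=False asks for a single complete response."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt: str) -> str:
    """Send the prompt to a locally running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling `chat("Hi")` returns the assistant's reply as a string; it raises a connection error if the server is not running.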
Community benchmarks
Real tokens/sec reports from people running Granite 3.3 2B on actual hardware.
No community runs have been submitted for this model yet.
Self-host serving plan
Want to host Granite 3.3 2B for many users, or run it on a card that is technically too small? The estimate below is for a single user.
| Metric | Value | Detail |
|---|---|---|
| VRAM needed | 2.8 GB | 1.9 GB weights + 0.4 GB KV |
| Aggregate tok/s | 125 | across 1 user |
| Per-user tok/s | 125 | 2 B dense |
✅ Fits in 24 GB VRAM with 21.2 GB headroom. Pure-GPU inference — full speed.
Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
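One way to read the scaling rule above is that each doubling of users adds 70 % of the single-user rate to the aggregate, up to 8 users, after which the aggregate stays flat. The Python sketch below implements that reading. Interpolating between powers of two with a base-2 logarithm is an assumption made here, as are the function names.

```python
import math


def aggregate_tps(users: int, single_user_tps: float = 125.0,
                  gain_per_doubling: float = 0.70, plateau_users: int = 8) -> float:
    """Estimated total tokens/sec across all users under the sub-linear rule."""
    if users < 1:
        raise ValueError("users must be >= 1")
    # Doublings stop counting once the plateau is reached.
    doublings = math.log2(min(users, plateau_users))
    return single_user_tps * (1 + gain_per_doubling * doublings)


def per_user_tps(users: int, **kwargs) -> float:
    """Each user's share of the aggregate throughput."""
    return aggregate_tps(users, **kwargs) / users
```

Under this reading, one user gets the full 125 tok/s, two users share about 212.5 tok/s, and the aggregate levels off at about 387.5 tok/s from 8 users onward.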
How much VRAM do I need to run Granite 3.3 2B?
Granite 3.3 2B requires at least 1.94 GB of VRAM with Q4_K_M quantization. With Q8_0 you need 3.01 GB.
Which quant should I pick?
Q4_K_M has the smallest footprint: at about 4.5 bits per weight it is roughly 28% the size of FP16 weights, and the table above rates it at 85% quality. Q8_0 is near-lossless (98%) if you have the headroom.
What GPU do I need to run Granite 3.3 2B?
To run Granite 3.3 2B, you need a GPU with at least 1.9 GB of VRAM for the lowest quantization level, up to 3.0 GB for higher levels.
Is Granite 3.3 2B good for coding?
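Granite 3.3 2B can handle small coding tasks such as short functions and snippets, helped by its instruction-following capabilities and 8192-token context, but a 2B model will generally trail larger code-focused models on complex work.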
Granite 3.3 2B vs Llama 3.1 8B?
Granite 3.3 2B has fewer parameters (2B vs 8B) and needs far less VRAM, so it runs on much smaller hardware. It does not have the longer context, though: Llama 3.1 8B supports up to 128K tokens, compared with the 8192 tokens listed here for Granite 3.3 2B.
Can I run Granite 3.3 2B on a Mac?
Yes. On Apple Silicon Macs, runtimes such as Ollama run the model on the built-in GPU through Metal, and the model uses unified memory, so no separate graphics card or driver installation is needed.
How much VRAM does Granite 3.3 2B need?
Granite 3.3 2B requires between 1.9 GB and 3.0 GB of VRAM, depending on the quantization level used.
Is Granite 3.3 2B censored?
Granite 3.3 2B is an instruction-tuned model, and like most such models it may decline some requests. It is not an uncensored variant.
Is Granite 3.3 2B commercial-use allowed?
Yes, Granite 3.3 2B is licensed under Apache-2.0, which allows for commercial use as long as you comply with the license terms.
Granite 3.3 2B context length?
The context length for Granite 3.3 2B is 8192 tokens, allowing it to process longer sequences of text effectively.
Does Granite 3.3 2B support function calling?
Yes, Granite 3.3 2B supports function calling, enabling it to interact with external systems and APIs.
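As a rough illustration, a function-calling request through Ollama's chat API passes tool definitions in a `tools` list alongside the messages. The Python sketch below only builds such a request body. The `get_weather` tool is hypothetical, and whether a given runtime build exposes tool calling for this model is an assumption to verify.

```python
import json


def build_tool_call_payload(question: str) -> dict:
    """Build a chat request body that offers the model one callable tool."""
    # Hypothetical tool described with a JSON Schema; not a real API.
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "granite3.3:2b",
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "stream": False,
    }


print(json.dumps(build_tool_call_payload("What is the weather in Paris?"), indent=2))
```

If the model decides to use the tool, the reply message carries the requested call and its arguments instead of plain text; your code runs the function and sends the result back in a follow-up message.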
Granite 3.3 2B quantization options?
This page lists two quantizations for Granite 3.3 2B: Q4_K_M (about 4.5 bits per weight) and Q8_0 (8 bits per weight). Lower-bit quantization reduces VRAM usage and can improve inference speed, at some cost in quality.
Can Granite 3.3 2B run on CPU?
Yes, Granite 3.3 2B can run on a CPU, though performance will be significantly slower compared to running on a GPU.
Granite 3.3 2B fine-tuning?
Granite 3.3 2B can be fine-tuned on your own data to improve its performance on specific tasks or domains.
Granite 3.3 2B system requirements?
According to the table above, Granite 3.3 2B needs 1.94 GB to 3.01 GB of VRAM and 2.44 GB to 3.51 GB of RAM, depending on the quantization, along with a modern CPU.
Granite 3.3 2B performance benchmark?
This page estimates about 125 tokens per second for a single user with the model running entirely on a 24 GB GPU. No community benchmarks have been submitted yet, so expect real performance to vary with quantization and hardware.
Granite 3.3 2B for RAG?
Yes, Granite 3.3 2B can be used for Retrieval-Augmented Generation (RAG) to enhance its context and provide more accurate responses.
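With an 8192-token context, a RAG prompt has to budget how many retrieved passages it includes. The Python sketch below packs passages in ranked order until the budget runs out. The 4-characters-per-token estimate and the 1024 tokens reserved for the answer are crude assumptions; a real tokenizer would give accurate counts.

```python
from typing import List

CONTEXT_TOKENS = 8192       # context length listed on this page
RESERVED_FOR_ANSWER = 1024  # assumed budget left for the model's reply


def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pack_passages(question: str, passages: List[str]) -> str:
    """Add passages in ranked order until the next one would exceed the budget."""
    budget = CONTEXT_TOKENS - RESERVED_FOR_ANSWER - rough_token_count(question)
    chosen = []
    for passage in passages:
        cost = rough_token_count(passage)
        if cost > budget:
            break  # stop at the first passage that does not fit
        chosen.append(passage)
        budget -= cost
    context = "\n\n".join(chosen)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Stopping at the first passage that does not fit keeps the highest-ranked passages contiguous; skipping it and trying smaller ones is an equally reasonable design.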
Granite 3.3 2B for agents?
Granite 3.3 2B is suitable for creating conversational agents due to its strong instruction-following abilities and support for function calling.
Granite 3.3 2B for coding vs general?
Granite 3.3 2B is a general-purpose model that handles both coding and general tasks; the 8192-token context helps when a prompt includes longer code snippets.
Granite 3.3 2B vs ChatGPT?
Granite 3.3 2B is far smaller (2B parameters) than the models behind ChatGPT and runs locally in about 2 to 3 GB of VRAM, but expect noticeably weaker reasoning and language understanding.
Granite 3.3 2B download size?
The download size for Granite 3.3 2B depends on the quantization level: about 1.44 GB for Q4_K_M and about 2.51 GB for Q8_0.
Best quant for Granite 3.3 2B?
The best quantization for Granite 3.3 2B depends on your hardware. Q4_K_M fits in about 1.94 GB of VRAM; if you have roughly 3 GB available, Q8_0 gives near-lossless quality (98% vs 85% in the table above).