IBM · llm
Granite 3.3 2B
IBM's compact 2B model. Good at following instructions.
2B params · granite · apache-2.0 · 8K ctx · 1.94–3.01 GB VRAM
about·model card

Granite 3.3 2B is a language model developed by IBM with 2 billion parameters and a context length of 8192 tokens. It targets text generation tasks such as summarization, translation, and creative writing. Its architecture balances computational efficiency with output quality, making it a solid choice for users who need a capable model without the resource demands of larger ones. In its size class, Granite 3.3 2B is competitive with models of similar parameter counts. It needs only 1.94–3.01 GB of VRAM depending on quantization, which makes it usable on a wide range of hardware, including mid-range GPUs.

Ideal users for Granite 3.3 2B include developers, researchers, and hobbyists who require a versatile text generation tool but have limited computational resources. The model’s availability in quantized versions (Q4_K_M, Q8_0) further enhances its efficiency, making it suitable for deployment on lower-end hardware. For those looking to run a powerful yet manageable AI model locally, Granite 3.3 2B is a strong contender, offering a good balance between performance and resource consumption.

./quantizations·2 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   1.439 GB   1.94 GB      2.44 GB     85%
Q8_0          8     2.509 GB   3.01 GB      3.51 GB     98%
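
The VRAM and RAM figures in the table are consistent with the file size plus a fixed allowance (about 0.5 GB for VRAM and 1.0 GB for RAM). The sketch below reproduces the table under that assumption. The site's actual formula is not stated, and the names `VRAM_OVERHEAD_GB`, `RAM_OVERHEAD_GB`, and `estimate_memory` are hypothetical.

```python
# Assumed overheads, inferred from the table; not an official formula.
VRAM_OVERHEAD_GB = 0.5
RAM_OVERHEAD_GB = 1.0

# File sizes (GB) as listed in the quantization table.
FILE_SIZE_GB = {"Q4_K_M": 1.439, "Q8_0": 2.509}


def estimate_memory(file_size_gb: float) -> tuple[float, float]:
    """Return (vram_gb, ram_gb) as file size plus the assumed overheads."""
    vram = round(file_size_gb + VRAM_OVERHEAD_GB, 2)
    ram = round(file_size_gb + RAM_OVERHEAD_GB, 2)
    return vram, ram


for quant, size in FILE_SIZE_GB.items():
    vram, ram = estimate_memory(size)
    print(f"{quant}: {vram:.2f} GB VRAM, {ram:.2f} GB RAM")
```

Treat the overheads as a rough rule of thumb: real usage also depends on the runtime, batch size, and context length.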

Context window & KV cache

At the context setting shown, the KV cache adds 0.17 GB to VRAM.

Long chats and RAG inputs cost real memory, because the KV cache grows with context length.

Model native max: 8K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
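
As a rough illustration of how a KV-cache estimate scales with context, the sketch below uses the standard per-token accounting: keys plus values, for every layer and every KV head. The layer count, KV-head count, and head dimension are assumed illustrative values, not figures from this page, so check the model's `config.json` before relying on them. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for all layers at a given context."""
    # Factor 2 covers the key tensor and the value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed, illustrative architecture values (verify against config.json).
ASSUMED_LAYERS = 40
ASSUMED_KV_HEADS = 8
ASSUMED_HEAD_DIM = 64

for ctx in (2048, 4096, 8192):
    gb = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS, ASSUMED_HEAD_DIM, ctx) / 1e9
    print(f"{ctx:>5} tokens: {gb:.2f} GB")
```

The cache grows linearly with context, so doubling the context doubles the estimate. A runtime that quantizes the cache or lays out attention differently will land elsewhere, which is why the page quotes a ±30 % margin.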

How to run Granite 3.3 2B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Ollama. Easiest. Single command. OpenAI-compatible API on :11434.

  1. Pull the model

    ollama pull granite3.3:2b
  2. Chat

    ollama run granite3.3:2b
  3. Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"granite3.3:2b","messages":[{"role":"user","content":"Hi"}]}'
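
The same API call can be made from a script. The sketch below mirrors the curl example using only the Python standard library and assumes an Ollama server is already listening on `localhost:11434` with the model pulled. It sets `"stream": false` so the server returns a single JSON object. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl example


def build_chat_request(prompt: str, model: str = "granite3.3:2b") -> urllib.request.Request:
    """Build a non-streaming chat request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

Calling `chat("Hi")` returns the reply as a string. It raises `urllib.error.URLError` if the server is not running.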


Self-host serving plan

Want to host Granite 3.3 2B for many users, or run it on a card that’s technically too small? The figures below show the estimate for 1 user.

VRAM needed: 2.8 GB (1.9 GB weights + 0.4 GB KV)

Aggregate tok/s: 125 across 1 user

Per-user tok/s: 125

2 B dense

✅ Fits in 24 GB VRAM with 21.2 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
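
One way to read the scaling rule above is sketched below: each doubling of concurrent users adds about 70 % of the single-user rate to the aggregate, and growth stops at 8 users. This is an interpretation of the description, not the site's actual formula, and the names `aggregate_tps` and `per_user_tps` are hypothetical.

```python
import math

SCALING_PER_DOUBLING = 0.70  # each doubling adds ~70% of single-user tok/s
PLATEAU_USERS = 8            # beyond this, aggregate throughput stops growing


def aggregate_tps(single_user_tps: float, users: int) -> float:
    """Estimated total tokens/sec across all concurrent users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    doublings = math.log2(min(users, PLATEAU_USERS))
    return single_user_tps * (1 + SCALING_PER_DOUBLING * doublings)


def per_user_tps(single_user_tps: float, users: int) -> float:
    """Estimated tokens/sec seen by each individual user."""
    return aggregate_tps(single_user_tps, users) / users


for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} users: {aggregate_tps(125, n):6.1f} total, "
          f"{per_user_tps(125, n):6.1f} per user")
```

With the page's single-user figure of 125 tok/s, this model keeps aggregate throughput flat past 8 users, so each additional user only lowers the per-user rate.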


faq·common questions
how much VRAM do I need to run Granite 3.3 2B?

Granite 3.3 2B requires a minimum of 1.94 GB of VRAM with Q4_K_M quantization. With Q8_0 you need 3.01 GB.

which quant should I pick?

Q4_K_M is the smallest option at 1.94 GB of VRAM, with quality rated at 85% in the quantization table. Q8_0 is near-lossless (98%) if you have the 3.01 GB of headroom.
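
The choice can be expressed as a small selection rule: take the highest-quality variant whose VRAM requirement fits in the memory you have free. The sketch below uses the VRAM and quality figures from the quantization table. The function name `pick_quant` is hypothetical, and treating the KV cache as an extra on top of the listed VRAM is an assumption.

```python
from typing import Optional

# VRAM and quality figures copied from the quantization table on this page.
QUANTS = [
    {"name": "Q4_K_M", "vram_gb": 1.94, "quality_pct": 85},
    {"name": "Q8_0", "vram_gb": 3.01, "quality_pct": 98},
]


def pick_quant(free_vram_gb: float, kv_cache_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quant that fits, or None if nothing fits."""
    # Assumption: the KV cache is counted on top of the listed VRAM figure.
    fitting = [q for q in QUANTS if q["vram_gb"] + kv_cache_gb <= free_vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q["quality_pct"])["name"]


print(pick_quant(24.0))       # plenty of room
print(pick_quant(2.5))        # tight budget
print(pick_quant(3.1, 0.17))  # Q8_0 no longer fits once the KV cache is added
```

A return value of `None` means neither variant fits in the given VRAM budget.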

faq://ai-curated·20 entries
What GPU do I need to run Granite 3.3 2B?

To run Granite 3.3 2B, you need a GPU with at least 1.9 GB of VRAM for the lowest quantization level, up to 3.0 GB for higher levels.

Is Granite 3.3 2B good for coding?

Granite 3.3 2B can handle small coding tasks thanks to its instruction following and 8192-token context, but as a 2B model it is best suited to short snippets and simple edits rather than large codebases.

Granite 3.3 2B vs Llama 3.1 8B?

Granite 3.3 2B has fewer parameters (2B vs 8B) and needs less VRAM, but its 8192-token context is much shorter than the 128K tokens Llama 3.1 8B supports.

Can I run Granite 3.3 2B on a Mac?

Yes, you can run Granite 3.3 2B on a Mac. Ollama runs on macOS, and Apple Silicon Macs share unified memory between the CPU and GPU, so the 1.94–3.01 GB requirement comes out of system memory.

How much VRAM does Granite 3.3 2B need?

Granite 3.3 2B requires between 1.9 GB and 3.0 GB of VRAM, depending on the quantization level used.

Is Granite 3.3 2B censored?

Granite 3.3 2B is an instruction-tuned model, and this page lists no details about its safety tuning or refusal behavior. Check IBM's model card before relying on it for sensitive content.

Is Granite 3.3 2B commercial-use allowed?

Yes, Granite 3.3 2B is licensed under Apache-2.0, which allows for commercial use as long as you comply with the license terms.

Granite 3.3 2B context length?

The context length for Granite 3.3 2B is 8192 tokens, allowing it to process longer sequences of text effectively.

Does Granite 3.3 2B support function calling?

Yes, Granite 3.3 2B supports function calling, enabling it to interact with external systems and APIs.

Granite 3.3 2B quantization options?

This page lists two quantized variants of Granite 3.3 2B: Q4_K_M (about 4.5 bits per weight) and Q8_0 (8 bits). Lower-bit quantization reduces VRAM usage at some cost in quality.

Can Granite 3.3 2B run on CPU?

Yes, Granite 3.3 2B can run on a CPU, though performance will be significantly slower compared to running on a GPU.

Granite 3.3 2B fine-tuning?

Granite 3.3 2B can be fine-tuned on your own data to improve its performance on specific tasks or domains.

Granite 3.3 2B system requirements?

To run Granite 3.3 2B, you need a GPU with 1.94 GB to 3.01 GB of VRAM and 2.44 GB to 3.51 GB of system RAM, depending on the quantization, plus a modern CPU.

Granite 3.3 2B performance benchmark?

No community-measured numbers are available yet. The serving estimate on this page is about 125 tokens per second for a single user, and real performance varies with quantization and hardware.

Granite 3.3 2B for RAG?

Yes, Granite 3.3 2B can be used for Retrieval-Augmented Generation (RAG) to enhance its context and provide more accurate responses.

Granite 3.3 2B for agents?

Granite 3.3 2B is suitable for creating conversational agents due to its strong instruction-following abilities and support for function calling.

Granite 3.3 2B for coding vs general?

Granite 3.3 2B can be used for both coding and general tasks. Its 8192-token context lets it take in moderately long code snippets, but this page provides no benchmark comparing the two uses.

Granite 3.3 2B vs ChatGPT?

Granite 3.3 2B is a small (2B-parameter) open-weight model that you can run locally in about 2–3 GB of VRAM, whereas ChatGPT is a hosted service backed by much larger models. Expect weaker results from Granite 3.3 2B on complex tasks.

Granite 3.3 2B download size?

The download size for Granite 3.3 2B depends on the quantization level: 1.439 GB for Q4_K_M and 2.509 GB for Q8_0.

Best quant for Granite 3.3 2B?

The best quantization for Granite 3.3 2B depends on your hardware. Q8_0 (3.01 GB VRAM, 98% quality) is the better choice when it fits, while Q4_K_M (1.94 GB VRAM, 85% quality) suits tighter memory budgets.