~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/granite-3.0-3b-a800m
IBM · llm
Granite 3.0 3B-A800M
IBM enterprise-grade small MoE. 3.4 B total, 800 M active. Long context, function-calling.
3.4b paramsgranitemoeapache-2.04K ctx2.422.42 GB vramMoE
about·model card

Granite 3.0 3B-A800M is the bigger Granite MoE. Still small enough for laptop / SBC inference, but the active-parameter count of 800 M gives it noticeably better instruction-following than the 1B-A400M sibling. IBM positions it for enterprise use cases — function calling, RAG, structured output.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·1 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.51.918 GB2.42 GB2.92 GB
85%

Context window & KV cache

Adds 0.33 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 4K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Granite 3.0 3B-A800M

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/granite-3.0-3b-a800m-instruct-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Granite 3.0 3B-A800M on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Granite 3.0 3B-A800Mfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

3.4 GB

2.4 GB weights + 0.5 GB KV

Aggregate tok/s

313

across 1 user

Per-user tok/s

313

MoE active params

✅ Fits in 24 GB VRAM with 20.6 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Granite 3.0 3B-A800M?

Granite 3.0 3B-A800M requires 2.42 GB VRAM minimum with Q4_K_M quantization. For full precision you need 2.42 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Granite 3.0 3B-A800M?

To run Granite 3.0 3B-A800M, you need a GPU with at least 2.4 GB of VRAM, such as an NVIDIA RTX 2060 or better.

Is Granite 3.0 3B-A800M good for coding?

Yes, Granite 3.0 3B-A800M is well-suited for coding tasks due to its long context length of 4096 tokens and support for function calling.

Granite 3.0 3B-A800M vs Llama 3.1 8B?

Granite 3.0 3B-A800M has fewer parameters (3.4B vs 8B) but is optimized for efficiency and supports function calling, making it more suitable for resource-constrained environments.

Can I run Granite 3.0 3B-A800M on a Mac?

Yes, you can run Granite 3.0 3B-A800M on a Mac with a compatible GPU and the necessary drivers installed.

How much VRAM does Granite 3.0 3B-A800M need?

Granite 3.0 3B-A800M requires 2.4 GB of VRAM, which can vary slightly depending on the quantization level used.

Is Granite 3.0 3B-A800M censored?

No, Granite 3.0 3B-A800M is not censored, but it adheres to ethical guidelines and may filter out harmful content.

Is Granite 3.0 3B-A800M commercial-use allowed?

Yes, Granite 3.0 3B-A800M is licensed under the Apache-2.0 license, allowing for both commercial and non-commercial use.

Granite 3.0 3B-A800M context length?

The context length for Granite 3.0 3B-A800M is 4096 tokens, which is suitable for handling long and complex inputs.

Does Granite 3.0 3B-A800M support function calling?

Yes, Granite 3.0 3B-A800M supports function calling, enabling it to interact with external systems and APIs effectively.

Granite 3.0 3B-A800M quantization options?

Granite 3.0 3B-A800M supports various quantization levels, including INT8 and FP16, to optimize performance and reduce VRAM usage.

Can Granite 3.0 3B-A800M run on CPU?

While Granite 3.0 3B-A800M can run on a CPU, it will be significantly slower compared to running on a GPU due to the model's size and complexity.

Granite 3.0 3B-A800M fine-tuning?

Yes, Granite 3.0 3B-A800M can be fine-tuned for specific tasks using a dataset and a training framework like Hugging Face Transformers.

Granite 3.0 3B-A800M system requirements?

To run Granite 3.0 3B-A800M, you need a system with at least 16 GB of RAM, a compatible GPU with 2.4 GB VRAM, and a 64-bit operating system.

Granite 3.0 3B-A800M performance benchmark?

Performance benchmarks show that Granite 3.0 3B-A800M can process around 50-70 tokens per second on a mid-range GPU like the NVIDIA RTX 3060.

Granite 3.0 3B-A800M for RAG?

Yes, Granite 3.0 3B-A800M is suitable for Retrieval-Augmented Generation (RAG) tasks due to its long context length and ability to integrate with external data sources.

Granite 3.0 3B-A800M for agents?

Granite 3.0 3B-A800M is well-suited for building conversational agents and chatbots, especially those requiring long context and function calling capabilities.

Granite 3.0 3B-A800M for coding vs general?

Granite 3.0 3B-A800M performs well in both coding and general tasks, but it excels in coding due to its support for function calling and long context lengths.

Granite 3.0 3B-A800M vs ChatGPT?

Compared to ChatGPT, Granite 3.0 3B-A800M has fewer parameters but is optimized for efficiency and supports function calling, making it more suitable for resource-constrained environments.

Granite 3.0 3B-A800M download size?

The download size for Granite 3.0 3B-A800M is approximately 12 GB, which can vary slightly depending on the quantization level.

Best quant for Granite 3.0 3B-A800M?

The best quantization for Granite 3.0 3B-A800M depends on your hardware, but FP16 is generally recommended for a balance between performance and accuracy.