~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/granite-3.0-1b-a400m
IBM · llm
Granite 3.0 1B-A400M
Tiny IBM MoE for edge and CPU inference. 1.3 B total, only 400 M active.
1.3b paramsgranitemoeapache-2.04K ctx1.271.27 GB vramMoE
about·model card

Granite 3.0 1B-A400M is IBM stab at edge-class MoE. Active param count of 400 M means it can run usefully on phones, microcontrollers with 4 GB RAM, or CPU-only setups. The MoE structure preserves quality from a much bigger dense equivalent.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·1 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.50.765 GB1.27 GB1.77 GB
85%

Context window & KV cache

Adds 0.09 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 4K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Granite 3.0 1B-A400M

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

GUI. Browse → download → chat. MLX on Apple Silicon.

LM Studio home →
  1. 1

    Open LM Studio

    Go to the 🔍 Search tab.

  2. 2

    Search for

    bartowski/granite-3.0-1b-a400m-instruct-GGUF
  3. 3

    Download

    Pick the Q4_K_M quant — best balance of size vs. quality.

  4. 4

    Chat

    Hit ▶ Load Model and start chatting. Toggle 'Local Server' to expose an OpenAI-compatible API on :1234.

Community benchmarks

Real tokens/sec reports from people running Granite 3.0 1B-A400M on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Granite 3.0 1B-A400Mfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

2.1 GB

1.3 GB weights + 0.3 GB KV

Aggregate tok/s

625

across 1 user

Per-user tok/s

625

MoE active params

✅ Fits in 24 GB VRAM with 21.9 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Granite 3.0 1B-A400M?

Granite 3.0 1B-A400M requires 1.27 GB VRAM minimum with Q4_K_M quantization. For full precision you need 1.27 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.

faq://ai-curated·20 entries
What GPU do I need to run Granite 3.0 1B-A400M?

To run Granite 3.0 1B-A400M, you need a GPU with at least 1.3 GB of VRAM. GPUs like the NVIDIA RTX 2060 or higher are recommended for optimal performance.

Is Granite 3.0 1B-A400M good for coding?

Granite 3.0 1B-A400M is suitable for coding tasks due to its efficient architecture and manageable size. It can provide useful code suggestions and completions, though larger models may offer more advanced features.

Granite 3.0 1B-A400M vs Llama 3.1 8B?

Granite 3.0 1B-A400M has 1.3 billion parameters and is optimized for edge devices, while Llama 3.1 8B has 8 billion parameters and offers more complex language understanding. Llama 3.1 8B requires more VRAM and computational resources.

Can I run Granite 3.0 1B-A400M on a Mac?

Yes, you can run Granite 3.0 1B-A400M on a Mac, provided your Mac has at least 1.3 GB of VRAM. macOS supports the necessary libraries for running this model.

How much VRAM does Granite 3.0 1B-A400M need?

Granite 3.0 1B-A400M requires 1.3 GB of VRAM, which is consistent across different quantization levels.

Is Granite 3.0 1B-A400M censored?

No, Granite 3.0 1B-A400M is not censored. It adheres to the Apache-2.0 license, which allows for open and unrestricted use.

Is Granite 3.0 1B-A400M commercial-use allowed?

Yes, Granite 3.0 1B-A400M is licensed under Apache-2.0, which permits commercial use without restrictions.

Granite 3.0 1B-A400M context length?

The context length for Granite 3.0 1B-A400M is 4096 tokens, allowing for longer input sequences compared to some smaller models.

Does Granite 3.0 1B-A400M support function calling?

Granite 3.0 1B-A400M does not natively support function calling, but you can implement custom logic to handle function calls in your application.

Granite 3.0 1B-A400M quantization options?

Granite 3.0 1B-A400M supports various quantization options, including INT8 and FP16, to reduce memory usage and improve inference speed.

Can Granite 3.0 1B-A400M run on CPU?

Yes, Granite 3.0 1B-A400M can run on CPU, though performance will be slower compared to GPU. It is optimized for efficient CPU inference.

Granite 3.0 1B-A400M fine-tuning?

Granite 3.0 1B-A400M can be fine-tuned using standard fine-tuning techniques. The model's smaller size makes it easier to train on limited datasets.

Granite 3.0 1B-A400M system requirements?

To run Granite 3.0 1B-A400M, you need at least 1.3 GB of VRAM, 8 GB of RAM, and a modern CPU. A GPU is recommended for faster inference.

Granite 3.0 1B-A400M performance benchmark?

Granite 3.0 1B-A400M can process around 50-70 tokens per second on a mid-range GPU, making it suitable for real-time applications on edge devices.

Granite 3.0 1B-A400M for RAG?

Granite 3.0 1B-A400M can be used for Retrieval-Augmented Generation (RAG) tasks, but its smaller size may limit the complexity of the generated text compared to larger models.

Granite 3.0 1B-A400M for agents?

Granite 3.0 1B-A400M is well-suited for creating conversational agents due to its efficient architecture and manageable resource requirements.

Granite 3.0 1B-A400M for coding vs general?

Granite 3.0 1B-A400M performs well for both coding and general tasks, though it may not match the specialized capabilities of models specifically trained for coding or general language understanding.

Granite 3.0 1B-A400M vs ChatGPT?

ChatGPT is a larger and more complex model with over 175 billion parameters, offering superior language understanding and generation. Granite 3.0 1B-A400M is more lightweight and suitable for edge devices and resource-constrained environments.

Granite 3.0 1B-A400M download size?

The download size for Granite 3.0 1B-A400M is approximately 2.5 GB, depending on the quantization level and format.

Best quant for Granite 3.0 1B-A400M?

INT8 quantization is often the best choice for Granite 3.0 1B-A400M, providing a good balance between memory efficiency and inference speed without significant loss in accuracy.