Meta · llm
Llama 3.2 1B Instruct
Ultra-compact 1B model. Runs on virtually any device including smartphones.
1.24B params · llama · llama3.2 · 128K ctx · 1.25–2.81 GB VRAM
about·model card

Llama 3.2 1B Instruct by Meta is a lightweight language model designed for text generation. With 1.24 billion parameters, it trades some capability for low resource use, which suits applications such as chatbots, content creation, and summarization. Its context length of 131,072 tokens (128K) lets it handle long-form text and keep long inputs in view while generating. It is released under the llama3.2 license, which permits both commercial and non-commercial use subject to the license terms.

Compared to other models in its size class, Llama 3.2 1B Instruct performs well for its computational cost, making it an efficient choice for users who do not have access to high-end hardware. The available quantizations (Q4_K_M, Q8_0, FP16) reduce its footprint further: the Q4_K_M variant runs on devices with as little as 1.25 GB of VRAM. This makes it a good option for developers and hobbyists working on laptops or mid-range desktops, and for anyone who wants to deploy a capable language model locally without expensive cloud services or powerful GPUs.

./quantizations·3 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   0.752 GB   1.25 GB      1.75 GB     85%
Q8_0          8     1.23 GB    1.73 GB      2.23 GB     98%
FP16          16    2.309 GB   2.81 GB      3.31 GB     100%
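
To make the "which quant fits" question concrete, here is a minimal Python sketch that picks the highest-quality variant for a given amount of free VRAM. The VRAM and quality figures are copied from the table above; the function name `best_quant` and the optional `kv_cache_gb` parameter are hypothetical, added only for illustration, and the page does not say whether its "VRAM Needed" column already includes any KV cache.

```python
# Figures copied from the quantization table: (name, VRAM needed in GB, quality %).
QUANTS = [
    ("Q4_K_M", 1.25, 85),
    ("Q8_0", 1.73, 98),
    ("FP16", 2.81, 100),
]


def best_quant(free_vram_gb, kv_cache_gb=0.0):
    """Return the highest-quality quant that fits in free_vram_gb, or None.

    kv_cache_gb is an assumed extra allowance for the context window,
    added on top of the table's VRAM figure.
    """
    fitting = [q for q in QUANTS if q[1] + kv_cache_gb <= free_vram_gb]
    if not fitting:
        return None
    # Among the variants that fit, prefer the one with the highest quality score.
    return max(fitting, key=lambda q: q[2])[0]


print(best_quant(2.0))        # 2 GB free, no extra KV allowance
print(best_quant(2.0, 0.3))   # same card, reserving 0.3 GB for the KV cache
```

Reserving room for the KV cache can push the choice down a tier, which is why the context setting matters as much as the weights.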

Context window & KV cache

Adds 0.17 GB to VRAM

Long chats and RAG inputs cost real memory. The KV cache grows linearly with context length, so a 128K context needs about four times the cache of a 32K one.

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
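
A rough way to reproduce a KV-cache estimate like this is sketched below in Python. The architecture defaults (16 layers, 8 key/value heads, head dimension 64) are assumptions based on the model's published configuration, not values stated on this page, and the 2-byte element size assumes an unquantized FP16 cache. Real runtimes may quantize or page the cache, so treat the result as an upper-bound sketch rather than the page's own formula.

```python
def kv_cache_bytes(context_tokens, n_layers=16, n_kv_heads=8, head_dim=64,
                   bytes_per_value=2):
    """Estimate KV-cache size in bytes for one sequence.

    Each layer stores two tensors (keys and values), and each holds
    n_kv_heads * head_dim values per token. All defaults are assumptions.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens: {gib:.3f} GiB")
```

Under these assumptions the cache costs 32 KiB per token, so 32K tokens take 1 GiB and the full 128K window takes 4 GiB, more than the FP16 weights themselves.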

How to run Llama 3.2 1B Instruct

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Ollama: the easiest option, one command per step. Ollama serves its native API at /api/chat, plus an OpenAI-compatible API under /v1, on port 11434.

  1. Pull the model

    ollama pull llama3.2:1b

  2. Chat

    ollama run llama3.2:1b

  3. Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"Hi"}]}'
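
The same API call can be made from Python using only the standard library. This is a minimal sketch that assumes an Ollama server is already running locally with the model pulled; the helper names `build_chat_request` and `chat` are hypothetical. It sets `"stream": false` so the server returns a single JSON object instead of the streamed chunks the curl call above would produce by default.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl call


def build_chat_request(prompt, model="llama3.2:1b"):
    """Build the POST request for Ollama's native chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON reply rather than streamed chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, `chat("Hi")` returns the assistant's reply as a string. Keeping request construction separate from the network call makes the payload easy to inspect without a live server.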

Community benchmarks

Real tokens/sec reports from people running Llama 3.2 1B Instruct on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Llama 3.2 1B Instruct for many users, or run it on a card that’s technically too small? The estimates below are for a single user on a 24 GB GPU.

VRAM needed: 2.0 GB (1.3 GB weights + 0.3 GB KV)
Aggregate tok/s: 202 (across 1 user)
Per-user tok/s: 202 (1.24 B dense)

✅ Fits in 24 GB VRAM with 22.0 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
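
One way to read the scaling rule above is sketched here in Python. It assumes that each doubling of concurrent users adds 70 % of the single-user rate to the aggregate, and that the curve is flat beyond 8 users. The function name, its parameters, and the log2 interpolation for user counts that are not powers of two are illustrative assumptions, not the page's actual formula.

```python
import math


def aggregate_tps(users, single_user_tps, gain_per_doubling=0.70,
                  plateau_users=8):
    """Estimate aggregate tokens/sec for a number of concurrent users.

    Assumed model: every doubling of users adds gain_per_doubling times the
    single-user rate, up to plateau_users, after which throughput is flat.
    """
    if users < 1:
        raise ValueError("users must be at least 1")
    doublings = math.log2(min(users, plateau_users))
    return single_user_tps * (1 + gain_per_doubling * doublings)


single = 202  # single-user tok/s from the serving plan above
for n in (1, 2, 4, 8, 16):
    total = aggregate_tps(n, single)
    print(f"{n:>2} users: {total:6.1f} tok/s aggregate, {total / n:6.1f} per user")
```

Under this reading the aggregate keeps rising until the plateau while the per-user rate falls with every added user, which is the trade-off the estimate describes.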

bench://measured·hf-inference · 4/28/2026 · real numbers
streaming-inference measurement, not an estimate.
31.6 t/s sustained throughput
3332 ms time to first token
130 tok generated in 4.1 s
32 t/s end-to-end


faq·common questions
how much VRAM do I need to run Llama 3.2 1B Instruct?

Llama 3.2 1B Instruct requires 1.25 GB VRAM minimum with Q4_K_M quantization. For full precision you need 2.81 GB.

which quant should I pick?

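Q4_K_M is the best quality/VRAM balance: per the quantization table it scores 85% of FP16 quality with a file about a third the size (0.752 GB vs 2.309 GB) and 1.25 GB of VRAM instead of 2.81 GB. Q8_0 is near-lossless (98%) if you have the headroom.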