~/runthismodel
daemon okbuild 5a3c91d00:00:00Z
./models/browse/qwen2.5-coder-7b-instruct
Alibaba · code
Qwen 2.5 Coder 7B
Strong 7B code model rivaling larger coding models. Excellent for local development.
7.6b paramsqwen2apache-2.032K ctx4.868.04 GB vram
about·model card

Qwen 2.5 Coder 7B by Alibaba is a powerful text generation model specifically tailored for coding tasks. With 7.6 billion parameters, this model excels at generating high-quality code snippets, completing code blocks, and providing useful suggestions for developers. Its context length of 32768 tokens allows it to handle complex and lengthy programming tasks, making it suitable for both small scripts and large-scale projects. The model is licensed under Apache-2.0, ensuring it is freely available for both personal and commercial use.

In its size class, Qwen 2.5 Coder 7B holds its own, offering a balance between performance and resource efficiency. It is capable of delivering results that are competitive with larger models while requiring less computational power. This makes it an attractive option for developers who need robust code generation capabilities without the need for high-end hardware. The model supports quantization options like Q4_K_M and Q8_0, which further enhance its efficiency, allowing it to run smoothly on systems with 4.9 to 8.0 GB of VRAM. Ideal users include software developers, data scientists, and anyone involved in coding who wants a reliable tool to assist with their work. Realistic hardware requirements include mid-range GPUs, making it accessible for a wide range of users from hobbyists to professionals.

probe://hardware·which quants fit your rig
we auto-detect via WebGL/WebGPU. select manually if your GPU isn't recognized.
./quantizations·2 variants
QuantizationBitsFile SizeVRAM NeededRAM NeededQuality
Q4_K_M4.54.361 GB4.86 GB5.36 GB
85%
Q8_087.542 GB8.04 GB8.54 GB
98%

Context window & KV cache

Adds 1.00 GB to VRAM

Long chats and RAG inputs cost real memory. Drag to see how 32K vs 128K context shifts your grade.

Model native max: 32K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.

How to run Qwen 2.5 Coder 7B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Easiest. Single command. OpenAI-compatible API on :11434.

Ollama home →
  1. 1

    Pull the model

    ollama pull qwen2.5-coder:7b
  2. 2

    Chat

    ollama run qwen2.5-coder:7b
  3. 3

    Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"qwen2.5-coder:7b","messages":[{"role":"user","content":"Hi"}]}'

Community benchmarks

Real tokens/sec reports from people running Qwen 2.5 Coder 7B on actual hardware.

No community runs yet for this model. Be the first to submit your numbers.

Self-host serving plan

Want to host Qwen 2.5 Coder 7Bfor many users? Or run it on a card that’s technically too small? Slide the knobs.

VRAM needed

6.0 GB

4.9 GB weights + 0.7 GB KV

Aggregate tok/s

33

across 1 user

Per-user tok/s

33

7.6 B dense

✅ Fits in 24 GB VRAM with 18.0 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.

bench://measured·hf-inference · 4/28/2026real numbers
streaming-inference measurement, not an estimate.
21.0t/s
sustained throughput
2535ms
time to first token
84tok
generated in 4.0s
21t/s
end-to-end

See It In Action

Real model outputs generated via RunThisModel.com — watch responses stream in real time.

Llama 3.3 70B responding...

Outputs generated by real AI models via RunThisModel.com. Generation speed shown is from cloud inference. Local speeds vary by hardware — check your device.

faq·common questions
how much VRAM do I need to run Qwen 2.5 Coder 7B?

Qwen 2.5 Coder 7B requires 4.86 GB VRAM minimum with Q4_K_M quantization. For full precision you need 8.04 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance — ~92% of FP16 quality at ~25% the footprint. Q8_0 is near-lossless if you have the headroom.