DeepSeek · llm
DeepSeek R1 Distill 8B
Compact reasoning model. Good reasoning capabilities in a small package.
8b params · llama · mit · 128K ctx · 5.08-8.45 GB vram
about·model card

DeepSeek R1 Distill 8B is an 8-billion-parameter language model based on the LLaMA architecture, designed for efficient local deployment. It is a compact reasoning model suited to chatbots, content creation, and natural language understanding tasks. With a context length of 131,072 tokens (128K), it can handle long-form generation and keep track of context across long passages.

In its size class, DeepSeek R1 Distill 8B balances performance and efficiency: it needs far fewer computational resources than larger models. The available quantizations (Q4_K_M, Q5_K_M, Q8_0) trade quality for memory, with VRAM requirements ranging from 5.08 GB to 8.45 GB, so the model runs on the mid-range GPUs found in modern laptops and desktops rather than requiring a high-end card. It is a good fit for developers, content creators, and researchers who want a capable but resource-efficient model.

./quantizations·3 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   4.583 GB   5.08 GB      5.58 GB     85%
Q5_K_M        5.5   5.339 GB   5.84 GB      6.34 GB     90%
Q8_0          8     7.954 GB   8.45 GB      8.95 GB     98%
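
To make the "which quant fits" question concrete, here is a minimal Python sketch that picks the highest-quality variant for a given amount of free VRAM. The VRAM and quality numbers are copied from the table above; the `Quant` type, the `best_quant` helper, and the optional `kv_cache_gb` allowance are hypothetical names introduced for illustration and are not part of the site.

```python
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    name: str
    vram_gb: float      # "VRAM Needed" column
    quality_pct: int    # "Quality" column


# Values copied from the quantization table above.
QUANTS = [
    Quant("Q4_K_M", 5.08, 85),
    Quant("Q5_K_M", 5.84, 90),
    Quant("Q8_0", 8.45, 98),
]


def best_quant(free_vram_gb: float, kv_cache_gb: float = 0.0) -> Optional[Quant]:
    """Return the highest-quality quant that fits, or None if none fit.

    kv_cache_gb is an assumed extra allowance for context memory.
    """
    fitting = [q for q in QUANTS if q.vram_gb + kv_cache_gb <= free_vram_gb]
    return max(fitting, key=lambda q: q.quality_pct, default=None)


if __name__ == "__main__":
    for vram in (4, 6, 8, 12):
        choice = best_quant(vram)
        print(f"{vram} GB free -> {choice.name if choice else 'no variant fits'}")
```

With these numbers, a 6 GB or 8 GB card lands on Q5_K_M, and Q8_0 only becomes an option above 8.45 GB.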

Context window & KV cache

At the currently selected context length, the KV cache adds 1.00 GB to VRAM.

Long chats and RAG inputs cost real memory: the KV cache grows with context length, so a 128K context needs far more VRAM than a 32K one.

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
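
As a rough illustration of why context length costs memory, the sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head). The defaults of 32 layers, 8 KV heads, head dimension 128, and 2-byte (FP16) cache entries are assumptions based on the Llama 3.1 8B configuration; they are not stated on this page, and the site's own estimator may use a different method.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int = 32,        # assumed: Llama 3.1 8B layer count
    n_kv_heads: int = 8,       # assumed: grouped-query attention KV heads
    head_dim: int = 128,       # assumed: per-head dimension
    bytes_per_value: int = 2,  # assumed: FP16 cache entries
) -> int:
    """Bytes needed to cache keys and values for context_tokens tokens."""
    # Factor of 2 covers one key vector and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"{ctx:>7} tokens -> {gib:.1f} GiB KV cache")
```

Under these assumptions the cache costs 128 KiB per token, so it scales linearly from 1 GiB at 8K tokens to 16 GiB at the full 128K window. Quantized KV caches or a different attention layout would change the result, which is consistent with the ±30 % caveat above.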

How to run DeepSeek R1 Distill 8B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Ollama: the easiest option. Single command. Serves an API on :11434, with native endpoints under /api and OpenAI-compatible endpoints under /v1.
  1. Pull the model

     ollama pull deepseek-r1:8b

  2. Chat

     ollama run deepseek-r1:8b

  3. Use as API

     curl http://localhost:11434/api/chat \
       -d '{"model":"deepseek-r1:8b","messages":[{"role":"user","content":"Hi"}]}'
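
The same API call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; it assumes an Ollama daemon is listening on localhost:11434 and that the model has already been pulled. The helper names `build_chat_request` and `chat` are hypothetical. Setting `"stream": false` asks Ollama for a single JSON response instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(
    prompt: str,
    model: str = "deepseek-r1:8b",
    url: str = OLLAMA_CHAT_URL,
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of streamed chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.loads(resp.read())
    return body["message"]["content"]
```

With the daemon running, `print(chat("Hi"))` should print the model's reply. Reasoning models can take a while to answer, so a real client would likely add a timeout and error handling.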

Community benchmarks

Real tokens/sec reports from people running DeepSeek R1 Distill 8B on actual hardware.

GPU       Median tok/s  Reports  Typical setup
RTX 4090  88.4          1        Q4_K_M · Ollama · Linux · 8K ctx
M2 Pro    24.5          1        Q4_K_M · Ollama · macOS · 8K ctx

Self-host serving plan

Want to host DeepSeek R1 Distill 8B for many users? Or run it on a card that’s technically too small? The figures below are the planner's estimate for a single user.

VRAM needed: 6.3 GB (5.1 GB weights + 0.7 GB KV)
Aggregate tok/s: 31 across 1 user
Per-user tok/s: 31 (8 B dense)

✅ Fits in 24 GB VRAM with 17.7 GB headroom. Pure-GPU inference, full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
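
One way to read the scaling rule above is that each doubling of concurrent users adds 70 % of the single-user rate to the aggregate, up to 8 users, after which the aggregate stays flat. The sketch below encodes that interpretation; it is an assumption about what the planner does, not its actual formula, and the function names are hypothetical.

```python
import math


def aggregate_tps(
    single_user_tps: float,
    users: int,
    gain_per_doubling: float = 0.7,  # "~70 %" from the text above
    plateau_users: int = 8,          # "until ~8" from the text above
) -> float:
    """Estimated total tokens/sec across all concurrent users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    effective = min(users, plateau_users)  # no gain past the plateau
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective))


def per_user_tps(single_user_tps: float, users: int) -> float:
    """Estimated tokens/sec each user sees when sharing the GPU."""
    return aggregate_tps(single_user_tps, users) / users


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        total = aggregate_tps(31, n)
        print(f"{n:>2} users: {total:5.1f} tok/s total, {total / n:4.1f} per user")
```

Starting from the 31 tok/s single-user figure, this reading gives about 52.7 tok/s aggregate at 2 users and about 96.1 tok/s at 8, with per-user speed falling as the aggregate flattens.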

bench://measured · hf-inference · 4/28/2026 · real numbers
Streaming-inference measurement, not an estimate.

Sustained throughput: 34.7 t/s
Time to first token: 0 ms
Generated: 150 tok in 4.3 s
End-to-end: 35 t/s
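
The end-to-end figure can be checked from the other two numbers. This small sketch assumes end-to-end throughput means tokens divided by total wall-clock time, and that a sustained rate would exclude the time to first token; the page does not define either metric, so both definitions are assumptions.

```python
def tokens_per_second(tokens: int, total_seconds: float, ttft_seconds: float = 0.0) -> float:
    """Tokens generated per second, optionally excluding time to first token."""
    decode_seconds = total_seconds - ttft_seconds
    if decode_seconds <= 0:
        raise ValueError("decode time must be positive")
    return tokens / decode_seconds


if __name__ == "__main__":
    # 150 tokens in 4.3 s, with the reported 0 ms time to first token.
    print(f"{tokens_per_second(150, 4.3):.1f} t/s")
```

This gives roughly 34.9 t/s, which rounds to the reported 35 t/s. The small gap to the 34.7 t/s sustained figure may come from the 4.3 s duration being rounded, though that is a guess.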


faq·common questions
how much VRAM do I need to run DeepSeek R1 Distill 8B?

DeepSeek R1 Distill 8B requires a minimum of 5.08 GB VRAM with Q4_K_M quantization. The highest-quality variant listed, Q8_0, needs 8.45 GB. Longer contexts add KV-cache memory on top of these figures.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance: it is rated at 85% quality in the quantization table while using about 4.5 bits per weight, roughly 28% of FP16's 16 bits. Q8_0 is near-lossless (98%) if you have the headroom.