Microsoft · llm
Phi-4
Microsoft's 14B parameter model. Punches well above its weight on reasoning.
14B params · phi3 · MIT · 16K ctx · 8.93–15.01 GB VRAM
about·model card

Phi-4 is a large language model (LLM) from Microsoft with 14 billion parameters and a context length of 16,384 tokens. It generates coherent, context-aware text and suits tasks such as content creation, chatbots, and natural language understanding. Its architecture is listed as phi3, and the 16K window leaves room for extended dialogues or analysis of longer documents.

In its size class, Phi-4 balances performance and efficiency. It will not match the largest models in raw capability, but it is a practical option for moderate hardware. Three quantized versions are available (Q4_K_M, Q5_K_M, and Q8_0), with VRAM requirements ranging from 8.93 GB to 15.01 GB. That makes Phi-4 a reasonable choice for developers and researchers who need a capable LLM for text generation and understanding but do not have high-end hardware.

./quantizations·3 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   8.431 GB   8.93 GB      9.43 GB     85%
Q5_K_M        5.5   9.876 GB   10.38 GB     10.88 GB    90%
Q8_0          8     14.51 GB   15.01 GB     15.51 GB    98%
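
The table can be turned into a quick fit check. The sketch below is a minimal illustration in Python: the VRAM figures are copied from the table, while the function name `best_quant_for` and the optional `reserve_gb` margin are assumptions introduced here, not part of the site.

```python
# VRAM requirements in GB, copied from the quantization table.
QUANTS = [
    ("Q4_K_M", 8.93),
    ("Q5_K_M", 10.38),
    ("Q8_0", 15.01),
]


def best_quant_for(vram_gb, reserve_gb=0.0):
    """Return the highest-quality quant whose VRAM need fits, or None.

    reserve_gb is a hypothetical safety margin (KV cache, other
    processes); it is not a figure from the page.
    """
    usable = vram_gb - reserve_gb
    fitting = [name for name, need in QUANTS if need <= usable]
    # QUANTS is ordered smallest to largest, so the last fit is the best.
    return fitting[-1] if fitting else None


print(best_quant_for(12))  # Q5_K_M
print(best_quant_for(24))  # Q8_0
print(best_quant_for(8))   # None
```

Passing a non-zero `reserve_gb` is one way to leave room for context; how much to reserve depends on the context length you plan to use.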

Context window & KV cache

Adds 1.25 GB to VRAM

Long chats and RAG inputs cost real memory.

Model native max: 16K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
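
For a rough sense of where KV-cache memory comes from, the usual back-of-the-envelope formula stores one key and one value vector per layer, per KV head, per token. The sketch below is an illustration only: the layer count, KV-head count, and head dimension are assumed values not stated on this page, and the site's own estimator may use different assumptions (for example, cache precision), so the output need not match its figure.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a given context length."""
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumed architecture values, for illustration only (not from this page).
ASSUMED = dict(layers=40, kv_heads=10, head_dim=128)

for ctx in (4096, 16384):
    gib = kv_cache_bytes(ctx, **ASSUMED) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB at 16-bit cache precision")
```

The cost grows linearly with context length, which is why long chats and RAG inputs are the main driver of extra VRAM beyond the weights.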

How to run Phi-4

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Ollama: easiest. Single command. OpenAI-compatible API on :11434.

  1. Pull the model

    ollama pull phi4

  2. Chat

    ollama run phi4

  3. Use as API

    curl http://localhost:11434/api/chat \
      -d '{"model":"phi4","messages":[{"role":"user","content":"Hi"}]}'
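
The same API call can be made from a script. The following is a minimal sketch using only the Python standard library and the same `/api/chat` endpoint as the curl command. The helper names `build_chat_request` and `chat` are hypothetical, and the sketch assumes an Ollama server is already running locally with `phi4` pulled. Setting `"stream": false` asks Ollama for a single JSON response instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl call


def build_chat_request(prompt, model="phi4"):
    """Build the JSON body for a single non-streaming chat turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of streamed chunks
    }


def chat(prompt, model="phi4", url=OLLAMA_URL):
    """Send the prompt to a local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


# Usage (requires a running Ollama server):
# print(chat("Hi"))
```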

Community benchmarks

Real tokens/sec reports from people running Phi-4 on actual hardware.

GPU            Median tok/s  Reports  Typical setup
RTX 4090       76.8          1        Q4_K_M · Ollama · Linux · 4K ctx
M2 Max         28.5          1        Q4_K_M · Ollama · macOS · 4K ctx
RTX 3060 12GB  24.1          1        Q4_K_M · Ollama · Windows · 4K ctx

Self-host serving plan

Want to host Phi-4 for many users? Or run it on a card that’s technically too small? The figures below are the planner's estimate for one user on a 24 GB card.

VRAM needed: 10.4 GB (8.9 GB weights + 0.9 GB KV)

Aggregate tok/s: 18 (across 1 user)

Per-user tok/s: 18 (14 B dense)

✅ Fits in 24 GB VRAM with 13.6 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
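
The scaling rule can be sketched numerically. The code below encodes one possible reading of it: each doubling of users adds 70 % of the single-user rate, and the total stops growing beyond 8 users. The function name `aggregate_tps` and its parameters are assumptions for illustration; the site's actual estimator is not shown on this page and may differ.

```python
import math


def aggregate_tps(single_user_tps, users, gain_per_doubling=0.70, plateau_users=8):
    """Estimate total tokens/sec across concurrent users.

    Assumed reading of the rule: every doubling of users adds
    gain_per_doubling * single_user_tps, and growth stops at plateau_users.
    """
    if users < 1:
        raise ValueError("users must be >= 1")
    effective = min(users, plateau_users)
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective))


# 18 tok/s is the single-user figure shown in the serving plan.
for n in (1, 2, 4, 8, 16):
    total = aggregate_tps(18, n)
    print(f"{n:>2} users: {total:5.1f} tok/s total, {total / n:4.1f} per user")
```

Under this reading, aggregate throughput rises with concurrency while the per-user rate falls, and past the plateau each added user only dilutes the per-user rate.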

faq·common questions
how much VRAM do I need to run Phi-4?

Phi-4 requires 8.93 GB VRAM minimum with Q4_K_M quantization. Q8_0, the highest-quality quant listed, needs 15.01 GB.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance: the quantization table rates it at 85% quality with the smallest footprint (8.93 GB VRAM). Q8_0, rated 98%, is near-lossless if you have the headroom.