Microsoft · llm
Phi-3.5 Mini 3.8B
Tiny but capable 3.8B model. Runs on almost any hardware including phones.
3.8B params · phi3 · MIT · 128K ctx · 2.73 to 4.28 GB VRAM
about·model card

Phi-3.5 Mini 3.8B is a compact yet powerful language model developed by Microsoft, designed for efficient local deployment. With 3.8 billion parameters, this model strikes a balance between performance and resource consumption, making it suitable for a wide range of text generation tasks such as summarization, translation, and creative writing. Its architecture, based on the phi3 framework, allows it to handle context lengths up to 131,072 tokens, which is significantly larger than many models in its size class, enabling it to maintain coherence over long texts.

Compared to other models of similar size, Phi-3.5 Mini 3.8B is light on memory: it needs only 2.7 to 4.3 GB of VRAM depending on the quantization, which puts it within reach of mid-range GPUs. That makes it a good choice for developers and enthusiasts who want solid text generation without high-end hardware. The model is available in three quantized versions (Q4_K_M, Q5_K_M, Q8_0) that trade some quality for lower memory use. It suits projects that process a lot of text on limited compute, such as small-scale applications, personal projects, or environments with strict resource constraints.

./quantizations·3 variants
Quantization  Bits  File Size  VRAM Needed  RAM Needed  Quality
Q4_K_M        4.5   2.229 GB   2.73 GB      3.23 GB     85%
Q5_K_M        5.5   2.622 GB   3.12 GB      3.62 GB     90%
Q8_0          8     3.782 GB   4.28 GB      4.78 GB     98%
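
Choosing a variant from these figures can be sketched in Python. This is a minimal sketch, not the site's hardware-detection code: the `QUANTS` list copies the VRAM and quality columns from the quantization table, while `pick_quant` and its `available_vram_gb` parameter are hypothetical names introduced here for illustration. It returns the highest-quality variant whose VRAM requirement fits.

```python
from typing import Optional

# (name, VRAM needed in GB, quality %) copied from the quantization table.
QUANTS = [
    ("Q4_K_M", 2.73, 85),
    ("Q5_K_M", 3.12, 90),
    ("Q8_0", 4.28, 98),
]


def pick_quant(available_vram_gb: float) -> Optional[str]:
    """Return the highest-quality quant that fits, or None if none fit."""
    fitting = [q for q in QUANTS if q[1] <= available_vram_gb]
    if not fitting:
        return None
    # Among the variants that fit, prefer the highest quality score.
    return max(fitting, key=lambda q: q[2])[0]


if __name__ == "__main__":
    for vram in (2.0, 3.0, 4.0, 8.0):
        print(vram, "GB ->", pick_quant(vram))
```

The table figures appear to exclude the context-dependent KV cache, so it is worth leaving some headroom beyond the listed VRAM.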

Context window & KV cache

Adds 0.66 GB to VRAM

Long chats and RAG inputs cost real memory. KV-cache size grows linearly with context length, so a 128K context needs about four times the cache of a 32K context.

Model native max: 128K tokens. KV-cache estimate is approximate (±30 %); real usage depends on attention layout.
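
The usual back-of-the-envelope KV-cache estimate can be sketched as follows. This assumes a standard transformer cache that stores one key vector and one value vector per KV head, per layer, per token. The function name `kv_cache_bytes` and the shape used in the demo are placeholders for illustration, not Phi-3.5 Mini's actual configuration, and the page's own estimator may use a different formula.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


if __name__ == "__main__":
    # Placeholder shape, NOT the real Phi-3.5 Mini configuration.
    shape = dict(n_layers=16, n_kv_heads=8, head_dim=64)
    for ctx in (32 * 1024, 128 * 1024):
        gib = kv_cache_bytes(n_tokens=ctx, **shape) / 2**30
        print(f"{ctx} tokens: {gib:.2f} GiB")
```

Because every factor is multiplicative, halving the number of KV heads (as grouped-query attention does) or the cache precision halves the estimate, which is one reason real usage depends on attention layout.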

How to run Phi-3.5 Mini 3.8B

Pick a runtime — copy & paste. Commands are pre-filled with this model’s repo.

Ollama: the easiest option. Single command. Serves an API on :11434, with native endpoints such as /api/chat and OpenAI-compatible endpoints under /v1.

  1. Pull the model

     ollama pull phi3.5

  2. Chat

     ollama run phi3.5

  3. Use as API

     curl http://localhost:11434/api/chat \
       -d '{"model":"phi3.5","messages":[{"role":"user","content":"Hi"}]}'
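
The same API call can be made from Python with only the standard library. This is a minimal sketch that assumes an Ollama server is already running on localhost:11434 with `phi3.5` pulled. The helper names `build_chat_payload` and `chat` are illustrative, not part of any Ollama client library. Setting `"stream": false` asks Ollama for a single JSON reply instead of the default stream of JSON lines.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt: str, model: str = "phi3.5") -> dict:
    """Build the request body for a one-turn, non-streaming chat."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete JSON object instead of a stream
    }


def chat(prompt: str, model: str = "phi3.5", url: str = OLLAMA_CHAT_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


# Example (requires a running Ollama server):
#   print(chat("Hi"))
```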

Self-host serving plan

Want to host Phi-3.5 Mini 3.8B for many users? Or run it on a card that’s technically too small?

VRAM needed: 3.7 GB (2.7 GB weights + 0.5 GB KV)

Aggregate tok/s: 66 (across 1 user)

Per-user tok/s: 66 (3.8B dense)

✅ Fits in 24 GB VRAM with 20.3 GB headroom. Pure-GPU inference — full speed.

Throughput is a sub-linear estimate: doubling users adds ~70 % of single-user TPS until ~8, then plateaus on memory bandwidth. MoE models scale concurrency much better because each user activates a different subset of experts.
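
One way to read that rule is sketched below. It is an interpretation, not the site's actual estimator: each doubling of concurrent users adds 70% of the single-user rate to the aggregate, and nothing more is gained past 8 users. The function names and the logarithmic interpolation between powers of two are assumptions made for illustration.

```python
import math


def aggregate_tps(single_user_tps: float, users: int,
                  gain_per_doubling: float = 0.70,
                  plateau_users: int = 8) -> float:
    """Aggregate tokens/sec across all users under the doubling rule."""
    if users < 1:
        raise ValueError("users must be at least 1")
    # Past the plateau, extra users add no aggregate throughput.
    effective = min(users, plateau_users)
    return single_user_tps * (1 + gain_per_doubling * math.log2(effective))


def per_user_tps(single_user_tps: float, users: int) -> float:
    """Each user's share of the aggregate rate."""
    return aggregate_tps(single_user_tps, users) / users


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(aggregate_tps(66, n), 1), round(per_user_tps(66, n), 1))
```

With the page's single-user figure of 66 tok/s, this reading gives about 112 tok/s aggregate for two users, and per-user speed keeps falling once the plateau is reached.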

faq·common questions
how much VRAM do I need to run Phi-3.5 Mini 3.8B?

Phi-3.5 Mini 3.8B requires 2.73 GB VRAM minimum with Q4_K_M quantization. The highest-quality variant listed, Q8_0, needs 4.28 GB. Q8_0 is an 8-bit quantization, not full precision.

which quant should I pick?

Q4_K_M is the best quality/VRAM balance: the quantization table rates it at 85% quality, and at 4.5 bits per weight it takes roughly 28% of the FP16 footprint. Q8_0 is near-lossless (98%) if you have the headroom for 4.28 GB.
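
The footprint comparison with FP16 follows from bits per weight. A minimal sketch, using the Bits column from the quantization table and assuming FP16 stores 16 bits per weight; the ratio covers weights only and ignores file metadata and runtime overhead. The name `footprint_vs_fp16` is illustrative.

```python
FP16_BITS = 16

# Bits per weight, copied from the quantization table.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8}


def footprint_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of the FP16 weight footprint a quantization needs."""
    return bits_per_weight / FP16_BITS


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {footprint_vs_fp16(bits):.0%} of FP16")
```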