Multi-GPU planner
Split a model across N cards
Closes the apxml / devtk parity gap: how does a 70B / 405B model land on a stack of GPUs? We model both tensor-parallel and pipeline-parallel splits, plus KV cache and per-card overhead.
Your GPU stack: 2 cards · 48 GB total (cards #1 and #2, 24 GB each)
Split strategy: Tensor parallel · 2-way split
What you want to run: Llama 3.1 70B Instruct · fits every card

Per-card weights: 20.05 GB
Per-card KV cache: 1.25 GB
Per-card overhead: 0.50 GB
Total per card: 21.80 GB · headroom on the smallest card (24 GB): 0.28 GB
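To make the arithmetic above easy to check, here is a minimal Python sketch. The 40.1 GB weight total and 2.5 GB KV total are back-calculated from the per-card figures rather than read from the tool, and the usable-VRAM fraction used for the headroom line is an inference: 24 GB minus 21.80 GB would otherwise leave 2.20 GB, not the quoted 0.28 GB.

```python
# Reproduce the worked example above. Totals are assumptions inferred from
# the per-card figures, not values taken from the planner itself.
N_CARDS = 2
TOTAL_WEIGHTS_GB = 40.10   # assumed: 2 x 20.05 GB per-card weights
TOTAL_KV_GB = 2.50         # assumed: 2 x 1.25 GB per-card KV cache
OVERHEAD_GB = 0.50         # per-card overhead from the breakdown above

per_card_weights = TOTAL_WEIGHTS_GB / N_CARDS            # 20.05 GB
per_card_kv = TOTAL_KV_GB / N_CARDS                      # 1.25 GB
total_per_card = per_card_weights + per_card_kv + OVERHEAD_GB  # 21.80 GB

# The quoted 0.28 GB headroom implies the planner compares against usable
# rather than nominal VRAM; a ~92% usable fraction reproduces the figure.
# This fraction is an inference, not something the tool documents.
USABLE_FRACTION = 0.92
headroom = USABLE_FRACTION * 24 - total_per_card         # ~0.28 GB

print(f"{total_per_card:.2f} GB per card, ~{headroom:.2f} GB headroom")
```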
Math note: tensor parallel divides the weights and the KV cache evenly across cards, plus ~0.5 GB of overhead per card. Pipeline parallel adds ~0.8 GB per card for activation buffering. Real-world behaviour depends on the runtime: vLLM and TGI handle tensor splits natively, while llama.cpp's `--n-gpu-layers` behaves more like a pipeline split.
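A rough sketch of how the note's rules translate to code, assuming both strategies divide weights and KV cache evenly across cards and differ only in the per-card overhead constant; the function name and the fit check against nominal card capacity are illustrative choices, not the planner's actual implementation.

```python
def per_card_plan(total_weights_gb, total_kv_gb, card_capacities_gb, strategy="tensor"):
    """Estimate per-card memory for an even N-way split.

    strategy: "tensor" (~0.5 GB/card overhead) or "pipeline" (~0.8 GB/card,
    per the math note). Treating pipeline splits as perfectly even is an
    approximation, since layer-granular splits rarely divide exactly.
    """
    n = len(card_capacities_gb)
    overhead = 0.5 if strategy == "tensor" else 0.8
    per_card = total_weights_gb / n + total_kv_gb / n + overhead
    fits = all(per_card <= cap for cap in card_capacities_gb)
    return per_card, fits

# Two 24 GB cards, 40.1 GB of weights, 2.5 GB of KV cache (assumed totals):
for strategy in ("tensor", "pipeline"):
    per_card, fits = per_card_plan(40.1, 2.5, [24, 24], strategy)
    print(f"{strategy:8s}: {per_card:.2f} GB/card, fits every card: {fits}")
```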
Single-card view of Llama 3.1 70B Instruct → · Rent a multi-GPU box →