Multi-GPU planner

Split a model across N cards

Closes the apxml / devtk parity gap: how does a 70B or 405B model land on a stack of GPUs? We model both tensor-parallel and pipeline-parallel splits, plus KV cache and per-card overhead.

Your GPU stack

2 cards · 48 GB total
#1 · 24 GB
#2 · 24 GB

Split strategy: tensor parallel

What you want to run: Llama 3.1 70B Instruct · fits every card

Tensor parallel · 2-way split

Per-card weights: 20.05 GB
Per-card KV: 1.25 GB
Per-card overhead: 0.50 GB

Total per card: 21.80 GB · headroom on the smallest card (24 GB): 0.28 GB
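
Spelling the sum out in a few lines of Python (the totals below are just twice the per-card figures quoted above; the usable-VRAM remark in the comments is an inference from the quoted headroom, not a setting the planner documents):

```python
# Re-deriving the 2-way tensor-parallel figures shown above.
total_weights_gb = 40.10      # 2 x 20.05 GB per-card weights
total_kv_gb = 2.50            # 2 x 1.25 GB per-card KV cache
tp_overhead_gb = 0.50         # flat per-card overhead for tensor parallel

per_card_gb = total_weights_gb / 2 + total_kv_gb / 2 + tp_overhead_gb
print(f"{per_card_gb:.2f} GB per card")   # 21.80 GB per card

# The quoted 0.28 GB of headroom on a 24 GB card is less than 24 - 21.80,
# which suggests headroom is measured against usable VRAM of roughly
# 22.1 GB per card rather than the raw 24 GB (an inference from the
# numbers above, not a documented planner setting).
```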

Math note: tensor parallel divides weights and KV cache evenly across cards, plus ~0.5 GB of overhead per card; pipeline parallel adds ~0.8 GB per card for activation buffering. Real-world behaviour depends on the runtime: vLLM and TGI handle tensor splits natively, while llama.cpp's layer offload (`--n-gpu-layers`, spread across GPUs with `--split-mode layer` and `--tensor-split`) is closer to pipeline.
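
As a minimal sketch of that rule, assuming an even n-way split and reading the pipeline figure as additional to the 0.5 GB base (the function name and that reading are assumptions, not the planner's actual code):

```python
def per_card_estimate_gb(weights_gb: float, kv_gb: float, n_cards: int,
                         strategy: str = "tensor") -> float:
    """Estimate VRAM needed on each card for an even n-way split.

    weights_gb and kv_gb are totals for the whole model; both are divided
    evenly across the cards, then a flat per-card overhead is added.
    """
    overhead = 0.5                 # ~0.5 GB per card for a tensor-parallel split
    if strategy == "pipeline":
        overhead += 0.8            # ~0.8 GB extra for activation buffering
    return weights_gb / n_cards + kv_gb / n_cards + overhead


# Reproduces the example above: 40.10/2 + 2.50/2 + 0.5 = 21.80 GB per card.
print(f"{per_card_estimate_gb(weights_gb=40.10, kv_gb=2.50, n_cards=2):.2f} GB per card")
```

For a real plan you would swap in the quantized weight size and the KV-cache size for your target batch and context length, then compare the result to each card's usable VRAM rather than its raw capacity.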

Single-card view of Llama 3.1 70B Instruct · Rent a multi-GPU box →