VRAM Explained: How Much Do You Need for AI Models?
VRAM (Video Random Access Memory) is the single most important specification for running AI models locally. Understanding how models use VRAM helps you choose the right hardware, pick appropriate model sizes, and troubleshoot memory issues. This guide explains everything you need to know.
What VRAM is and why it matters
VRAM is the dedicated memory on your graphics card. It is physically separate from your computer's main RAM and is connected to the GPU chip with very high bandwidth. When you load an AI model, the model weights must be stored in VRAM for the GPU to process them quickly. If the weights do not fit in VRAM, the GPU cannot run the model at full speed. The model either fails to load, runs partially on CPU (much slower), or uses disk swapping (extremely slow).
For Apple Silicon Macs, the concept is slightly different. Apple chips use unified memory that is shared between the CPU and GPU. This means your total system RAM functions similarly to VRAM, though the operating system and other applications also need some of that memory.
How to calculate VRAM requirements
The VRAM a model needs depends on three factors: parameter count, quantization level, and overhead. The base formula is straightforward. Take the number of parameters in billions, multiply by the bytes per parameter for your quantization level, and add overhead for the KV cache and runtime.
For common quantization levels, the bytes per parameter are approximately: FP16 uses 2 bytes, Q8 uses 1 byte, Q5_K_M uses about 0.69 bytes, Q4_K_M uses about 0.56 bytes, and Q3_K_M uses about 0.44 bytes. So a 7B model in Q4_K_M uses roughly 7 times 0.56, which equals about 3.9GB for weights alone. Add approximately 0.5 to 1.5GB for the KV cache and runtime overhead, and you need about 4.5 to 5.5GB total. This is why a 6GB GPU can run 7B Q4_K_M but feels tight.
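The calculation above can be sketched as a small Python helper. The bytes-per-parameter table and the overhead range come from this section; the function name and its defaults are illustrative, not from any particular tool.

```python
# Approximate bytes per parameter for common quantization levels,
# as given in the text above.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.69,
    "Q4_K_M": 0.56,
    "Q3_K_M": 0.44,
}

def estimate_vram_gb(params_billions, quant="Q4_K_M", overhead_gb=1.0):
    """Rough VRAM estimate: weights plus KV cache and runtime overhead.

    An overhead_gb of 0.5 to 1.5 covers a moderate (2K-4K token) context.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb

# A 7B model in Q4_K_M: ~3.9GB of weights, ~4.9GB with 1GB of overhead
print(round(estimate_vram_gb(7), 1))  # prints 4.9
```

Varying the overhead argument between 0.5 and 1.5 reproduces the 4.5 to 5.5GB range quoted above, which is why an 8GB card is comfortable for 7B and a 6GB card is tight.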
VRAM requirements by model size
Here are practical VRAM needs for popular model sizes in Q4_K_M quantization, the most common choice. A 1B to 3B model needs 2 to 3GB and runs on any modern GPU. A 7B model needs 5 to 6GB and requires an 8GB GPU for comfortable use. A 13B to 14B model needs 9 to 11GB and requires a 12GB or 16GB GPU. A 32B to 34B model needs 20 to 22GB and requires a 24GB GPU. A 70B model needs 40 to 44GB and requires a 48GB GPU, multi-GPU setup, or high-memory Apple Silicon.
These numbers assume a moderate context length of 2048 to 4096 tokens. Longer context windows increase KV cache size and require additional VRAM. Doubling the context length adds roughly 0.5 to 2GB depending on model architecture and size.
The KV cache: the hidden VRAM consumer
When a model processes text, it stores intermediate attention results, called keys and values, for each token in the context. This KV cache grows with the context length and can consume significant VRAM. For a 7B model with a 4096 token context, the KV cache needs about 0.5GB. At 32K context, it can grow to 2 to 4GB. At 128K context, some models need 8GB or more just for the KV cache, on top of the model weights.
This is why your model might load successfully but then run out of memory during a long conversation. The model weights fit, but as the conversation grows, the KV cache eventually exceeds available VRAM. If this happens, reduce your context length setting or use a model with GQA (Grouped Query Attention), which reduces KV cache size by sharing attention heads.
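The KV cache size can be computed directly from the model architecture, and the formula makes the GQA effect visible: fewer KV heads means a proportionally smaller cache. This is a sketch; the example dimensions (32 layers, 8 KV heads, head dimension 128) are a Llama-3-8B-style configuration used for illustration.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size: keys and values (hence the factor of 2) stored
    per layer, per KV head, per token, in FP16 (2 bytes) by default."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128.
print(kv_cache_gb(32, 8, 128, 4096))   # ~0.5GB at 4K context
print(kv_cache_gb(32, 8, 128, 32768))  # ~4GB at 32K context
```

With 32 KV heads instead of 8 (no GQA), the same model would need about 2GB at 4K context, which is the savings the paragraph above refers to.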
GPU offloading: when the model does not fully fit
Most local inference tools support partial GPU offloading, where some model layers run on the GPU and the rest run on the CPU using system RAM. This lets you run models larger than your VRAM at the cost of reduced speed. The performance depends on how many layers fit in VRAM, and it degrades faster than the offload percentage suggests: because CPU layers are many times slower than GPU layers, even 20 percent of layers on CPU can cut throughput by half or more. Below roughly 30 percent GPU offload, the CPU becomes the bottleneck and you might as well run entirely on CPU.
In Ollama, GPU offloading happens automatically. LM Studio lets you configure the number of GPU layers manually. The optimal setting is to offload as many layers as fit in your VRAM while leaving 0.5 to 1GB free for the KV cache and overhead.
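The "offload as many layers as fit" rule comes down to simple arithmetic: divide the VRAM left after the reserve by the approximate per-layer weight size. This helper is illustrative, not an actual setting in Ollama or LM Studio; the 13B example assumes ~7.3GB of Q4_K_M weights spread over 40 layers.

```python
def gpu_layers_to_offload(vram_gb, weights_gb, total_layers, reserve_gb=1.0):
    """How many transformer layers fit on the GPU while leaving
    reserve_gb free for the KV cache and runtime overhead."""
    per_layer_gb = weights_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(total_layers, int(usable_gb / per_layer_gb))

# A 13B Q4_K_M model (~7.3GB of weights, 40 layers) on an 8GB GPU:
print(gpu_layers_to_offload(8, 7.3, 40))  # prints 38
```

The result is the number you would enter as the GPU layer count in LM Studio; with 24GB of VRAM the same call returns all 40 layers.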
Tips for maximizing available VRAM
Close all unnecessary applications before loading a model. Web browsers, video players, and even desktop widgets consume VRAM. On Windows, the desktop compositor uses about 0.5GB of VRAM by default. On Linux, switching to a lightweight window manager can reclaim that. Disable hardware acceleration in your browser if you are not closing it. Use the smallest quantization that meets your quality needs. Moving from Q5_K_M to Q4_K_M saves about 20 percent VRAM. Set your context length to the minimum you actually need rather than the maximum the model supports.
Monitoring VRAM usage
On NVIDIA GPUs, run nvidia-smi in the terminal to see current VRAM usage. On macOS, open Activity Monitor and check the GPU section. On Windows, Task Manager's Performance tab shows GPU memory. Monitor these during model loading and during a conversation to understand your VRAM consumption patterns. If usage creeps up during a conversation, your KV cache is growing and you may eventually hit the limit.
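On NVIDIA hardware, nvidia-smi can also emit machine-readable CSV, which makes it easy to watch VRAM from a script. The query flags below are part of nvidia-smi's documented query interface; the wrapper and parser functions themselves are a sketch.

```python
import subprocess

def parse_smi_csv(output):
    """Parse 'used, total' CSV lines (one per GPU) into integer MiB pairs."""
    return [tuple(int(v.strip()) for v in line.split(","))
            for line in output.strip().splitlines()]

def gpu_memory_mib():
    """Return (used, total) VRAM in MiB for each NVIDIA GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

Calling gpu_memory_mib() in a loop during a long conversation will show the KV cache growth described above as a steadily rising "used" figure.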
Planning for the future
AI models are growing in capability at every size class. The 7B models of 2026 are far better than the 7B models of 2024. This trend means you do not necessarily need more VRAM each year to get better results. However, if you want to run the latest and largest models, VRAM requirements do increase. Buying a GPU with at least 12GB in 2026 gives you access to most 7B and some 14B models, which will remain the sweet spot for personal use for the foreseeable future. If your budget allows, 16GB or 24GB provides meaningful headroom.