Advanced · 9 min read · Updated 2026-04-08

Quantization Explained: Q4, Q5, Q8, and FP16 Compared

Quantization is the process of reducing the precision of a model's numerical weights to make it smaller and faster. It is the single most important technique that makes running large AI models on consumer hardware possible. This guide explains each quantization level, its quality impact, and when to use it.

What quantization actually does

A neural network stores its knowledge in billions of numerical weights. In full precision (FP32), each weight uses 32 bits of memory, so a 7 billion parameter model would need about 28GB just for weights. Quantization reduces the number of bits per weight: FP16 uses 16 bits, halving the size; Q8 uses 8 bits, quartering it; Q4 uses 4 bits, reducing it to one-eighth of the original.

The key insight is that neural networks are remarkably tolerant of reduced precision. A 7B model quantized to Q4 needs only about 4GB of memory and produces output that is surprisingly close to the full-precision version.
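The arithmetic above reduces to one formula: weight memory ≈ parameter count × bits per weight ÷ 8. A quick sketch (these figures cover weights only and ignore runtime overhead such as the KV cache; real GGUF files also run slightly larger because quantized formats store scale factors alongside the weights):

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits / 8 bits-per-byte."""
    return params_billions * bits_per_weight / 8

# A 7B model at the precisions discussed in this guide:
print(weight_size_gb(7, 32))  # FP32 -> 28.0 GB
print(weight_size_gb(7, 16))  # FP16 -> 14.0 GB
print(weight_size_gb(7, 8))   # Q8   -> 7.0 GB
print(weight_size_gb(7, 4))   # Q4   -> 3.5 GB
```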

FP16: Full quality, full size

FP16 (16-bit floating point) is often treated as the quality baseline. It uses half the memory of FP32 with virtually no quality loss; a 7B model in FP16 needs about 14GB of VRAM. This is the format used for fine-tuning and for getting the absolute best output from a model, which is why it is most common among researchers and people fine-tuning models. Most users will never need FP16 for inference because the quality difference versus Q8 is minimal, but if you have the VRAM to spare and want zero compromises, FP16 is the way to go.

Q8_0: Near-perfect quality

Q8_0 (8-bit quantization) reduces each weight to 8 bits, cutting memory roughly in half compared to FP16. A 7B model in Q8 needs about 7.5GB. The quality loss from Q8 quantization is essentially undetectable in normal use: in side-by-side comparisons, readers generally cannot tell Q8 output from FP16 output. Use Q8 when you have enough VRAM and want maximum quality without the memory overhead of FP16. It is an excellent choice for 7B models on 12GB and 16GB GPUs.

Q5_K_M: The quality sweet spot

Q5_K_M uses approximately 5.5 bits per weight on average through a technique called importance-aware mixed quantization. More important layers keep higher precision while less critical layers are quantized more aggressively. A 7B model in Q5_K_M needs about 5.5GB. This is the sweet spot for users who want both good quality and reasonable memory usage. The output quality is very close to Q8, with slight differences detectable only on complex reasoning tasks or when generating very long outputs. Q5_K_M is our recommended quantization for users with 8GB GPUs running 7B models.
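To see how a mixed scheme lands on a fractional average like 5.5 bits, here is a toy calculation (the layer split and bit choices are illustrative only, not the actual Q5_K_M recipe):

```python
# Hypothetical split: half the weights (in important layers) keep 6 bits,
# the other half are quantized more aggressively to 5 bits.
layers = [
    ("attention", 6.0, 0.5),     # (name, bits per weight, fraction of weights)
    ("feed-forward", 5.0, 0.5),
]
avg_bits = sum(bits * frac for _, bits, frac in layers)
print(avg_bits)  # 5.5 bits per weight on average
```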

Q4_K_M: The practical default

Q4_K_M is the most widely used quantization level and the default in many tools. It uses approximately 4.5 bits per weight, and a 7B model needs about 4.5GB. The quality tradeoff becomes noticeable here but remains acceptable for most use cases: slightly less coherent reasoning on complex multi-step problems, occasional word choice differences, and minor degradation in following intricate instructions. For general conversation, creative writing, coding assistance, and most everyday tasks, Q4_K_M output holds up well. It is the recommended choice when you want to fit the largest possible model into your available VRAM.

Q3_K and below: Last resort

Below Q4, quality degrades more noticeably. Q3_K_M uses about 3.5 bits per weight, and models start showing increased repetition, less coherent reasoning, and occasional factual errors that the higher-precision versions would not make. Q2_K pushes weights to roughly 2.5 bits and is primarily useful for fitting very large models into limited VRAM when no smaller model will do. The IQ (importance quantization) variants like IQ4_XS and IQ3_M use more sophisticated quantization algorithms that partially mitigate quality loss at low bit widths. If you must go below Q4, prefer IQ variants over standard Q variants.
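The per-level sizes discussed above can be regenerated from the approximate bits-per-weight figures (the values below are the rough averages given in this guide; actual file sizes vary by model architecture). Note that these weights-only estimates land a little below some of the in-use figures quoted earlier, since serving a model also needs memory for activations and the KV cache:

```python
QUANT_BITS = {  # approximate average bits per weight, per this guide
    "FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5,
    "Q4_K_M": 4.5, "Q3_K_M": 3.5, "Q2_K": 2.5,
}

for name, bits in QUANT_BITS.items():
    size_gb = 7 * bits / 8  # weight memory for a 7B model
    print(f"{name:8s} ~{size_gb:.1f} GB")
```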

How to choose: a practical decision framework

Start by identifying the largest model that fits your VRAM in Q4_K_M quantization. Then consider whether moving to a smaller model at Q5_K_M or Q8 might give better results. A 7B model in Q8 often outperforms a 13B model in Q3_K because the smaller model retains more of its original quality. The general rule: when quality matters, prefer a higher-precision quantization of a smaller model over a more aggressive quantization of a larger one. The exception is when the task specifically benefits from the larger model's broader knowledge, such as niche domain questions or complex multilingual translation.
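The framework above can be sketched as a small helper. The 0.8 headroom factor and the bits-per-weight figures are assumptions for illustration; real budgets depend on context length and backend overhead:

```python
QUANT_BITS = {"Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.5}  # approx. avg bits/weight

def weight_size_gb(params_billions, bits):
    return params_billions * bits / 8

def best_quant_per_model(vram_gb, model_sizes_b, headroom=0.8):
    """For each model size, find the highest-precision quant whose weights
    fit the VRAM budget, leaving headroom for the KV cache and activations."""
    budget = vram_gb * headroom
    fits = {}
    for params in model_sizes_b:
        for quant in ("Q8_0", "Q5_K_M", "Q4_K_M"):  # highest precision first
            if weight_size_gb(params, QUANT_BITS[quant]) <= budget:
                fits[params] = quant
                break
    return fits

# On an 8GB GPU a 7B model lands on Q5_K_M and 13B does not fit at all,
# matching the recommendation earlier in this guide.
print(best_quant_per_model(8, [7, 13]))   # {7: 'Q5_K_M'}
print(best_quant_per_model(12, [7, 13]))  # {7: 'Q8_0', 13: 'Q5_K_M'}
```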

Checking quality for yourself

The best way to evaluate quantization quality is to test it on your actual use cases. Download the Q4_K_M and Q8 versions of the same model, run your typical prompts through both, and compare the outputs. Pay attention to reasoning chains, factual accuracy, and instruction following rather than surface-level fluency. Most quantized models are fluent regardless of precision level. The differences show up in the accuracy and coherence of the content.
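A minimal harness for that comparison might look like this. The `generate` callable is a placeholder for whatever backend you use (a llama.cpp or Ollama call, for instance); a stub stands in here so the structure is clear:

```python
def compare_quants(generate, model_a, model_b, prompts):
    """Run the same prompts through two quantized builds of a model and
    collect the outputs side by side for manual review."""
    rows = []
    for prompt in prompts:
        rows.append({
            "prompt": prompt,
            model_a: generate(model_a, prompt),
            model_b: generate(model_b, prompt),
        })
    return rows

# Stub backend for illustration; swap in a real inference call.
def generate(model, prompt):
    return f"[{model}] response to: {prompt}"

for row in compare_quants(generate, "model-q4_k_m", "model-q8_0",
                          ["Summarize the causes of WWI.",
                           "Write a function to merge two sorted lists."]):
    print(row["prompt"])
```

Reviewing the paired outputs prompt by prompt makes reasoning and accuracy differences much easier to spot than reading each model's transcript in isolation.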