Quantization in LLMs

Trading numerical precision for efficiency.

Outcome School

Mar 15, 2026

I’m Amit Shekhar from Outcome School, where I teach AI and Machine Learning.

Let’s get started.

What is Quantization?

Quantization lets us run large AI models on smaller hardware by trading numerical precision for efficiency.

An LLM is just a file full of numbers

Every LLM is essentially a large file containing billions of weights. These weights are the numbers the model learned during training. They define how the model thinks.

Each weight is stored with a precision

In full precision, each weight is a 32-bit floating-point number (4 bytes) - highly precise, but expensive in memory. A 7B model has 7 billion of these. That adds up to ~28 GB (7 billion × 4 bytes) just to load it.
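To make the arithmetic concrete, here is a minimal sketch of the size calculation. The function name weights_size_gb is hypothetical, and it assumes 1 GB = 10^9 bytes and counts only the weights, not activations or any other runtime memory.

```python
def weights_size_gb(num_weights: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = num_weights * bits_per_weight // 8  # 8 bits per byte
    return total_bytes / 1e9

# A 7B model stored as 32-bit floats.
fp32_gb = weights_size_gb(7_000_000_000, 32)
print(f"7B model at 32 bits per weight: ~{fp32_gb:.0f} GB")  # ~28 GB
```

Passing a smaller bits_per_weight to the same function shows how the size shrinks as precision drops.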

Quantization reduces that precision

Instead of storing a float like 1.9182736 with about 7 significant digits (FP32, 4 bytes per weight), quantization stores it with just 3-4 significant digits (FP16, 2 bytes per weight), like 1.918. The model gets smaller and faster - with only a small drop in quality.

FP32: ~7 significant digits, e.g. 1.9182736 (float, full precision)

FP16: ~3–4 significant digits, e.g. 1.918 (float, half precision)
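We can check the FP32 vs FP16 difference ourselves. The sketch below uses Python's built-in struct module, whose "f" and "e" format codes store a number as a 32-bit and a 16-bit float respectively. The helper names to_fp32 and to_fp16 are made up for this example. Real inference libraries do this conversion on whole tensors, not one number at a time.

```python
import struct

def to_fp32(x: float) -> float:
    """Round x to the nearest 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_fp16(x: float) -> float:
    """Round x to the nearest 16-bit float and read it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

weight = 1.9182736

as_fp32 = to_fp32(weight)
as_fp16 = to_fp16(weight)

print(f"FP32 keeps: {as_fp32:.7f}")
print(f"FP16 keeps: {as_fp16:.3f}")  # 1.918
print(f"Exact value stored in FP16: {as_fp16}")  # 1.91796875
```

The FP16 copy is not exactly 1.918. It is the closest value that 16 bits can represent, which happens to round to 1.918 when printed with 3 decimals.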

If you want to learn about LLMs, RAG, MCP, agents, fine-tuning, and quantization in depth, refer to AI Engineering Explained: LLM, RAG, MCP, Agent, Fine-Tuning, Quantization.

For a quick reference on model sizes, FP32 vs FP16: a 7B model needs ~28 GB in FP32 and ~14 GB in FP16.

Each weight in an LLM is a floating-point number stored with a certain precision. So by going from FP32 to FP16, we are cutting memory in half with minimal quality loss. This is one of the most common quantization steps used in production LLM inference today.
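Here is a small sketch of both halves of that claim: half the bytes, and only a small change in the stored values. The sample weights are made up for illustration, and the example again relies on the struct module's "f" (FP32) and "e" (FP16) format codes.

```python
import struct

# Hypothetical sample weights, chosen only for illustration.
weights = [1.9182736, -0.4721, 0.0315, 0.8, -1.25]
n = len(weights)

# Store the same weights as FP32 (4 bytes each) and FP16 (2 bytes each).
fp32_bytes = struct.pack(f"<{n}f", *weights)
fp16_bytes = struct.pack(f"<{n}e", *weights)

# Read the stored values back to see what each format actually kept.
fp32_values = struct.unpack(f"<{n}f", fp32_bytes)
fp16_values = struct.unpack(f"<{n}e", fp16_bytes)

# Largest relative difference between the FP16 and FP32 copies.
max_rel_err = max(abs(h - f) / abs(f) for f, h in zip(fp32_values, fp16_values))

print(f"FP32: {len(fp32_bytes)} bytes, FP16: {len(fp16_bytes)} bytes")  # 20 vs 10
print(f"Largest relative error: {max_rel_err:.6f}")
```

For these sample values the error stays below 0.1%. How much that matters for a real model depends on the model and the task, so treat this as an intuition builder, not a quality benchmark.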

Thanks

Amit Shekhar

Founder, Outcome School
