Quantization: From Max Planck to Faster Vector Search

In 1900, Max Planck discovered that energy comes in discrete packets. 125 years later, the same idea lets vector databases, LLMs, and on-device AI work at scale.

Mar 14, 2026·6 min read

In 1900, Max Planck changed physics forever. He discovered that energy isn't continuous; it comes in discrete "packets" called quanta. Instead of a smooth, unbroken curve, nature works in steps. That idea, quantization, would go on to reshape our understanding of the universe.

Fast forward 125 years, and we're using the same concept to make your vector database faster.

Physics → Computing

In physics, quantization meant:

Continuous energy → discrete energy levels

In computing, it's:

Continuous values → discrete buckets

The math is similar: you take a wide, high-resolution range of possible values and map it to a smaller, fixed set.

The Practical Case: Vector Search

Say your Qdrant index stores 100 million embeddings, each a vector of float32 values.

float32: 4 bytes per value, ~7 decimal digits of precision
int8: 1 byte per value, 256 possible discrete values

Quantization maps those float32 values down to int8 values using scaling and rounding. The values are less precise, but the memory footprint of the vectors drops by ~75%. That means faster lookups, more cache hits, and cheaper storage, often without meaningful accuracy loss in nearest-neighbor search.
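To make the ~75% figure concrete, here is a back-of-the-envelope sketch. The article doesn't say how many dimensions each embedding has, so the 768 below is an assumption, as are the helper names.

```python
# Rough storage estimate for raw vectors only (no index overhead).
NUM_VECTORS = 100_000_000
DIM = 768  # assumption: embedding dimension is not given in the article
BYTES_PER_VALUE = {"float32": 4, "int8": 1}


def index_size_gb(dtype: str) -> float:
    """Raw vector storage in GB (10**9 bytes) for the given value type."""
    return NUM_VECTORS * DIM * BYTES_PER_VALUE[dtype] / 1e9


float32_gb = index_size_gb("float32")
int8_gb = index_size_gb("int8")
savings = 1 - int8_gb / float32_gb

print(f"float32: {float32_gb:.1f} GB")   # 307.2 GB
print(f"int8:    {int8_gb:.1f} GB")      # 76.8 GB
print(f"savings: {savings:.0%}")         # 75%
```

The savings ratio is the same for any dimension, since it only depends on 4 bytes shrinking to 1.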

A Familiar Analogy: Digital Audio

If you've ever digitized music, you've already done quantization. Consider what happens when you convert vinyl records to digital:

Analog vinyl: Sound waves create smooth, continuous variations in the groove depth. Every microscopic position along the groove corresponds to a unique amplitude value, so there are infinitely many possibilities.

Digital conversion: Your audio interface samples that continuous wave thousands of times per second and rounds each measurement to the nearest value in a fixed set of digital "steps."

For CD-quality audio, we use:

16-bit depth: 65,536 possible amplitude levels (2¹⁶)
44.1 kHz sampling rate: 44,100 samples per second

So a 3-minute stereo song becomes 3 × 60 × 44,100 × 2 = 15,876,000 (about 15.9 million) discrete data points, each rounded to one of 65,536 possible values. The factor of 2 accounts for the two stereo channels.
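Here is a minimal sketch of that sampling-and-rounding step. It assumes amplitudes normalized to [-1.0, 1.0] and a simple symmetric scaling to 16-bit integers; real converters differ in detail, and the function name is made up for illustration.

```python
import math

SAMPLE_RATE = 44_100      # samples per second
BIT_DEPTH = 16
CHANNELS = 2              # stereo
LEVELS = 2 ** BIT_DEPTH   # 65,536 amplitude levels


def quantize_sample(amplitude: float) -> int:
    """Round an amplitude in [-1.0, 1.0] to a signed 16-bit step."""
    clipped = max(-1.0, min(1.0, amplitude))   # clip anything out of range
    return round(clipped * (LEVELS // 2 - 1))  # scale to [-32767, 32767]


# Data points in a 3-minute stereo track.
total_samples = 3 * 60 * SAMPLE_RATE * CHANNELS

# One millisecond of a 440 Hz tone, sampled and quantized.
tone = [
    quantize_sample(math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE // 1000)
]

print(total_samples)  # 15876000
print(tone[:5])
```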

The trade-off is identical to vector quantization:

Analog/float32: Very fine precision (effectively unlimited for analog), large storage requirements
Digital/int8: A fixed set of precision levels, dramatically smaller files
Perceptual quality: Often indistinguishable to the end user

When Spotify streams that song to your phone, you're hearing quantized audio that's been compressed even further. Yet it sounds nearly identical to the original studio recording. Your embeddings work the same way: we trade mathematical precision for practical performance, and the "listener" (your search algorithm) rarely notices the difference.

Under the Hood

Quantization in ML/vector DBs is usually a linear mapping:

1. Find the range of values in your vector:

min_val = -1.23
max_val =  2.45

2. Scale to fit the target integer range (e.g., -128 to 127 for signed int8):

scale = (max_val - min_val) / 255

3. Round and store:

int_val = round((float_val - min_val) / scale) - 128

Dequantize later by reversing the math when you need the approximate float.

The whole trick is that you keep a small set of metadata, min_val and scale, so you can recover a close approximation of the original numbers without storing them in full precision.
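The three steps and the reverse mapping can be sketched in a few lines of plain Python. The function names are illustrative, and the clamping of out-of-range inputs is an addition that the formulas above don't spell out.

```python
def quantize(values, min_val, max_val):
    """Map floats in [min_val, max_val] to signed int8 codes in [-128, 127]."""
    scale = (max_val - min_val) / 255
    codes = []
    for v in values:
        v = max(min_val, min(max_val, v))  # assumption: clamp out-of-range inputs
        codes.append(round((v - min_val) / scale) - 128)
    return codes, scale


def dequantize(codes, min_val, scale):
    """Recover approximate floats from int8 codes using the stored metadata."""
    return [(c + 128) * scale + min_val for c in codes]


vector = [-1.23, 0.0, 2.45]
codes, scale = quantize(vector, min_val=-1.23, max_val=2.45)
approx = dequantize(codes, min_val=-1.23, scale=scale)

print(codes)   # [-128, -43, 127]
print(approx)  # close to the inputs, each within scale / 2
```

Only `min_val` and `scale` need to travel with the codes; `max_val` is implied by them.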

Before & After: Quantizing a Vector

Index	Original `float32`	Quantized `int8`	Dequantized (approx. float)
0	-1.2300	-128	-1.2300
1	-0.5324	-80	-0.5373
2	0.1478	-33	0.1410
3	1.0425	29	1.0357
4	2.4500	127	2.4500

The difference between the original and dequantized values is small enough that most AI models won't notice, yet your storage and compute bills will.
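How small is "small enough"? For the linear mapping above, the worst-case rounding error can be bounded. The symbols x (original value), q (stored int8 code), and x̂ (dequantized value) are introduced here for illustration, and the bound assumes x lies inside [min_val, max_val].

```latex
\hat{x} = \text{min\_val} + (q + 128)\cdot\text{scale},
\qquad
|x - \hat{x}| \;\le\; \frac{\text{scale}}{2}
= \frac{2.45 - (-1.23)}{2 \cdot 255} \approx 0.0072
```

So with this range, no dequantized value is off by more than about 0.007.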

How Search Actually Uses Quantization

Here's the lifecycle of a vector search with quantization:

Storage. Vectors are stored in quantized form (int8) in memory or on disk, which keeps the index small and cache-friendly. The float32 originals can stay on slower storage if you want refinement.
Initial Search (Approximate). ANN search runs directly on quantized values using fast integer math. Goal: quickly narrow millions of candidates to a few hundred.
Refinement (Optional). Top candidates are re-scored in float space, typically against the original full-precision vectors, because dequantizing the codes only reproduces the same rounding error. Similarity (cosine, dot product, L2) is recomputed and the final top-k results are returned.

Quantized values get you in the ballpark fast. Full-precision values win the game.
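The two-stage pattern can be sketched as follows. This is a toy under stated assumptions: the first stage is a brute-force scan rather than a real ANN index, the second stage rescores against full-precision vectors kept alongside the codes (one common way to do refinement), the metric is squared L2, and every name here is made up for illustration.

```python
import random


def quantize(vec, min_val, scale):
    """Linear float -> int8 codes, clamped to [-128, 127]."""
    return [max(-128, min(127, round((x - min_val) / scale) - 128)) for x in vec]


def sq_l2(a, b):
    """Squared Euclidean distance; works on int codes and on floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, vectors, codes, min_val, scale, k=3, oversample=10):
    # Stage 1: rank all vectors by integer distance between int8 codes.
    q_code = quantize(query, min_val, scale)
    order = sorted(range(len(codes)), key=lambda i: sq_l2(q_code, codes[i]))
    candidates = order[: k * oversample]
    # Stage 2: rescore only the shortlisted candidates in float space.
    candidates.sort(key=lambda i: sq_l2(query, vectors[i]))
    return candidates[:k]


random.seed(0)
dim, n = 16, 1000
vectors = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
min_val, max_val = -1.0, 1.0
scale = (max_val - min_val) / 255
codes = [quantize(v, min_val, scale) for v in vectors]

query = vectors[42]
top = search(query, vectors, codes, min_val, scale)
print(top)  # the first hit is index 42, the query itself
```

The `oversample` factor is the knob: a larger shortlist costs more float math but gives the second stage more chances to fix ranking mistakes made on the coarse codes.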

Where Quantization Lives in Modern AI

Quantization isn't just a vector database trick. It's become essential across the AI stack:

Vector Search & Retrieval

ANN libraries like FAISS and vector databases like Qdrant and Milvus all offer quantization as a standard option. When you're searching through millions of embeddings from OpenAI or Cohere, there's a good chance the index is quantized to keep memory usage manageable and queries fast.

Large Language Models

You've probably encountered quantized LLMs without realizing it:

Llama 3.1 8B: ~16GB at 16-bit precision, with 4-bit quantized versions around 4-5GB
Hosted models: Many inference providers serve quantized weights to handle more requests per GPU
Local AI tools: Apps like Ollama and LM Studio rely heavily on quantized models to run LLMs on consumer hardware

The same float → int8 (or even int4) mapping that works for embeddings also works for the billions of parameters in transformer models, which usually start out as 16-bit or 32-bit floats.
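The model sizes follow from simple arithmetic on bits per parameter. This sketch rounds the parameter count to 8 billion (an assumption) and ignores activations, the KV cache, and the small amount of scale metadata that quantized formats store, which is why real 4-bit files come out a little larger.

```python
def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Weight storage in GB (10**9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_8B = 8e9  # assumption: parameter count rounded to 8 billion

sizes = {bits: model_size_gb(PARAMS_8B, bits) for bits in (32, 16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits:>2}-bit: {gb:.0f} GB")
# 32-bit: 32 GB
# 16-bit: 16 GB
#  8-bit: 8 GB
#  4-bit: 4 GB
```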

Edge & Mobile AI

Quantization makes the difference between "AI that runs in the cloud" and "AI that runs in your pocket." On-device features like Siri on an iPhone or real-time translation on a Google Pixel typically depend on quantized neural networks.

Why This Matters

For large-scale AI systems:

Smaller models & indexes → fit more in RAM/SSD
Faster distance calculations → lower query latency
Often negligible quality loss → still returns the right neighbors
Democratized access → powerful models on consumer hardware

That's why quantization has become standard across the AI ecosystem. It's an old idea from quantum physics, now essential for modern machine learning at scale.

Conclusion

Here's the uncomfortable truth about modern AI: the algorithms powering your "intelligent" systems are running on deliberately degraded data.

Every embedding lookup in your RAG pipeline, every parameter in your local Llama model, every neural network inference on your phone: they're all using approximations. Numbers that should be precise are rounded to the nearest available bucket. Continuous mathematics becomes digital steps.

And it works better than the alternatives.

This isn't a compromise we've grudgingly accepted; it's the breakthrough that made large-scale AI possible. Without quantization, GPT-4 would require data centers the size of city blocks. Your vector database would collapse under its own weight. AI would still be trapped in research labs.

Planck discovered that nature itself is quantized: energy comes in discrete packets, not smooth curves. A century later, we've learned that artificial intelligence works the same way. Sometimes the most powerful insights come from throwing away information you thought you needed.

The future of AI isn't about perfect precision. It's about finding the right level of imperfection that unlocks everything else.