Quantization: From Max Planck to Faster Vector Search
Max Planck discovered nature comes in discrete packets in 1900. 125 years later, the same idea lets vector databases, LLMs, and on-device AI work at scale.
In 1900, Max Planck changed physics forever. He discovered that energy isn't continuous; it comes in discrete "packets" called quanta. Instead of a smooth, unbroken curve, nature works in steps. That idea, quantization, would go on to reshape our understanding of the universe.
Fast forward 125 years, and we're using the same concept to make your vector database faster.
Physics → Computing
In physics, quantization meant:
Continuous energy → discrete energy levels
In computing, it's:
Continuous values → discrete buckets
The math is similar: you take a wide, high-resolution range of possible values and map it to a smaller, fixed set.
The Practical Case: Vector Search
Say your Qdrant index stores 100 million embeddings, each a vector of float32 values.
- float32: 4 bytes per value, ~7 decimal digits of precision
- int8: 1 byte per value, 256 possible discrete values
Quantization maps those float32 values down to int8 values using scaling and rounding. The values are less precise, but the memory footprint of the vectors drops by ~75% (4 bytes to 1 byte per value). That means faster lookups, more cache hits, and cheaper storage, often without meaningful accuracy loss in nearest-neighbor search.
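As a rough back-of-the-envelope check, the sketch below estimates the raw size of the vectors alone. The 768-dimension figure is an assumption for illustration (the example above doesn't state one), and index overhead is ignored.

```python
# Back-of-the-envelope storage estimate for the raw vectors only.
NUM_VECTORS = 100_000_000  # from the example above
DIM = 768                  # assumption: the embedding dimension is not stated

BYTES_FLOAT32 = 4
BYTES_INT8 = 1


def raw_size_gb(num_vectors: int, dim: int, bytes_per_value: int) -> float:
    """Size of the vector payload in gigabytes (10^9 bytes), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9


float32_gb = raw_size_gb(NUM_VECTORS, DIM, BYTES_FLOAT32)
int8_gb = raw_size_gb(NUM_VECTORS, DIM, BYTES_INT8)
saving = 1 - int8_gb / float32_gb

print(f"float32: {float32_gb:.1f} GB")
print(f"int8:    {int8_gb:.1f} GB")
print(f"saving:  {saving:.0%}")
```

The 75% saving doesn't depend on the assumed dimension: it comes straight from the 4-to-1 ratio in bytes per value.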
A Familiar Analogy: Digital Audio
If you've ever digitized music, you've already done quantization. Consider what happens when you convert vinyl records to digital:
Analog vinyl: Sound waves create smooth, continuous variations in the groove depth. Every microscopic position along the groove corresponds to a unique amplitude value, so the range of possible values is effectively continuous.
Digital conversion: Your audio interface samples that continuous wave thousands of times per second and rounds each measurement to the nearest value in a fixed set of digital "steps."
For CD-quality audio, we use:
- 16-bit depth: 65,536 possible amplitude levels (2¹⁶)
- 44.1 kHz sampling rate: 44,100 samples per second
So a 3-minute stereo song becomes 3 × 60 × 44,100 × 2 = 15,876,000 (about 15.9 million) discrete samples, each rounded to one of 65,536 possible values. The × 2 accounts for the two stereo channels.
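The arithmetic and the rounding step can be sketched in a few lines. This is an illustration only: the function name and the symmetric scaling of amplitudes in [-1.0, 1.0] to signed 16-bit levels are assumptions, not a description of any particular audio interface.

```python
SAMPLE_RATE_HZ = 44_100
BIT_DEPTH = 16
CHANNELS = 2           # stereo
DURATION_S = 3 * 60    # a 3-minute song

levels = 2 ** BIT_DEPTH
total_samples = DURATION_S * SAMPLE_RATE_HZ * CHANNELS


def quantize_amplitude(x: float, bit_depth: int = BIT_DEPTH) -> int:
    """Round an amplitude in [-1.0, 1.0] to the nearest signed integer level."""
    max_level = 2 ** (bit_depth - 1) - 1   # 32767 for 16-bit audio
    clipped = max(-1.0, min(1.0, x))       # out-of-range input is clipped
    return round(clipped * max_level)


print(f"{levels:,} levels, {total_samples:,} samples")
print(quantize_amplitude(0.25))
```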
The trade-off is identical to vector quantization:
- Analog/float32: Continuous (analog) or very fine-grained (float32) precision, heavy storage requirements
- Digital/int8: Fixed precision levels, dramatically smaller files
- Perceptual quality: Often indistinguishable to the end user
When Spotify streams that song to your phone, you're hearing quantized audio that's been compressed even further. Yet it sounds nearly identical to the original studio recording. Your embeddings work the same way: we trade mathematical precision for practical performance, and the "listener" (your search algorithm) rarely notices the difference.
Under the Hood
Quantization in ML/vector DBs is usually a linear mapping:
1. Find the range of values in your vector:
    min_val = -1.23
    max_val = 2.45
2. Scale to fit the target integer range (e.g., -128 to 127 for signed int8):
    scale = (max_val - min_val) / 255
3. Round and store:
    int_val = round((float_val - min_val) / scale) - 128
Dequantize later by reversing the math when you need the approximate float.
The whole trick is that you keep a small set of metadata, min_val and scale, so you can recover a close approximation of the original numbers without storing them in full precision.
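The linear mapping described above can be sketched as a pair of small functions. This is a minimal illustration in plain Python, not any library's implementation; the function names are hypothetical, and the constant-vector guard is an addition of this sketch.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats to signed int8 codes in [-128, 127]; return (codes, min_val, scale)."""
    min_val = min(values)
    max_val = max(values)
    scale = (max_val - min_val) / 255
    if scale == 0:
        scale = 1.0  # constant vector: any non-zero scale works
    codes = [round((v - min_val) / scale) - 128 for v in values]
    return codes, min_val, scale


def dequantize(codes: List[int], min_val: float, scale: float) -> List[float]:
    """Reverse the mapping to recover approximate floats."""
    return [(c + 128) * scale + min_val for c in codes]


vector = [-1.23, -0.5324, 0.1478, 1.0425, 2.45]
codes, min_val, scale = quantize(vector)
approx = dequantize(codes, min_val, scale)
max_error = max(abs(a - b) for a, b in zip(vector, approx))

print(codes)
print([round(x, 4) for x in approx])
print(f"max error {max_error:.4f}, half a step {scale / 2:.4f}")
```

Because each value is rounded to the nearest of 256 evenly spaced levels, the round-trip error is never more than half a step, that is, scale / 2.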
Before & After: Quantizing a Vector
| Index | Original float32 | Quantized int8 | Dequantized (approx. float) |
|---|---|---|---|
| 0 | -1.2300 | -128 | -1.2300 |
| 1 | -0.5324 | -80 | -0.5373 |
| 2 | 0.1478 | -33 | 0.1410 |
| 3 | 1.0425 | 29 | 1.0357 |
| 4 | 2.4500 | 127 | 2.4500 |
The difference between the original and dequantized values is at most half a quantization step (about 0.007 for this range). That is small enough that most AI models won't notice, yet your storage and compute bills will.
How Search Actually Uses Quantization
Here's the lifecycle of a vector search with quantization:
- Storage. Vectors are stored in quantized form (int8) on disk or in memory. This keeps your index tiny and cache-friendly.
- Initial Search (Approximate). ANN search runs directly on quantized values using fast integer math. Goal: quickly narrow millions of candidates to a few hundred.
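- Refinement (Optional). Top candidates are re-scored in float space. If the original float32 vectors are kept (for example on disk), rescoring against them gives exact similarity (cosine, dot product, L2); dequantized codes alone only reproduce the approximation. Final top-k results are returned.
Quantized values get you in the ballpark fast. Full-precision rescoring wins the game.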
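Here is a toy sketch of that lifecycle in plain Python. It makes several assumptions that are not from the text: a brute-force scan stands in for a real ANN index, one shared min/max range is used for the whole collection so integer distances are comparable, the original float vectors stay available for the refinement step, and all names are hypothetical.

```python
import random
from typing import List

random.seed(0)
DIM, N = 16, 1000

# Toy collection of float vectors (stand-in for real embeddings).
vectors = [[random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(N)]

# Assumption of this sketch: one shared range for the whole collection.
MIN_VAL = min(min(v) for v in vectors)
MAX_VAL = max(max(v) for v in vectors)
SCALE = (MAX_VAL - MIN_VAL) / 255


def to_int8(vec: List[float]) -> List[int]:
    """Quantize a vector to int8 codes, clipping values outside the shared range."""
    codes = []
    for x in vec:
        clipped = max(MIN_VAL, min(MAX_VAL, x))
        codes.append(round((clipped - MIN_VAL) / SCALE) - 128)
    return codes


def sq_l2(a, b) -> float:
    """Squared Euclidean distance; works on int codes and on floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Storage: keep the compact int8 codes for the first pass.
codes = [to_int8(v) for v in vectors]


def search(query: List[float], k: int = 5, oversample: int = 10) -> List[int]:
    """Return indices of the k nearest vectors using a two-stage search."""
    q_codes = to_int8(query)
    # Stage 1 (approximate): rank everything by integer distance on the codes.
    coarse = sorted(range(N), key=lambda i: sq_l2(q_codes, codes[i]))[: k * oversample]
    # Stage 2 (refinement): re-score the shortlist with full-precision floats.
    return sorted(coarse, key=lambda i: sq_l2(query, vectors[i]))[:k]


print(search(vectors[42]))
```

The oversample factor controls the trade-off: a longer shortlist costs more float math in the second stage but makes it less likely that a true neighbor is dropped in the first.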
Where Quantization Lives in Modern AI
Quantization isn't just a vector database trick. It's become essential across the AI stack:
Vector Search & Retrieval
Tools like FAISS, Qdrant, and Milvus support quantization as a standard feature. When you're searching through millions of embeddings from OpenAI or Cohere, those indexes are often quantized to keep memory usage manageable and queries fast.
Large Language Models
You've probably encountered quantized LLMs without realizing it:
- Llama 3.1 8B: Original model ~16GB (16-bit weights), 4-bit quantized versions roughly 4-5GB
- Hosted models: Many inference providers serve quantized weights to handle more requests per GPU
- Local AI tools: Apps like Ollama and LM Studio rely heavily on quantized models to run LLMs on consumer hardware
The same idea of mapping high-precision floats to int8 (or even int4) that works for embeddings also works for the billions of parameters in transformer models.
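The model sizes quoted above follow from parameter count times bits per parameter. The sketch below assumes a nominal 8 billion parameters and counts weights only; real quantized files also store scales and other metadata, so they come out somewhat larger.

```python
def weights_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_8B = 8e9  # assumption: nominal count for an "8B" model

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {weights_size_gb(PARAMS_8B, bits):.0f} GB")
```

On these assumptions, the ~16GB figure corresponds to 16-bit weights and the ~4GB figure to 4-bit weights.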
Edge & Mobile AI
Quantization makes the difference between "AI that runs in the cloud" and "AI that runs in your pocket." On-device features such as Siri on an iPhone or real-time translation on a Google Pixel typically rely on quantized neural networks.
Why This Matters
For large-scale AI systems:
- Smaller models & indexes → fit more in RAM/SSD
- Faster distance calculations → lower query latency
- Often negligible quality loss → still returns the right neighbors
- Democratized access → powerful models on consumer hardware
That's why quantization has become standard across the AI ecosystem. It's an old idea from quantum physics, now essential for modern machine learning at scale.
Conclusion
Here's the uncomfortable truth about modern AI: the algorithms powering your "intelligent" systems are running on deliberately degraded data.
Every embedding lookup in your RAG pipeline, every parameter in your local Llama model, every neural network inference on your phone: they're all using approximations. Numbers that should be precise are rounded to the nearest available bucket. Continuous mathematics becomes digital steps.
And it works better than the alternatives.
This isn't a compromise we've grudgingly accepted. It's one of the breakthroughs that made large-scale AI practical. Without quantization, serving the largest models would take several times more memory and hardware. Your vector database would collapse under its own weight. Much of AI would still be trapped in research labs.
Planck discovered that nature itself is quantized: energy comes in discrete packets, not smooth curves. A century later, we've learned that artificial intelligence works the same way. Sometimes the most powerful insights come from throwing away information you thought you needed.
The future of AI isn't about perfect precision. It's about finding the right level of imperfection that unlocks everything else.