Cluster Labels Are a Contract

When a clustering taxonomy is being worked against by humans, the labels are a contract. predict() can't preserve that. Notes from a production item classifier I built at Gibson Consulting, and the four things shippable clustering actually needs.

Jun 9, 2026·15 min read

The third post in a rustcluster trilogy. The first was about the eleven-week attempt to outrun BLAS. The second was about cutting peak Python memory in half. This one is about the part of clustering that nobody writes about because it isn't fast and it isn't clever. It's the part that survives contact with humans.

A client onboarding looked like this.

A procurement team at a manufacturing company would send us two years of historical spend data: roughly half a million line items, each a short product description, a supplier name, a sometimes-applied commodity code, sometimes a unit of measure, sometimes a price, sometimes nothing else. We would normalize, embed, and cluster the corpus into a working taxonomy: a few thousand clusters across four tiers, from broad sector down to item-class. Our consultants would walk through the taxonomy and apply real, human-readable labels. "Lubricating oils, hydraulic, ISO VG 46." "Industrial fasteners, hex bolts, sub-3-inch." "Specialty resins, polyamide grade 6/6." Then they would do their actual job, which was finding consolidation opportunities, supplier rationalizations, contract gaps, the things procurement consultancies sell.

Their analysis was anchored to those labels. The dashboard the client logged into showed those labels. The savings narrative the engagement defended depended on those labels staying put.

One month later, the next drop would arrive. Forty thousand more line items. New products the client had started ordering. New suppliers. Mistakes in the new descriptions that the old descriptions didn't have.

We had to add the new items to the taxonomy.

We could not move the taxonomy underneath them.

That second sentence is the one this post is about. It is, in my experience, the single most underappreciated requirement in production clustering, and it is the reason the slotting subsystem in rustcluster exists.

What `predict()` can't do

The instinct of every ML engineer encountering this problem is the same: train once, call predict() for new data. scikit-learn's API has trained us to think of clustering as a fit-then-predict object, the same shape as a classifier. The cluster centers are fixed once fit() is done. New points get assigned to the nearest center.

This is correct, as far as it goes. predict() does exactly what it says. The problem is that doing only what predict() does is not sufficient for a production deployment.

There are four problems, and they compound.

The fitted model isn't a deployable artifact. A scikit-learn KMeans after fit() is a Python object with NumPy arrays, depending on a specific version of NumPy, scikit-learn, and Python. The default "save" mechanism is pickle (directly or through joblib), which is not portable across versions and is not safe to load from untrusted sources. To deploy this clusterer, the entire library stack has to come with it. Versioning means versioning the environment.

There is no honest way to say "I don't know." predict() always returns a label. Every point is assigned to its nearest cluster, regardless of how far away that cluster is. In a procurement taxonomy where the corpus is genuinely heterogeneous (some items are well-covered by clusters; some are weird outliers the model has never seen), the right answer for a meaningful fraction of inputs is "this doesn't belong to any of these clusters." predict() cannot express that.
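To make that concrete, here is a minimal sketch of nearest-centroid assignment in plain Python (so it runs without dependencies). It is not scikit-learn's or rustcluster's code; the centroids and points are made up for illustration.

```python
import math

def nearest_centroid(point, centroids):
    """Return (label, distance) of the closest centroid, as a KMeans-style predict does."""
    distances = [math.dist(point, c) for c in centroids]
    label = min(range(len(centroids)), key=distances.__getitem__)
    return label, distances[label]

# Two toy centroids standing in for fitted cluster centers.
centroids = [(0.0, 0.0), (10.0, 0.0)]

near_label, near_dist = nearest_centroid((0.5, 0.2), centroids)
far_label, far_dist = nearest_centroid((500.0, 900.0), centroids)

# Both calls return a valid label. Only the distance, which a bare predict()
# does not return, reveals that the second point is nowhere near the training data.
print(near_label, round(near_dist, 2))
print(far_label, round(far_dist, 2))
```

The second point is more than a thousand units from either centroid and still comes back with a label.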

There is no signal for when to retrain. If the world shifts (the client starts buying new categories of products, a new business unit comes online, the descriptions change shape because a new ERP system rolls out), predict() keeps cheerfully assigning new points to old clusters. The model degrades silently. The only way to know is to look at the assignments and notice they look wrong, which is an after-the-fact discovery, usually made by a downstream user whose Slack message starts with "hey, quick question."

The hierarchy is on the side. Real taxonomies nest. Our four-tier hierarchy went from broad sector to item-class. A predict() against a flat clustering returns a single label. To get a multi-level answer, you fit multiple separate clusterings and write glue code that walks them in order. That glue code lives in the consultancy's repo, not the library, and it's bespoke per engagement.

None of these are exotic problems. They're what every team running clustering in production hits in their first quarter. The standard solution is "we'll just build that ourselves," and most teams do. But the result is that every team's slotting layer is slightly different, slightly under-tested, and slightly entangled with the rest of their pipeline.

The argument of the rest of this post is that this should be a library concern, and that it should be designed around a single object that addresses all four problems at once: the snapshot.

The snapshot is the artifact

from rustcluster import KMeans, ClusterSnapshot
 
model = KMeans(n_clusters=2_400).fit(X_historical)
snapshot = model.snapshot()
snapshot.save("taxonomy_v1/")

taxonomy_v1/ is a directory. Inside it: a safetensors file with the centroids, a JSON file with the metadata. For a 2,400-cluster, 128-dimensional snapshot that's on the order of a few megabytes. For a 50-cluster, 128-dimensional snapshot it's about 50 KB.
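Those sizes are easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a dense centroid matrix and ignores file headers and the JSON metadata; the dtype is my assumption, since the figures above don't state it.

```python
def centroid_bytes(n_clusters, n_dims, bytes_per_value):
    """Raw size of a dense centroid matrix, ignoring file headers and metadata."""
    return n_clusters * n_dims * bytes_per_value

# Compare 4-byte and 8-byte floats for the two snapshot sizes mentioned above.
for n_clusters in (50, 2_400):
    for name, width in (("float32", 4), ("float64", 8)):
        size = centroid_bytes(n_clusters, 128, width)
        print(f"{n_clusters:>5} clusters, {name}: {size / 1024:,.0f} KiB")
```

The 50 KB figure lines up with 8-byte values (50 x 128 x 8 = 51,200 bytes); at the same width, 2,400 clusters come to roughly 2.4 MB.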

This is the thing the consultancy actually ships. Not the trained model. Not the Python environment. A directory of two files that, given the rustcluster library at any version >=0.6, will assign new data to the exact same clusters the consultants labeled.

The version compatibility matters more than it sounds. The v2 reader (v2 adds calibration data; we'll get to that) is backward-compatible with the original v1 format: a v1 snapshot loads cleanly, with calibration-dependent fields read as None. The consultancy can upgrade the library without invalidating engagements in flight. The client's dashboard does not blink.

In a typical Gibson engagement, every snapshot got a version number tied to the engagement's analysis cycle. taxonomy_v1 was the initial fit. taxonomy_v2 was the next-quarter refit, after enough drift had accumulated and consultants had a labeling cycle scheduled. Each version sat in a git-tracked artifact store. Each one was, in principle, exactly reproducible from the snapshot file plus the rustcluster version on disk. That's a contract with the future you that has to reproduce the result for an auditor or a client question.

# In some monthly job, against the existing engagement's taxonomy
from rustcluster import ClusterSnapshot

snapshot = ClusterSnapshot.load("taxonomies/client_acme/v1/")
labels = snapshot.assign(X_monthly_drop)

The labels are stable across months because the snapshot is stable across months. predict() implicitly conflates the trained model with the deployable model; here the deployable model is a single deliberate object you can name, version, ship, and defend.
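The idea fits in a few lines. This is a toy sketch of the pattern, not rustcluster's implementation: it writes both files as JSON (the real snapshot uses safetensors for the centroids), and the function names, file names, and metadata keys are all hypothetical.

```python
import json
import math
import tempfile
from pathlib import Path

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def save_snapshot(directory, centroids, metadata):
    """Write centroids and metadata as two plain files in one directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "centroids.json").write_text(json.dumps(centroids))
    (directory / "metadata.json").write_text(json.dumps(metadata))

def load_snapshot(directory):
    """Read the two files back; no model object or pickle involved."""
    directory = Path(directory)
    centroids = json.loads((directory / "centroids.json").read_text())
    metadata = json.loads((directory / "metadata.json").read_text())
    return centroids, metadata

# Stand-ins for a fitted model's centroids and a later monthly drop.
fitted_centroids = [[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]]
monthly_drop = [[0.3, -0.2], [4.6, 5.1], [8.8, 0.7], [6.0, 4.0]]

with tempfile.TemporaryDirectory() as tmp:
    save_snapshot(Path(tmp) / "taxonomy_v1", fitted_centroids,
                  {"format_version": 1, "n_clusters": 3, "n_dims": 2})
    loaded_centroids, loaded_meta = load_snapshot(Path(tmp) / "taxonomy_v1")

labels_in_memory = assign(monthly_drop, fitted_centroids)
labels_from_disk = assign(monthly_drop, loaded_centroids)
assert labels_in_memory == labels_from_disk  # the round trip must not move any label
print(labels_from_disk)
```

The final assertion is the part worth keeping in any real pipeline: a parity check that the artifact on disk reproduces the in-memory assignments exactly.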

Calibrated rejection: not every point belongs

In any real taxonomy, some clusters are tight and some are diffuse.

The tight ones are the ones the model figured out cleanly. "Lubricating oils, hydraulic, ISO VG 46" might have ninety items in it, all variations on the same handful of supplier SKUs, all close to the centroid. The diffuse ones are the ones the model could only half-figure out. "Industrial chemicals, miscellaneous" might have two hundred items in it from a dozen distinct sub-categories, far from the centroid, far from each other.

When new items arrive, the appropriate distance for rejection is different in each cluster. A new "lubricating oil" item that lands at twice the distance its cluster's training items typically sat from the centroid is suspicious; it probably isn't lubricating oil at all and got misclassified. A new "industrial chemicals, miscellaneous" item that lands at that same absolute distance is unremarkable; that cluster's own training items routinely sit farther out than that.

A single global rejection threshold cannot handle both cases. Set it tight and you reject everything from diffuse clusters, including items that belong there. Set it loose and you accept obvious mistakes from the tight clusters. The procurement teams using the taxonomy notice both failures immediately, in different ways. Tight rejection looks like the taxonomy is too brittle; loose rejection looks like the taxonomy has been polluted.

The fix is per-cluster thresholds. The snapshot has a calibration step that collects them.

snapshot.calibrate(X_train)
 
result = snapshot.assign_with_scores(
    X_new,
    adaptive_threshold=True,
    adaptive_percentile="p10",
)
result.labels_       # -1 for rejected points
result.confidences_  # [0, 1)
result.rejected_     # bool mask

calibrate() runs the training data through the snapshot one more time, collects per-cluster confidence distributions (the quantiles of the assignment confidence for points that landed in each cluster), and stores them inside the snapshot. From that point forward, assign_with_scores() with adaptive_threshold=True will reject new points that score below the 10th percentile (or whatever percentile you ask for) of their assigned cluster's training distribution.

The tight clusters get tight thresholds automatically. The diffuse clusters get diffuse thresholds automatically. Nobody had to hand-tune anything. The rejection rate ends up at something like 5-10% on a well-trained taxonomy applied to a same-distribution monthly drop, which is about what a 10th-percentile cutoff should produce; it separates the items that need a human look from the items that can flow through automatically.
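Here is a small sketch of why per-cluster cutoffs behave differently from a global one. It is a simplification, not the library's code: it thresholds on raw distance (rejecting beyond each cluster's 90th-percentile training distance) rather than on the confidence score that calibrate() collects, and all names and data are illustrative.

```python
import math
import statistics

def nearest(point, centroids):
    """Index of the closest centroid and the distance to it."""
    dists = [math.dist(point, c) for c in centroids]
    k = min(range(len(centroids)), key=dists.__getitem__)
    return k, dists[k]

def calibrate(train_points, centroids):
    """Per-cluster cutoff: the 90th percentile of training distances in that cluster."""
    per_cluster = [[] for _ in centroids]
    for p in train_points:
        k, d = nearest(p, centroids)
        per_cluster[k].append(d)
    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    return [statistics.quantiles(ds, n=10)[-1] for ds in per_cluster]

def assign_or_reject(point, centroids, thresholds):
    """Nearest-centroid label, or -1 when the point is beyond that cluster's cutoff."""
    k, d = nearest(point, centroids)
    return k if d <= thresholds[k] else -1

centroids = [(0.0, 0.0), (100.0, 0.0)]
tight = [(r / 10, 0.0) for r in range(1, 11)]       # training items within 1.0 of centroid 0
diffuse = [(100.0 + r, 0.0) for r in range(1, 11)]  # training items within 10.0 of centroid 1
thresholds = calibrate(tight + diffuse, centroids)

new_items = [(3.0, 0.0), (103.0, 0.0)]  # each is 3 units from its nearest centroid

per_cluster = [assign_or_reject(p, centroids, thresholds) for p in new_items]
global_tight = [assign_or_reject(p, centroids, [thresholds[0]] * 2) for p in new_items]
global_loose = [assign_or_reject(p, centroids, [thresholds[1]] * 2) for p in new_items]
print(per_cluster, global_tight, global_loose)
```

With per-cluster cutoffs, the item near the tight cluster is rejected and the item near the diffuse cluster is accepted. The tight global cutoff rejects both; the loose global cutoff accepts both.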

For procurement workflows this is the thing that takes the engagement from "we re-cluster every month because we can't trust the assignments" to "we accept ~92% of the monthly drop automatically and surface the rest to a consultant." The 8% is what an analyst should be looking at anyway. The 92% is what the analyst should be free to ignore.

There's a more aggressive option for cluster shape called Mahalanobis boundary mode. Calibration also collects per-cluster diagonal variances. When you ask for boundary_mode="mahalanobis", the assignment metric becomes the diagonal Mahalanobis distance, which gives elongated clusters appropriate width along their elongated axis. For procurement taxonomies this matters in clusters like "consumables, varied," where the items genuinely span a wide range along one or two axes but are tight along others. Voronoi assignment over-rejects them. Mahalanobis assignment doesn't.
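A minimal sketch of the difference, assuming per-cluster variances have already been estimated. The cluster values are invented, and the function is a textbook diagonal Mahalanobis distance rather than rustcluster's implementation. Note that a diagonal form only helps when the elongation is roughly axis-aligned; it cannot represent a rotated cluster.

```python
import math

def diagonal_mahalanobis(point, centroid, variances):
    """Distance with each axis scaled by that cluster's variance along it."""
    return math.sqrt(
        sum((x - c) ** 2 / v for x, c, v in zip(point, centroid, variances))
    )

clusters = [
    {"centroid": (0.0, 0.0), "variances": (25.0, 0.25)},  # elongated along x
    {"centroid": (6.0, 0.0), "variances": (0.25, 0.25)},  # compact
]
item = (4.0, 0.0)  # well inside the elongated cluster's spread along x

# Plain nearest-centroid (Voronoi) assignment ignores cluster shape.
euclid_label = min(
    range(len(clusters)),
    key=lambda k: math.dist(item, clusters[k]["centroid"]),
)
# Diagonal Mahalanobis measures distance in units of each cluster's own spread.
mahal_label = min(
    range(len(clusters)),
    key=lambda k: diagonal_mahalanobis(
        item, clusters[k]["centroid"], clusters[k]["variances"]
    ),
)
print(euclid_label, mahal_label)
```

Euclidean distance hands the item to the compact cluster because its centroid is closer in absolute terms; scaled by variance, the item is 0.8 units from the elongated cluster and 4 units from the compact one.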

I don't recommend Mahalanobis as the default. The variances need to be well-estimated, which requires enough training points per cluster, which isn't always the case in a long-tail taxonomy. But when it applies it works cleanly.

Drift detection: when to stop adding and start rebuilding

Slotting is an additive operation. New items go into existing clusters. The taxonomy doesn't change.

This is the right default. But over enough time, the assumption that the taxonomy still describes the corpus breaks down. The client opens a new business unit that buys things the historical data never contained. The supplier landscape shifts; the descriptions change shape. The categories that were once 80% of spend collapse to 30%, and the categories that were once 5% explode.

At some point, the right move is to retrain. Add a tier, split a cluster, retire a dead one. The question is when.

Calendar-based retraining (every quarter, every year) is wrong in both directions. Too often is wasteful; the labels move when they didn't need to, the engagement work churns. Too rare is dangerous; the model is silently drifting and the assignments are silently degrading.

The right cadence is data-driven. The snapshot has a drift_report() method that produces it.

report = snapshot.drift_report(X_recent)
 
report.global_mean_distance_   # vs training baseline
report.relative_drift_         # per-cluster
report.rejection_rate_         # how many points failed adaptive thresholds
report.kappa_drift_            # vMF concentration shift (spherical clusters)
report.direction_drift_        # centroid direction shift (spherical clusters)

The interesting fields are the per-cluster ones. relative_drift_[i] is (new_mean_distance[i] - fit_mean_distance[i]) / fit_mean_distance[i]. A positive value tells you that cluster i is being assigned points that are systematically farther from its centroid than the training data was. A single number per cluster. Easy to monitor over time. The cluster that's drifting hardest is the cluster you should look at first.
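The formula is simple enough to reproduce by hand. This sketch recomputes it on toy data in plain Python; the helper names are mine, and it assumes "mean distance" means the mean Euclidean distance from each point to its assigned centroid.

```python
import math
from statistics import fmean

def mean_distance_per_cluster(points, centroids):
    """Mean distance to the assigned (nearest) centroid, one value per cluster."""
    buckets = [[] for _ in centroids]
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        k = min(range(len(centroids)), key=dists.__getitem__)
        buckets[k].append(dists[k])
    return [fmean(b) if b else float("nan") for b in buckets]

def relative_drift(fit_mean, new_mean):
    """(new_mean_distance - fit_mean_distance) / fit_mean_distance, per cluster."""
    return [(n - f) / f for f, n in zip(fit_mean, new_mean)]

centroids = [(0.0, 0.0), (10.0, 0.0)]
training = [(1, 0), (-1, 0), (0, 1), (0, -1),      # cluster 0, all at distance 1
            (11, 0), (9, 0), (10, 1), (10, -1)]    # cluster 1, all at distance 1
recent = [(1, 0), (-1, 0), (0, 1), (0, -1),        # cluster 0 unchanged
          (12, 0), (8, 0), (10, 2), (10, -2)]      # cluster 1 now at distance 2

fit_mean = mean_distance_per_cluster(training, centroids)
new_mean = mean_distance_per_cluster(recent, centroids)
drift = relative_drift(fit_mean, new_mean)
worst = max(range(len(drift)), key=drift.__getitem__)
print(drift, worst)
```

Cluster 0 reports no drift; cluster 1 reports a relative drift of 1.0, meaning its new points sit twice as far out as the training points did.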

For spherical clusters (the embedding case, where you've L2-normalized and you assign by cosine similarity), the natural drift measures are different. kappa_drift_ measures the shift in the cluster's von Mises-Fisher concentration parameter; a falling kappa means the new points are spreading out from the centroid direction. direction_drift_ measures how much the centroid direction itself has shifted, computed as 1 - cos(old_centroid, new_mean_direction). These two together tell you whether the cluster is dilating, rotating, or both.
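Both quantities can be sketched from the mean resultant vector of the unit-normalized points. The kappa estimate below uses the Banerjee et al. (2005) closed-form approximation, which is one common choice; I am not claiming it is the estimator the library uses, and defining the kappa shift as a relative change is likewise an assumption for illustration.

```python
import math

def unit(angle):
    """2-D unit vector at the given angle in radians."""
    return (math.cos(angle), math.sin(angle))

def mean_resultant(unit_vectors):
    """Return (unit mean direction, mean resultant length r_bar)."""
    dim = len(unit_vectors[0])
    mean = [sum(v[i] for v in unit_vectors) / len(unit_vectors) for i in range(dim)]
    r_bar = math.hypot(*mean)
    return [m / r_bar for m in mean], r_bar

def estimate_kappa(r_bar, dim):
    """Approximate vMF concentration; undefined when r_bar == 1 (identical points)."""
    return r_bar * (dim - r_bar ** 2) / (1.0 - r_bar ** 2)

def direction_drift(old_direction, new_direction):
    """1 - cosine similarity; both inputs are assumed to be unit vectors."""
    return 1.0 - sum(a * b for a, b in zip(old_direction, new_direction))

training = [unit(a) for a in (-0.10, -0.05, 0.05, 0.10)]  # tight, centered on angle 0
recent = [unit(a) for a in (-0.1, 0.1, 0.5, 0.7)]         # wider, centered on angle 0.3

fit_dir, fit_r = mean_resultant(training)
new_dir, new_r = mean_resultant(recent)

kappa_fit = estimate_kappa(fit_r, dim=2)
kappa_new = estimate_kappa(new_r, dim=2)
kappa_change = (kappa_new - kappa_fit) / kappa_fit  # negative: the cluster is dilating
rotation = direction_drift(fit_dir, new_dir)        # positive: the cluster is rotating
print(round(kappa_fit, 1), round(kappa_new, 1), round(rotation, 4))
```

In this toy case both signals fire: the concentration falls sharply and the mean direction moves by 0.3 radians, so the cluster is dilating and rotating at once.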

The Gibson cadence ended up looking like this. Every monthly drop produced a drift report against the current taxonomy. The report went into a small dashboard. When relative_drift_ exceeded a threshold for a meaningful fraction of clusters (the threshold tuned per engagement; usually somewhere around 0.3 for a procurement workload), we'd schedule a relabeling cycle. When it didn't, we'd let the taxonomy keep going and just slot the new data.
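As a sketch, that rule is a few lines. The 0.3 drift threshold comes from the paragraph above; the 20% cluster fraction is a placeholder I made up to stand in for "a meaningful fraction," and the function name is hypothetical.

```python
def should_relabel(relative_drift, drift_threshold=0.3, cluster_fraction=0.2):
    """True when at least cluster_fraction of clusters drift beyond drift_threshold."""
    drifting = sum(1 for d in relative_drift if d > drift_threshold)
    return drifting / len(relative_drift) >= cluster_fraction

# Per-cluster relative drift from two hypothetical monthly reports.
quiet_month = [0.05, 0.10, 0.02, 0.40, 0.01, 0.00, 0.08, 0.03, 0.02, 0.06]
noisy_month = [0.50, 0.45, 0.10, 0.35, 0.02, 0.00, 0.08, 0.03, 0.02, 0.06]

print(should_relabel(quiet_month))  # one drifting cluster out of ten
print(should_relabel(noisy_month))  # three drifting clusters out of ten
```

Both knobs are worth tuning per engagement; the point is only that the trigger is a function of the drift report, not of the calendar.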

The clients liked this. They could see, before we asked them to approve a relabeling, the data that justified it. The relabeling stopped being a calendar event and started being a measured response to something the world was doing.

Hierarchical slotting

A four-tier taxonomy doesn't reduce to a flat clustering. The reasons are partly structural and partly practical.

Structurally, the right number of clusters at the top of a procurement taxonomy is small (maybe twenty broad categories), and the right number of clusters at the bottom is large (a few thousand item-classes). A flat clustering at the bottom is too granular to be useful at the top. A flat clustering at the top is too coarse to be useful at the bottom. The taxonomy has to be both at once, in a tree.

Practically, the labeling workflow is hierarchical. A consultant labels the top tier first, then walks down. Re-labeling a sub-tier doesn't require touching the broader tier above it; the broader tier was decided once and now stays put. Labels at higher tiers are stickier than labels at lower tiers; the hierarchy reflects that.

rustcluster's HierarchicalSnapshot ships two levels. A root snapshot does the broad assignment; per-cluster child snapshots refine within each broad cluster's slice. You call assign() once and get both labels back.

from rustcluster.experimental import HierarchicalSnapshot
 
hier = HierarchicalSnapshot.build(X_train, root_model, n_sub_clusters=10)
result = hier.assign(X_new)
hier.save("hierarchy/")

Two levels is the v1 of the library. Four-tier production taxonomies compose by stacking: a HierarchicalSnapshot at the top, then a HierarchicalSnapshot per leaf of that, and so on. The pattern works; it's just more bookkeeping than the library makes you do today. The next iteration of the library should generalize to arbitrary depth, with a single object holding the whole tree. The Gibson taxonomy is the use case driving that work.
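To show the shape of the arbitrary-depth version, here is a toy recursive tree in plain Python. It is a sketch of the idea only: TaxonomyNode and assign_path are hypothetical names, not part of rustcluster, and the sketch does plain nearest-centroid assignment with no calibration.

```python
import math
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One tier of the tree: centroids here, optional child node per centroid."""
    centroids: list
    children: dict = field(default_factory=dict)  # cluster index -> TaxonomyNode

    def assign_path(self, point):
        """Walk down the tree, returning one label per tier visited."""
        k = min(range(len(self.centroids)),
                key=lambda i: math.dist(point, self.centroids[i]))
        child = self.children.get(k)
        return [k] + (child.assign_path(point) if child else [])

# Three tiers under the first top-level cluster, a single tier under the second.
leaf = TaxonomyNode(centroids=[(1.0, 1.0), (1.0, 3.0)])
mid = TaxonomyNode(centroids=[(1.0, 2.0), (4.0, 2.0)], children={0: leaf})
root = TaxonomyNode(centroids=[(2.0, 2.0), (50.0, 50.0)], children={0: mid})

deep_path = root.assign_path((0.9, 2.8))
shallow_path = root.assign_path((49.0, 51.0))
print(deep_path, shallow_path)
```

A single object holding the whole tree means one save, one load, and one assign call, regardless of how many tiers a given branch has.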

The reason hierarchical assignment matters in production isn't just that it returns multiple labels. It's that the rejection logic is also hierarchical. A new item that's well-explained at the top tier but doesn't fit any sub-cluster gets an honest "top-tier yes, sub-tier rejected" answer. The consultant sees it land in the right broad bucket but flagged as needing sub-classification. That's the right human-in-the-loop hand-off.

A flat clustering with a single rejection threshold can't produce that signal. You get an assignment or a rejection. The middle case ("I know what this kind of thing is, I just don't know which specific kind") is exactly the case humans should be looking at, and it's exactly the case the flat-rejection model loses.
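A two-level sketch of that middle case, in plain Python. The distance cutoffs are hard-coded here where a real system would take them from calibration, and none of the names are rustcluster's API.

```python
import math

def assign_or_reject(point, centroids, thresholds):
    """Nearest-centroid label, or -1 when the point is beyond that cluster's cutoff."""
    dists = [math.dist(point, c) for c in centroids]
    k = min(range(len(centroids)), key=dists.__getitem__)
    return k if dists[k] <= thresholds[k] else -1

def hierarchical_assign(point, root, children):
    """Return (top_label, sub_label); either may be -1, and sub is -1 if top is."""
    top = assign_or_reject(point, *root)
    if top == -1:
        return (-1, -1)
    return (top, assign_or_reject(point, *children[top]))

# (centroids, thresholds) pairs: a broad root tier and tighter child tiers.
root = ([(0.0, 0.0), (100.0, 0.0)], [20.0, 20.0])
children = {
    0: ([(-5.0, 0.0), (5.0, 0.0)], [2.0, 2.0]),
    1: ([(95.0, 0.0), (105.0, 0.0)], [2.0, 2.0]),
}

fully_placed = hierarchical_assign((5.5, 0.5), root, children)
needs_subclass = hierarchical_assign((0.0, 9.0), root, children)
unknown = hierarchical_assign((50.0, 50.0), root, children)
print(fully_placed, needs_subclass, unknown)
```

The second result is the interesting one: a valid top-tier label paired with a rejected sub-tier label, which is exactly the item a consultant should be shown.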

The production posture

If I had to compress this whole essay into one observation, it would be that predict() is a function and what production clustering needs is a posture.

The posture has four ingredients. A snapshot you can ship, version, and defend, separate from the trained model and the library stack. A calibration step that turns the snapshot into something with per-cluster thresholds, so rejection is honest. A drift report that tells you when to retrain instead of running on a calendar. A hierarchy, because real taxonomies aren't flat.

None of these are novel. Adaptive thresholds are well-understood in the signal-processing literature. Mahalanobis distance is from 1936. von Mises-Fisher distributions are decades old. Drift monitoring is the bread and butter of every production ML observability tool. Hierarchical clustering is in scikit-learn. The thing that's missing is a single object that holds all four ingredients and exposes them as one coherent API. That's the snapshot.

The reason it matters has nothing to do with the library and everything to do with the people on the receiving end. A procurement engagement is a relationship with consultants in the field, doing real analysis, defending real savings numbers, against labels they applied with care. Those labels are a contract. Cluster boundaries are an implementation detail. The contract is what the dashboard shows. When the implementation detail moves, the contract breaks, the dashboard changes, the consultant's analysis is silently invalidated, and the engagement loses credibility one cluster at a time.

predict() does not honor that contract. The snapshot does, by being designed to.

This is what I think most clustering libraries (and most teams using them) have backwards. They think of predict() as the production interface and slotting as an optimization. The opposite is true. predict() is a debugging convenience. Slotting is the interface. The model in memory is what you used to build the snapshot. The snapshot is what you actually ship.

If you take one thing from this essay: when you fit a clustering model that humans are going to label and work against, treat the snapshot as the durable artifact and the trained model as scaffolding. Use predict() to test that the snapshot reproduces the in-memory model's behavior on the same data. Then put the trained model away and run production off the snapshot from then on. The engagement will be more stable, the relabeling cycle will be data-driven, and the consultants will trust the system.

That, in the end, is what production clustering is for.

Thanks

To the consulting teams I worked with at Gibson Consulting, who taught me what taxonomy stability actually means in the field, and what happens when you don't have it. The four-tier item taxonomies they built were the use case that made every architectural decision in the rustcluster snapshot subsystem concrete.

To the rustcluster early users who hit the assignment-rejection problem in their own production workflows and reported it instead of routing around it. The calibration API exists in the shape it does because of that feedback.

The library is at github.com/mfbaig35r/rustcluster, v0.7.0 and later. The slotting subsystem (snapshot, calibrate, drift_report, hierarchical) is in the standard rustcluster namespace and rustcluster.experimental. The two earlier essays in this trilogy: The Wall on the eleven-week attempt to outrun BLAS, and How rustcluster Learned to Live in 12 GB on cutting peak Python memory in half on Databricks.
