← Essays

How rustcluster Learned to Live in 12 GB

A previous essay explained where your Databricks node's 28 GB goes. This one's about what eats the 12 GB that's left, and four small library features that bring the bill down from 8-9 GB to 1.5-3 GB on the canonical embedding-clustering workload.

·16 min read

A previous essay explained the 28 GB. This is about the 12 GB that's left.

The first time the notebook OOMed I assumed I'd done something dumb. The second time I went and read the Spark UI. The third time, after restarting the cluster and watching the driver's memory climb past 9 GB during a PCA fit, I sat with it.

The workload was small by ML standards. 312,000 supplier embeddings, 1536 dimensions each, from text-embedding-3-small. Project to 128 dimensions, then cluster into roughly fifty commodity buckets. The math is unremarkable. So is the data: about 1.9 GB of f32 floats. On a 28 GB Databricks driver, "I need to project two gigabytes of data into a smaller space" should not be a hard problem.

But it was. The previous essay walked through why: of the 28 GB the cluster advertises, the JVM takes ~14-16 GB and never gives it back, leaving maybe 12 GB for Python. The numbers are tight. They get tighter when you start doing real work.

This essay is about what that real work actually cost, where the bytes ended up, and the four small library features that turned the rule-of-thumb workarounds at the end of the last essay into something the library does for you.

The 12 GB before the work began

The 12 GB Python budget is gross, not net. Before I get to write code, the Python interpreter has taken ~500 MB. Pandas and pyspark add another ~300 MB once imported. Numpy adds a little more. Call it ~10 GB of usable headroom on a fresh driver.

The workflow has four stages:

  1. Pull 312K embeddings out of Spark.
  2. Fit PCA to learn the 1536 → 128 projection.
  3. Transform all 312K embeddings through the projection.
  4. Cluster the reduced embeddings.

Stage 4 was never the problem. Reduced data is 312K × 128 × 4 bytes = ~160 MB. Spherical K-means in that space runs in single-digit seconds and takes maybe 50 MB of model state.

The other three stages were the problem.

Here's the peak Python memory I measured for each, on the pipeline as it existed before v0.7.0:

StagePeak PythonWhat's eating it
df.toPandas() of embedding column~5 GBPython list-of-lists + numpy copy coexist briefly
numpy embeddings only, after del pdf["embedding"]~2 GBf32 array of 312K × 1536
PCA fit on a 50K sample+0.6 GBfaer centered matrix (f64, 50K × 1536)
PCA transform on all 312K, naive+3.8 GBfaer centered matrix (f64, 312K × 1536)
Reduced output (f64, 312K × 128)+0.3 GBthe actual deliverable

The total, naively, is over the budget. Two of those numbers don't co-occur, but Python's garbage collector is not synchronous and the JVM is sitting on its half-share, so the peak you actually hit during the transform stage is closer to 8-9 GB. Past 9 GB on a driver with ~10 GB of headroom is the OOM zone.

What I had to do, to make this work at all, was a manual choreography: extract embeddings carefully, fit on a sample, transform in a loop of 50K-row chunks, gc.collect() between every stage, watch the memory profile in the cluster UI. The pipeline worked. It also took three days to make work, and the brittleness was a tax on every subsequent run.

The library should do this.

Where the bytes actually went

Before adding features, I wanted to know which intervention would actually move the needle. So I instrumented the pipeline with tracemalloc and watched.

The biggest surprise: PCA wasn't the worst offender.

A naive df.toPandas() on a DataFrame with an array-of-floats embedding column doesn't give you a numpy array. It gives you a Pandas DataFrame whose embedding column is a Series of Python lists of Python floats. Each list is ~13 KB at d=1536 (the float objects alone, before the list overhead). 312,000 lists is roughly 4 GB of pure Python overhead, in addition to whatever numpy array you build out of it. The list-of-lists exists alongside the numpy array until the GC catches up. Briefly, you're paying for both.

Once I had the numpy array and the original column was gone, the f32 embeddings themselves were a friendly 1.9 GB. The cost of toPandas wasn't the data. It was the representation of the data.

The PCA stage was the second offender, but it was different in character. faer, the matrix library underneath rustcluster's PCA, operates on Mat<f64> by default. To project f32 embeddings through a PCA fitted on f32 embeddings, the old code built an f64 centered matrix: n × d × 8 bytes. For 312K rows at d=1536, that's 3.8 GB allocated, used for one matrix-multiply, then dropped. The library was upcasting the entire dataset to a more precise dtype than the input deserved.

Two distinct memory problems. Two distinct fixes.

Intervention 1: Stop letting Python touch the embeddings

If the cost of toPandas was the Python list-of-lists, the answer was to not let Python see lists.

Spark has an iterator API, toLocalIterator(), that yields rows one at a time and lets the JVM release each Arrow batch as Python consumes it. Combined with a pre-allocated numpy array that doubles its capacity on overflow, you can stream a DataFrame's embedding column into a numpy buffer without ever holding a Python list-of-lists in memory.

The new rustcluster.utils.extract_embeddings_from_spark does this:

from rustcluster.utils import extract_embeddings_from_spark
 
embeddings, metadata = extract_embeddings_from_spark(
    df,
    embedding_col="embedding",
    metadata_cols=["supplier_id", "commodity"],
    dtype=np.float32,
)

The helper returns a numpy array of shape (n, d) and a pandas DataFrame of the metadata columns, row-aligned with the embeddings. Embedding rows are written directly into a pre-allocated numpy buffer. The buffer starts at max(1024, d) rows and doubles on overflow. The Python heap never holds two copies; the JVM releases each Arrow batch as soon as the iterator has consumed it.

The first stage's peak Python memory dropped from ~5 GB to ~2 GB. Most of what's left is the embedding array itself, which you actually need.

Pyspark and pandas are lazy-imported (the helper is in rustcluster.utils, not the runtime path), so the rest of the library still works in environments where neither is installed.

Intervention 2: PCA's principal directions stabilize fast

PCA fit on the full 312K × 1536 dataset builds a centered matrix of 312K × 1536 × 8 bytes = 3.8 GB. That's the OOM trigger if you try to fit on everything.

You don't need to.

Randomized PCA's accuracy depends on the covariance structure of the data, not its row count. The covariance of 312K embeddings is statistically indistinguishable from the covariance of a 50K random sample once you're past ~5K rows. The principal directions you extract from the sample are the same ones you'd get from the full data, to single-precision rounding. The literature on this is several decades deep; Halko-Martinsson-Tropp (the algorithm rustcluster uses) explicitly handles the over-sampling regime.

EmbeddingReducer now has a fit_sample_size constructor parameter:

reducer = EmbeddingReducer(target_dim=128, fit_sample_size=50_000)
reducer.fit(embeddings)  # internally samples to 50K if n > 50K

If the input has more than fit_sample_size rows, the reducer samples (using random_state for reproducibility) before calling the underlying Rust fit. If the input is smaller, the parameter is a no-op.

This dropped the fit stage's peak memory from a potential 3.8 GB (had you tried to fit on the full data) to a guaranteed 600 MB. More importantly, it took the fit stage off the critical path of memory budgeting. You can pick any sample size you want; the math says 50K is enough.

A version of this was in the rules of thumb at the end of the previous essay: "Native extensions (Rust, C) allocate invisibly. faer's 3.8 GB matrix doesn't show up in Python profiling tools but still counts against your process limit." The intervention is the same intervention you'd have made manually. The difference is that now the library makes it for you, with a single parameter, reproducibly.

Intervention 3: Don't build the whole f64 centered matrix at once

Sample-fit fixed the fit stage. The transform stage still had to process all 312K rows.

A 312K × 1536 × 8-byte centered matrix is 3.8 GB. There is no way to keep the full matrix in working memory on a 12 GB Python heap and also leave room for everything else. The fix is straightforward: process the data in row-blocks, materializing one block's centered matrix at a time.

EmbeddingReducer.transform and fit_transform now take a chunk_size parameter:

X = reducer.transform(embeddings, chunk_size=50_000)

The helper loops, transforms a chunk at a time, and concatenates the results. The per-chunk centered matrix at chunk_size=50_000 is 600 MB. The previous chunk is freed before the next one allocates. Only one block is live at any moment.

The output is the same numpy array you'd have gotten from the unchunked path. The numerical result is byte-equivalent (the centered-matrix math is deterministic and we don't batch across chunks). The only thing chunking costs is a small amount of Python overhead per chunk and the time to vstack the results, both negligible compared to the matmul.

The Matryoshka reduction method skips chunking entirely. Matryoshka is a column slice (take the first 128 columns of a 1536-dim embedding from a model trained for it), not a matmul. The whole "process" is a sub-millisecond array view. There's no centered matrix to budget for, and the chunking overhead would dominate.

After Phase 1 (sample-fit) and Phase 2 (chunked transform), the workflow fits inside the 12 GB Python budget with room to spare. Most of the structural memory problems are solved. The remaining work is incremental.

Intervention 4: Embedding workloads don't need f64 precision

The 600 MB per-chunk centered matrix at f64 is the next thing to attack, and the obvious lever is dtype.

PCA done in f32 produces a projection that's numerically indistinguishable from PCA done in f64, for embedding data. The reason is upstream of the math. OpenAI's text-embedding-3 models, Cohere's embed-v3, Voyage, every Matryoshka-trained model, all of them produce f32 outputs natively. You're already at f32 precision when the data hits your code. Upcasting to f64 inside PCA isn't recovering information; it's just doubling the memory footprint of every intermediate matrix.

The previous behavior of EmbeddingReducer was to upcast f32 inputs to f64 at the PyO3 boundary, run PCA in f64, and return f64 outputs. Even when the input was unambiguously f32. The argument for it was simplicity: one code path, one canonical dtype.

The new behavior, in v0.7.0, is to dispatch on input dtype.

embeddings = embeddings.astype(np.float32)
X = reducer.fit_transform(embeddings, chunk_size=50_000)
# X.dtype == np.float32

If the input is f32, the centered matrix stays in f32 through the PCA hot path. The output is f32. If the input is f64, the centered matrix stays in f64. Output is f64.

The implementation work was less than I expected. The PCA algorithm itself doesn't change; faer supports Mat<f32> directly. The new functions (fit_pca_f32, transform_f32, compute_pca_f32, project_data_f32) mirror the existing f64 ones with the matrix type swapped. About 280 lines of Rust, no algorithm changes, no save format changes.

The savings are clean. The per-chunk centered matrix drops from 600 MB to 300 MB. The PCA fit centered matrix drops from 600 MB to 300 MB. The reduced output drops from 320 MB to 160 MB. Cumulatively, the f32 fast path saves about another 500 MB across the workflow, on top of what the chunking and sampling already bought.

One contract change worth flagging. EmbeddingReducer.transform(X_f32) used to return f64; it now returns f32. If you had downstream code that depended on the silent upcast (perhaps doing X.dot(weights_f64) and relying on the upcast to avoid a precision warning), you'll get a different result type now. The fix is to cast explicitly: reducer.transform(X).astype(np.float64). The contract change is documented in the v0.7.0 release notes. I'm mentioning it again here because release notes don't get read.

Where the 12 GB goes now

Stacked together, the four interventions look like this on the canonical 312K × 1536d → 128d workload:

Stagev0.6.x peakv0.7.0 peakMechanism
Extract embeddings from Spark~5 GB~2 GBArrow streaming, no Python list-of-lists
PCA fit on 50K sample+0.6 GB+0.3 GBf32 centered matrix
PCA transform, per chunk+3.8 GB+0.3 GBchunked + f32
Reduced output (312K × 128)+0.3 GB+0.16 GBf32
Total Python peak~8-9 GB~1.5-3 GB

The whole thing now runs on the 28 GB driver without manual chunking, without gc.collect(), without dropping the embedding column from a Pandas DataFrame to avoid the double-allocation peak. The canonical example from the rustcluster docs:

from rustcluster.utils import extract_embeddings_from_spark
from rustcluster.experimental import EmbeddingReducer, EmbeddingCluster
import numpy as np
 
embeddings, pdf = extract_embeddings_from_spark(
    df,
    embedding_col="embedding",
    metadata_cols=["supplier_id", "commodity"],
    dtype=np.float32,
)
 
reducer = EmbeddingReducer(target_dim=128, fit_sample_size=50_000)
X = reducer.fit_transform(embeddings, chunk_size=50_000)
 
model = EmbeddingCluster(n_clusters=50, reduction_dim=None).fit(X)

The pipeline that used to need three days of manual tuning now reads like a beginner tutorial. The library wears the discipline so the caller doesn't have to.

Why none of these is a single trick

The thing I want to point at is that no individual intervention was the win. The Arrow extraction saves 3 GB but only at the front of the pipeline. The sample-fit saves another 3 GB but only at the fit stage. The chunked transform caps a 3.8 GB allocation but doesn't reduce it. The f32 path halves what's left.

Each one addresses a specific stage of the data flow. None of them changes the underlying algorithms. None of them is novel. The randomized PCA paper is from 2011. Arrow's toLocalIterator has been around since Spark 2.x. Chunked transforms are a numerical-computing convention older than I am. f32 vs f64 dispatch is type-system table stakes for any library that takes dtypes seriously.

What changed is that all four are now wired into the library, with sensible defaults, in a way that composes without the caller having to think about it. The win is not the cleverness of any one piece. The win is the composition.

This is a useful pattern to notice. Memory-optimization work has two failure modes. One is the "novel allocator" pattern: write an arena, write a custom matrix layout, do something exciting and architecturally invasive. The other is the "just throw hardware at it" pattern: ask for a bigger driver, restart the cluster, kick the can. The pattern that worked here was the boring third option: profile the pipeline, find the stages where bytes were being moved or doubled unnecessarily, fix each one in isolation, ship them as composable parameters.

Each intervention is short. The sample-fit parameter is fifteen lines of Python. The chunked transform is thirty. The Arrow extraction is sixty lines of Python plus a pandas dependency. The f32 dispatch is the largest piece at about 280 lines of Rust plus a PyO3 boundary update. Total new code: under five hundred lines. Total savings: roughly 5-7 GB of peak Python memory on the canonical workload.

What we deliberately didn't do

There were larger refactors on the table that I declined to do, and one I think is worth saying out loud.

The EmbeddingReducer save format stays f64 even when the fit ran in f32. The components and mean are small (~750 KB at d=1536, target=128), and migrating the on-disk format would have meant bumping the format version, writing a v1-to-v2 reader path, and updating the related pickle format on EmbeddingCluster. The marginal win was about 375 KB on disk; the migration cost was hours of careful work. We kept the storage canonical.

In-place transform (computing the projection without materializing the centered matrix at all) and streaming PCA fit (computing the covariance incrementally over blocks) are both on the future-work list. Either one would shave another factor of 2-4× off peak memory. Neither is needed today; chunking + sample-fit + f32 already brings the workflow under the budget, and the deeper refactors would have been their own multi-week projects with their own risks. The 80/20 of the memory win was in the four small interventions.

A Spark UDF wrapper that runs clustering on executors instead of the driver would be useful for the case where the driver is small and the executors are large. We didn't build it. The cases where it would matter are also cases where you'd reach for a different tool entirely.

The general rule I tried to follow: ship the smallest intervention that meets the success criterion, defer the larger refactor until a user pulls on it. That's not always the right move (sometimes the larger refactor is genuinely smaller in the long run) but on this work, where every intervention had a clean stage-of-pipeline scope, it was the right call.

A note on knowing where the bytes go

The previous essay closed with eight rules of thumb for working inside Databricks's memory split. Three of those rules turned directly into the four interventions described here. "Process data in chunks when possible" became chunk_size. "Native extensions allocate invisibly" became sample-fit. "f32 vs f64 matters" became the dtype dispatch. "Python lists of floats use 7x more memory than numpy" became the Arrow extraction helper.

The interventions are not novel. The interventions are what every careful person has done by hand, every time, in this corner of the workflow. The work was making them composable defaults that you can opt into with a single parameter, on a library where you might have been doing the work by hand the week before.

The thing I keep coming back to, the longer I work in this space, is that most of the leverage in numerical Python is at the data-flow level, not the algorithm level. The PCA algorithm in rustcluster v0.7.0 is identical to the one in v0.6.0. The clustering algorithm hasn't changed in three releases. What changed is the path the data takes through the library. Where it lives, what dtype it's in, how big a chunk you touch at once, whether the JVM is holding a copy. Those are the bits you can move. The math, mostly, is the math.

That's the rule of thumb I'd add to the list. Optimize the data flow, not the algorithm. Most of the time, the math doesn't need help.

Thanks

To everyone who hit OOM in production and reported it instead of giving up. That feedback is the only reason this work happened at the priority it did.

To Sarah Quinones, again, for faer. The matrix library that supports both f64 and f32 with the same API is the reason the f32 fast path was a weekend's work and not a two-month refactor.

To the OpenAI text-embedding-3 team and the Matryoshka representation learning literature, for making it possible to do all this clustering at d=128 in the first place. The biggest memory win is the one you never had to make because you started with smaller embeddings.

The library is at github.com/mfbaig35r/rustcluster, v0.7.0 and later. The previous essay on Databricks memory budgets is at How Memory Actually Works on Databricks. The companion piece, on a different lesson in optimization humility, is at The Wall: Eleven Weeks Against Forty Years of GEMM.