Storytelling Through Feature Engineering: Lossy Compression for Language Models

Feature engineering didn't die. It became lossy compression for language models, turning thousands of raw rows into ten metrics that carry the same story.

I've been thinking a lot about compression lately.

Not the boring kind (gzip, ZIP files, that sort of thing). I mean the interesting kind: taking high-entropy operational data and turning it into a small, opinionated set of signals that an LLM can reason over consistently.

This clicked for me while helping a team feed supplier data into Claude for risk analysis. They had a beautifully detailed dataset: every purchase order, every line item, dates, quantities, prices. Tens of thousands of rows. And they were just… dumping it into the context window.

Token count: ~50,000. Cost: painful. Quality: surprisingly mediocre.

So we tried something different. We compressed each supplier down to a handful of metrics:

  • Price volatility (CV of unit price): 0.82
  • Order rhythm (median days between POs): 12
  • Price trend (last 90d vs prior 90d): −8%
  • Supplier concentration (category spend share): 0.71

Token count: ~20. And the analysis got better: sharper, more consistent, more actionable.
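For the curious, here is a minimal sketch of how those four numbers could be computed from raw purchase-order rows. The function name `supplier_snapshot`, the tuple layout, the key names, and the choice of population standard deviation are my assumptions for illustration, not the team's actual pipeline. It also assumes at least two orders overall and at least one order in each 90-day window.

```python
from datetime import date, timedelta
from statistics import mean, median, pstdev


def supplier_snapshot(orders, category_spend: float, as_of: date) -> dict:
    """Compress one supplier's orders into four metrics.

    orders: list of (order_date, unit_price, quantity) tuples (hypothetical layout).
    category_spend: total spend in the category across all suppliers.
    as_of: the date the snapshot is taken.
    """
    orders = sorted(orders)
    dates = [d for d, _, _ in orders]
    prices = [p for _, p, _ in orders]

    # Price volatility: coefficient of variation = std dev / mean.
    price_cv = pstdev(prices) / mean(prices)

    # Order rhythm: median gap in days between consecutive POs.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    order_rhythm = median(gaps)

    # Price trend: mean unit price in the last 90 days vs the 90 days before.
    recent_start = as_of - timedelta(days=90)
    prior_start = as_of - timedelta(days=180)
    recent = [p for d, p, _ in orders if recent_start < d <= as_of]
    prior = [p for d, p, _ in orders if prior_start < d <= recent_start]
    price_trend = mean(recent) / mean(prior) - 1

    # Concentration: this supplier's share of category spend.
    spend_share = sum(p * q for _, p, q in orders) / category_spend

    return {
        "price_cv": round(price_cv, 2),
        "median_days_between_pos": order_rhythm,
        "price_trend_90d": round(price_trend, 2),
        "category_spend_share": round(spend_share, 2),
    }
```

The point of the sketch is that every line of it is a decision about what to keep, which is exactly the decision you don't want the model making differently on each run.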

In the LLM era, compression isn't just optimization. It's cognition.

Compression before cognition

The thing nobody tells you about LLMs.

LLMs are powerful pattern recognition machines. But they're inconsistent at extracting the same patterns from raw tabular data without explicit structure, especially when the input is long, noisy, or order-dependent.

The problem isn't the model. The problem is we're asking it to do two jobs:

  1. Find the patterns in the data
  2. Reason about those patterns

It's pretty good at #2. It's unreliable at #1.

You can ask an LLM to summarize raw tables, but then you're letting the model decide what matters. Compression with intent flips that: you decide what matters once, encode it, and get repeatable reasoning downstream.

But here's the thing: you already know what patterns matter. You have domain knowledge. You know that supplier volatility is a risk signal. You know that concentration is dangerous. You know what story the data is trying to tell.

So why make the LLM figure it out?

Lossy compression with intent

This is where feature engineering comes back from the dead.

For the past few years, feature engineering has felt kind of… obsolete. Deep learning was supposed to learn features automatically. Manual feature engineering was for people still using XGBoost (no shade, I still use XGBoost).

But LLMs changed the game. Suddenly we're not training models on structured data; we're having conversations about it. And conversations have constraints:

  • Token budgets. Context windows are finite. Every raw transaction costs you.
  • Attention budgets. LLMs need to "attend to" everything in context. More stuff = more diffused attention.
  • Consistency requirements. You want the same analysis every time, not whatever the LLM happens to notice.

This is where compression comes in. But not just any compression: lossy compression with intent.

Let me break that down:

Compression. Many → Few. Take 100 transactions, create 1 number.

Lossy. Information is deliberately destroyed. You're not trying to preserve everything. ZIP files are lossless: you can reconstruct the original. Feature engineering is lossy: you can't get back to the raw transactions, and that's the point.

Intent. You choose what to destroy based on domain knowledge. Coefficient of variation says: "I care about volatility, not absolute values." Recency ratio says: "I care about freshness, not history." Your compression betrays your beliefs about what matters.

A case study in compression

Let's make this concrete. You've got procurement data: 100 transactions per supplier, each with dates, prices, and quantities.

For one supplier, this is maybe 5,000 tokens of context. For 50 suppliers? You've spent most of your context budget just showing the LLM your data.

Now watch what happens when we compress:

Volatility story (Price CV over 90d: 0.08)

  • What it says: "This supplier's pricing is stable"
  • What it destroys: Specific prices, dates, trend direction
  • Token cost: ~5 tokens
  • Why it works: One number triggers reliability reasoning

Temporal story (Median days between orders: 18.5)

  • What it says: "We order from them roughly every two and a half weeks"
  • What it destroys: Exact dates, quantities, gaps
  • Token cost: ~5 tokens
  • Why it works: The model can immediately assess if the relationship is healthy

Relationship story (Share of category spend, 90d trend: +15%)

  • What it says: "We're consolidating onto this supplier"
  • What it destroys: Absolute spend, other suppliers, category details
  • Token cost: ~6 tokens
  • Why it works: Strategic direction in one number

Now your 5,000 tokens become 50 tokens. Your 50 suppliers fit comfortably in a couple thousand tokens instead of 250,000.

Compression ratio: 100:1.
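The arithmetic behind those figures is simple enough to keep in a helper. This is only a sketch: `context_budget` is a name I made up, and the per-supplier token counts are the rough estimates from this example rather than measured values.

```python
def context_budget(raw_per_entity, compressed_per_entity, entities):
    """Return total raw tokens, total compressed tokens, and the compression ratio."""
    raw_total = raw_per_entity * entities
    compressed_total = compressed_per_entity * entities
    return raw_total, compressed_total, raw_total / compressed_total


# Estimates from the example: 5,000 raw vs 50 compressed tokens, 50 suppliers.
raw, compressed, ratio = context_budget(5_000, 50, 50)
print(f"{raw:,} raw tokens vs {compressed:,} compressed ({ratio:.0f}:1)")
# prints: 250,000 raw tokens vs 2,500 compressed (100:1)
```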

But here's the beautiful part: you haven't just saved tokens. You've made the data more useful. The model doesn't have to hunt for patterns. You've already done the cognitive work. You've encoded the narratives that matter.

Feature engineering as semantic compression

I think about this like MP3 encoding.

MP3 throws away audio detail most listeners won't notice. JPEG throws away fine color variation our eyes barely register. These aren't bugs, they're features. They exploit properties of human perception to achieve massive compression while preserving what matters.

Feature engineering does the same thing for LLM reasoning. You're throwing away details the model doesn't need to preserve the signals it does.

The coefficient of variation is a perfect example. It's built from just two statistics and one operation:

  • Mean
  • Standard deviation
  • The ratio between them (standard deviation divided by mean)

But that ratio tells a story. CV of 0.1? Stable. CV of 0.5? Volatile. CV of 1.2? Chaos.

You've compressed maybe 100 data points into a single number, but you haven't lost the narrative. You've concentrated it.
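A minimal sketch of that calculation, using only the standard library. The function names and the two label cutoffs are my assumptions; the essay's examples (0.1, 0.5, 1.2) only fix the ends of the scale, so tune the boundaries for your own domain.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / abs(m)


def volatility_label(cv, stable_below=0.25, chaotic_from=1.0):
    """Map a CV to a coarse label; both cutoffs are illustrative assumptions."""
    if cv < stable_below:
        return "stable"
    if cv < chaotic_from:
        return "volatile"
    return "chaotic"
```

Because CV divides by the mean, it is unitless: a supplier selling $2 washers and one selling $20,000 pumps land on the same scale.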

How to compress with intent

The hard part isn't the math. Calculating CV is trivial. The hard part is deciding which story to tell.

Here's my framework:

  1. Start with the question. What's the LLM supposed to help with? Risk assessment? Demand forecasting? Optimization?
  2. Identify the patterns that matter. For risk: volatility, concentration, relationship health. For forecasting: seasonality, trends, external factors. For optimization: efficiency, utilization, bottlenecks.
  3. Build features that compress toward those patterns. Each feature should answer one specific question in the most token-efficient way possible.
  4. Ruthlessly destroy everything else. If it doesn't serve the narrative, it's noise. Cut it.

Treat your feature set like a schema, not an ad-hoc summary. Name each metric, define its time window, define its units, and version it. You're building a contract between raw data and reasoning.

A practical workflow looks like: raw data → exploratory analysis → feature schema → compressed snapshots → LLM reasoning → cite back to metrics → keep raw for audit.
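One way to make the "schema, not summary" idea tangible is a small sketch like the following. Everything named here (`MetricSpec`, `SCHEMA_VERSION`, `render_snapshot`, the metric names and windows) is hypothetical and only meant to show the shape of the contract: each metric has a name, a question, a window, and a unit, and a snapshot that drifts from the schema fails loudly instead of silently changing the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    name: str      # stable identifier the model can cite back
    question: str  # the one question this metric answers
    window: str    # time window, e.g. "90d"
    unit: str      # "ratio", "days", "percent", ...


SCHEMA_VERSION = "supplier_risk/v1"  # hypothetical version tag

SCHEMA = (
    MetricSpec("price_cv", "How volatile is unit price?", "90d", "ratio"),
    MetricSpec("median_days_between_pos", "How often do we order?", "90d", "days"),
    MetricSpec("category_spend_share", "How concentrated is spend?", "90d", "ratio"),
)


def render_snapshot(entity_id, values, schema=SCHEMA, version=SCHEMA_VERSION):
    """Render one compressed snapshot as prompt text, failing on schema drift."""
    expected = {spec.name for spec in schema}
    if set(values) != expected:
        raise ValueError(
            f"snapshot keys {sorted(values)} do not match schema {sorted(expected)}"
        )
    lines = [f"{entity_id} [{version}]"]
    for spec in schema:
        lines.append(f"- {spec.name} ({spec.window}, {spec.unit}): {values[spec.name]}")
    return "\n".join(lines)
```

Keeping the version tag in the rendered text means any answer the model gives can be traced back to the exact metric definitions that produced it.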

Let me give you examples across different use cases:

For demand forecasting:

  • Seasonal concentration index (12mo): What % of annual demand happens in peak quarter?
  • Trend acceleration (90d/90d): Current quarter avg / previous quarter avg
  • Volatility (12mo): CV of monthly demand
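As a rough sketch, all three forecasting features can come out of a single list of twelve monthly totals. The function name, the key names, and the assumption that the list is oldest-first and aligned to calendar quarters are mine; the trend ratio uses the last three months against the three before them.

```python
from statistics import mean, pstdev


def demand_features(monthly_demand):
    """monthly_demand: 12 monthly totals, oldest first (assumed calendar-aligned)."""
    if len(monthly_demand) != 12:
        raise ValueError("expected exactly 12 monthly values")
    quarters = [sum(monthly_demand[i:i + 3]) for i in range(0, 12, 3)]
    annual = sum(monthly_demand)
    return {
        # Share of annual demand that falls in the busiest quarter.
        "seasonal_concentration_12mo": max(quarters) / annual,
        # Current quarter average divided by previous quarter average.
        "trend_ratio_90d": mean(monthly_demand[-3:]) / mean(monthly_demand[-6:-3]),
        # Coefficient of variation of monthly demand.
        "demand_cv_12mo": pstdev(monthly_demand) / mean(monthly_demand),
    }
```

A perfectly flat year gives a seasonal concentration of 0.25, so anything well above that is the seasonality story.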

For supplier risk:

  • Price stability (90d): CV of unit prices
  • Delivery consistency (90d): CV of lead times
  • Concentration (category-level): Gini coefficient across suppliers (inequality of spend distribution)
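The Gini coefficient is the least familiar of these, so here is a small sketch using the standard rank-weighted formula. The function name and the input shape (one spend total per supplier in the category) are assumptions for illustration.

```python
def gini(spend_by_supplier):
    """Gini coefficient of spend: 0 = evenly spread, higher = more concentrated."""
    values = sorted(spend_by_supplier)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one supplier with positive total spend")
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

One caveat worth encoding in the metric's definition: with n suppliers the maximum is (n - 1) / n, so a category with four suppliers tops out at 0.75 rather than 1.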

For inventory optimization:

  • Stock-out frequency (90d): % of weeks with zero inventory
  • Turnover trend (90d vs. history): Current quarter turns / historical avg
  • Demand correlation (12mo): How linked is this SKU to others?
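A sketch of the three inventory features, again standard library only. The function names and input shapes are assumptions, and "how linked" is interpreted here as a plain Pearson correlation between two SKUs' demand series, which is one reasonable reading among several.

```python
from math import sqrt


def stockout_frequency(weekly_on_hand):
    """Share of weeks that ended with zero units on hand."""
    return sum(1 for units in weekly_on_hand if units == 0) / len(weekly_on_hand)


def turnover_trend(current_turns, baseline_turns):
    """Ratio above 1 means inventory is turning faster than the baseline."""
    return current_turns / baseline_turns


def demand_correlation(sku_a, sku_b):
    """Pearson correlation between two equally long demand series."""
    if len(sku_a) != len(sku_b):
        raise ValueError("series must have the same length")
    n = len(sku_a)
    mean_a, mean_b = sum(sku_a) / n, sum(sku_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(sku_a, sku_b))
    var_a = sum((a - mean_a) ** 2 for a in sku_a)
    var_b = sum((b - mean_b) ** 2 for b in sku_b)
    # Undefined (division by zero) if either series is constant.
    return cov / sqrt(var_a * var_b)
```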

Each of these is 5–8 tokens. Each tells one clear story. String together 10–15 of them and you've given the LLM a complete picture in under 150 tokens.

The interpretability dividend

Here's a bonus: compressed features are explainable.

When the LLM says "Supplier A is high risk because CV=0.82 and concentration=0.71," a human can actually validate that reasoning. The features have business meaning.

Compare that to: "Based on the transaction history…" [cites rows 47, 103, and 891]. Good luck auditing that.

Compression creates a shared analytical language between you, the LLM, and your stakeholders. Everyone can reason about "price volatility" or "demand seasonality." Not everyone can reason about raw transaction logs.

When NOT to compress

Look, I'm obviously bullish on compression. But there are times when you shouldn't do it:

  • Exploratory analysis. If you don't know what patterns matter yet, don't compress. Let the LLM explore the raw data first. But once you've explored? Compress what you learned.
  • Anomaly detection. Sometimes the weird stuff is in the details. If you're looking for outliers or fraud, you might need the raw transactions.
  • Regulatory requirements. Some contexts require full audit trails. Compress for analysis; preserve raw data for compliance.

But even in these cases, I'd argue: do your exploration, find your patterns, then compress for production use.

The compression mindset

Here's how I think about this now:

You're not doing "feature engineering" in the old machine learning sense. You're encoding narratives for language model consumption.

Each feature is a codec. It compresses one story (volatility, trend, concentration, whatever) into the most token-efficient representation possible.

String together multiple features and you're giving the LLM multiple parallel stories. It can synthesize across them, find interactions, generate insights.

But you've done the hard work. You've curated the data. You've decided what matters. You've compressed with intent.

The compression quality checklist:

  • Can you explain this feature in one sentence?
  • Does it save at least 90% of tokens?
  • Does it preserve the signal that matters for your use case?
  • Would ten analysts compress the same way? (That's standardization)

If you can't check all four boxes, keep iterating.
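The second box is the only one you can check mechanically. Here is a rough sketch, assuming a crude estimate of about four characters per token. That heuristic is an assumption, not a tokenizer, so swap in your model's real token counter before trusting the number; the function names are made up for this example.

```python
def estimate_tokens(text):
    """Very rough estimate: about four characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def token_savings(raw_text, compressed_text):
    """Fraction of estimated tokens removed by compression."""
    return 1 - estimate_tokens(compressed_text) / estimate_tokens(raw_text)


def meets_savings_target(raw_text, compressed_text, target=0.90):
    """True when compression removes at least `target` of the estimated tokens."""
    return token_savings(raw_text, compressed_text) >= target
```

The other three boxes still need a human, which is sort of the point.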

This changes everything

We spent the 2010s worrying about "big data": how to store it, process it, analyze it at scale.

The 2020s are about "right data": what to feed these powerful but token-hungry reasoning engines.

LLMs reward curation over volume. They reward compression over comprehensiveness. They reward you for doing the cognitive work up front.

Feature engineering isn't dead. It's just evolved. We're not engineering features for gradient boosted trees anymore. We're compressing narratives for language models.

And honestly? I think this is more interesting. Because you're not just optimizing for model performance, you're optimizing for understanding. For interpretability. For that moment when the LLM, your stakeholder, and you are all reasoning about the same compressed representation of reality.

That's the kind of compression I can get excited about.

Pick one domain object, define 8–12 metrics, and run an A/B test: raw rows vs compressed snapshot. If the compressed version doesn't produce more consistent reasoning, your features aren't telling the right story yet.
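A minimal sketch of that A/B test, assuming you wrap your model call in a function that returns a short categorical answer such as a risk label. `ask_model` is a hypothetical callable you supply, not a real library API, and "consistency" here is simply the share of runs that agree with the most common answer.

```python
from collections import Counter


def consistency(ask_model, prompt, runs=10):
    """Share of runs that agree with the most common answer (1.0 = fully consistent)."""
    answers = [ask_model(prompt) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][1] / runs


def compare_prompts(ask_model, raw_prompt, compressed_prompt, runs=10):
    """Return (raw_consistency, compressed_consistency) for the same question."""
    return (
        consistency(ask_model, raw_prompt, runs),
        consistency(ask_model, compressed_prompt, runs),
    )
```

Agreement is a blunt measure, and it says nothing about whether the consistent answer is correct, so pair it with a handful of cases where you already know the right call.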

What's your one metric that carries the story: your CV, your seasonality index, your concentration score?