
Data as a Public Utility: The Problem We Pretend Doesn't Exist

Why public data deserves public infrastructure, and how patents on basic ML techniques applied to taxpayer-funded datasets are gatekeeping disguised as innovation.


I'm a data scientist, more or less. I build systems that process information. Recently, I started looking at the HTS (Harmonized Tariff Schedule) classification problem: how goods get assigned tariff codes when they enter the United States. From a pure machine learning perspective, it seemed straightforward enough: semantic search over hierarchical data, some retrieval-augmented generation, maybe some clever ranking. Standard stuff.

Then I looked at the competitive landscape.

Patents. Everywhere. Patents on semantic search for tariff data. Patents on "methods" for retrieval-augmented generation in trade classification. Patents on basic information retrieval techniques applied to public datasets.

And I realized something uncomfortable: the innovation isn't being patented. The gatekeeping is.

When did we accept that accessing public data (data we paid for with our taxes) should require licensing someone's "invention" of searching it?

If you've ever tried to programmatically access HTS codes, you know exactly what I'm talking about. If you haven't, you're about to get angry on behalf of people who have.

We paid for the water. They patented the faucet.

Let's Talk About What's Actually Happening

I'm not being abstract here. Companies are filing patents that describe, in essence:

  • "Method for semantic classification using retrieval augmented generation"
  • "System for automated tariff code recommendation using machine learning"
  • "Method for similarity-based classification in hierarchical taxonomies"

Let me translate: RAG over public data. Semantic search. Basic ML applications.

Here's a real example. A recent patent application claims as a novel invention:

"A system comprising: a database storing hierarchical classification codes; an embedding model configured to generate vector representations of text descriptions; a retrieval mechanism configured to receive a query, generate a vector representation, and return ranked results based on similarity."

Read that again. That's describing a vector database. Applied to public data.

The "invention" is: take sentence embeddings (open source, 2019), index them with FAISS (open source) or a hosted vector store like Pinecone, apply that to HTS codes (taxpayer-funded public data), return results. That's it. That's the patent.
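To make the point concrete, here is roughly what the claim quoted above amounts to in code. This is a minimal sketch, not anyone's actual system: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, the `CODES` table is a three-row sample with paraphrased descriptions, and the names `embed`, `cosine`, and `search` are mine.

```python
import math
import re
from collections import Counter

# Illustrative sample: three HTS headings with paraphrased descriptions.
CODES = {
    "0901": "coffee, roasted or not roasted",
    "6109": "t-shirts and singlets, knitted or crocheted",
    "8471": "automatic data processing machines such as laptop computers",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Embed the query, score every stored code, return ranked results."""
    q = embed(query)
    scored = [(code, cosine(q, embed(desc))) for code, desc in CODES.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Swap the toy `embed` for a real model and the dictionary for a vector index and you have the "system comprising" a database, an embedding model, and a retrieval mechanism. The hard part is everything around it.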

Here's the technical reality of what these patents claim as novel:

  • Sentence transformers. Open source since 2019. The entire field uses them.
  • Vector search. Pinecone, Weaviate, Milvus, Qdrant, and dozens of other implementations, many of them open source.
  • RAG. Described in academic papers, widely documented, implemented across thousands of applications.
  • Hierarchical classification. Not new. Not even close to new.

None of this is novel. It's the application of existing, well-understood techniques to a new domain. Which is valuable. Which takes work. But it's not an invention. It's engineering.

And here's the concerning pattern: instead of competing on execution quality, companies are racing to patent obvious applications of public techniques as a moat around public data.

The Real Innovation Deficit

You want to know what's actually hard about HTS classification?

  • Parsing inconsistent government data formats (PDF tables, nested HTML, changing schemas)
  • Maintaining version control across HTS schedule updates (twice yearly, with retroactive changes)
  • Building high-quality embeddings from sparse, technical descriptions
  • Handling legal precedent and CROSS rulings (200,000+ entries, unstructured text)
  • Dealing with edge cases where multiple codes could apply
  • Making classifications explainable for customs compliance

None of this is patentable. It's hard, valuable engineering work. It's data cleaning, pipeline maintenance, domain expertise, and careful system design.
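Take the version-control item from the list above. Much of it comes down to unglamorous bookkeeping like the following: comparing two snapshots of the schedule and reporting what moved. This is a sketch under a simplifying assumption, namely that each snapshot has already been parsed into a plain mapping from code to description; `diff_revisions` is a hypothetical helper name.

```python
def diff_revisions(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two schedule snapshots (code -> description).

    Returns the codes that were added, removed, or had their
    description changed between the old and new snapshot.
    """
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(code for code in shared if old[code] != new[code]),
    }
```

Nothing clever is happening here, which is the point. The value is in running this reliably against every update, not in the idea of running it.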

But instead of competing on who does this better (who builds faster pipelines, cleaner data, better user experiences), companies are trying to patent the basic approach itself.

The result? Every new entrant has to:

  • Navigate a patent minefield
  • Build the same infrastructure from scratch (can't build on prior work)
  • Scrape the same government websites everyone else scrapes
  • Parse the same inconsistent formats
  • Duplicate the same fundamental engineering work

This isn't innovation. It's waste.

Data as a Public Utility

Let's step back. What is the HTS schedule?

  • Maintained by the U.S. International Trade Commission (taxpayer-funded)
  • CROSS rulings published by U.S. Customs and Border Protection (taxpayer-funded)
  • Tariff rates set by the Department of Commerce (taxpayer-funded)

This is our data. We paid for it.

Every year, $3 trillion in goods enter the United States. Every single shipment needs an HTS classification. Every classification depends on public data maintained by government agencies with public funds.

Yet here's the current state:

  • Government publishes data in PDFs and HTML tables (practically unusable at scale)
  • Private companies scrape, structure, and charge thousands per month for access
  • Patents get filed on "methods" to search this public data
  • Small businesses can't afford the subscriptions
  • Developers build brittle scrapers that break with each update
  • Everyone duplicates the same infrastructure work

This is infrastructure failure, not innovation.
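For anyone who hasn't written one of these scrapers, here is a sketch of the kind of code everyone ends up duplicating. It uses only Python's standard `html.parser`. The `SAMPLE` markup is made up for illustration and is not the layout of any real government page, and `TableRows` and `parse_table` are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up markup for illustration, including a cell split across lines.
SAMPLE = """
<table>
  <tr><th>Heading</th><th>Description</th></tr>
  <tr><td>0901</td><td>Coffee, <b>roasted</b>
      or not</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse stray whitespace and line breaks inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html: str) -> list[list[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling `parse_table(SAMPLE)` gives back one list per row. Every assumption baked in here (one table, no merged cells, no nested tables) is a place where the scraper breaks the next time the page changes.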

Why Government Doesn't Fix This

Here's the reality: government agencies aren't failing accidentally. They're succeeding at the wrong thing.

USITC's mandate is to publish the HTS schedule. Not to make it usable. Not to provide APIs. Not to ensure programmatic access. Just to make it technically available.

They meet that mandate with PDFs. Box checked. Budget justified.

CBP's mandate is to collect customs rulings and make them publicly accessible. They do that with an HTML search interface that breaks constantly and returns inconsistent results. Mission accomplished.

There's no line item in their budget for "build developer-friendly infrastructure." There's no performance metric for "API uptime." There's no promotion for the person who makes HTS data machine-readable.

The incentive structure produces exactly what we see: data that is public in legal terms but unusable in practical terms.

This isn't incompetence. It's institutional design. Agencies optimize for the metrics they're judged on, and "ease of programmatic access" isn't one of them.

The Weather Data Parallel

We used to pay private companies for weather data.

Before the 1990s, accessing weather forecasts and meteorological data meant paying substantial fees to private companies that acted as intermediaries to government weather services. Basic forecasts were expensive. Detailed data was prohibitively costly. Innovation was constrained by data access costs.

Then NOAA made weather data freely available through open APIs.

The result?

  • Explosion of weather applications and services
  • Weather.com, Dark Sky, ten-day forecasts, hyperlocal alerts
  • Weather data integrated into everything from farming to logistics
  • Competition on features, user experience, and accuracy, not data access
  • Massive economic value created from public infrastructure

Today, nobody argues that weather data should be paywalled. It would seem absurd. Weather data is recognized as public infrastructure.

Trade data should work the same way.

The Real Cost

This isn't just a philosophical problem. It's expensive.

U.S. trade penalties run into the billions annually, with the majority stemming from unintentional violations. Most of these? Classification errors: using the wrong HTS code because the data is hard to access and harder to search correctly.

The direct costs add up fast:

  • First ISF (Importer Security Filing) violation: $5,000 per shipment
  • Repeat offenses: $10,000+ per shipment
  • Classification errors: penalties ranging from thousands to millions, depending on the duty differential

But the direct fines are just the beginning:

  • Customs delays. Misclassified products trigger inspections, holding up inventory and disrupting supply chains.
  • Audit risk. Companies with classification errors face increased scrutiny on all future imports.
  • Legal fees. Contesting or resolving penalties requires expensive, specialized counsel.
  • Opportunity cost. Every hour spent parsing government PDFs is an hour not spent building product.

For small importers, a single classification error can mean the difference between profitability and bankruptcy. And who bears this cost disproportionately? Not large enterprises with dedicated trade compliance teams. Small businesses. Startups. First-time importers.

The people who need accessible data most are the ones who can't afford the gatekeepers.

This isn't because businesses are careless. It's because accessing the information needed to classify correctly is unnecessarily difficult and expensive.

What Public Infrastructure Actually Looks Like

"Data as a public utility" isn't a metaphor. It's a technical architecture.

Here's what the layers should look like:

Public Data (Already Paid For)
    • HTS Schedule (USITC)
    • CROSS Rulings (CBP)
    • Tariff Rates (Commerce)

            │ Currently: PDFs, HTML, brittle scraping
            │ Should be: Clean APIs, structured data

            v
Open Infrastructure (Community)
    • Structured database schema
    • Clean REST APIs (OpenAPI specs)
    • Semantic search (open models)
    • Documentation and tooling
    • Version control and updates

            │ Free to use, fork, modify
            │ Compete on execution

            v
Value-Added Services (Businesses)
    • Professional features
    • Hosted infrastructure
    • Expert consultation
    • Custom integrations
    • Compliance automation
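As one possible shape for the "structured database schema" item in the middle layer, here is a small sketch of a code type that understands the hierarchy. It relies on the standard layout of a 10-digit HTS number: 2-digit chapter, 4-digit heading, 6-digit subheading, 8-digit tariff line, 10-digit statistical suffix. The class name `HtsCode` and its methods are my own illustration, not an existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HtsCode:
    digits: str  # 4, 6, 8, or 10 digits with punctuation stripped

    @classmethod
    def parse(cls, raw: str) -> "HtsCode":
        """Accept forms like '0901.21.00.10' or '090121'."""
        digits = "".join(ch for ch in raw if ch.isdigit())
        if len(digits) not in (4, 6, 8, 10):
            raise ValueError(f"unexpected HTS code length: {raw!r}")
        return cls(digits)

    @property
    def chapter(self) -> str:
        return self.digits[:2]

    @property
    def heading(self) -> str:
        return self.digits[:4]

    def ancestors(self) -> list[str]:
        """Prefixes from chapter down to this code, broadest first."""
        return [self.digits[:n] for n in (2, 4, 6, 8, 10) if n <= len(self.digits)]
```

With this, `HtsCode.parse("0901.21.00.10").ancestors()` walks from chapter `09` down to the full 10-digit number (the suffix in that example is only illustrative). A shared, documented type like this is exactly the kind of thing that belongs in the open layer rather than in ten private codebases.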

The key principle: infrastructure should be free, services can charge.

This isn't radical. It's how healthy technical ecosystems work:

  • Linux (free) → Red Hat (paid support and enterprise features)
  • Postgres (free) → Supabase / AWS RDS (paid hosting and tooling)
  • Python (free) → PyCharm / Anaconda (paid professional tools)
  • React (free) → Vercel / Netlify (paid deployment and services)

None of these companies needed patents. They compete on execution: better performance, better support, better user experience, better integrations.

Why Patents Are Particularly Corrosive Here

Patents made sense for physical inventions. Real R&D investment. High capital costs. Manufacturing barriers. True technical breakthroughs that needed protection to justify investment.

In software, especially in applied machine learning, patents are often just gatekeeping.

"We applied RAG to trade data" is not an invention. "We built semantic search over HTS codes" is not an invention. These are applications of known techniques to a new domain. That's valuable work, but it's not patentable innovation.

The real effect of these patents: they create barriers to accessing public information.

Think about what happens:

  • Every company scrapes the same government websites
  • Everyone builds the same parsers for the same formats
  • Everyone implements the same search infrastructure
  • Everyone duplicates the same data cleaning pipelines
  • Nobody can build on anyone else's work
  • Innovation happens in silos

Patents don't protect innovation here. They prevent infrastructure.

The irony is that the people filing these patents are themselves building on decades of open research:

  • Transformer models (Vaswani et al., 2017, open paper)
  • BERT and sentence embeddings (Google, 2018, open sourced)
  • Vector databases (decades of research, countless implementations)
  • RAG architectures (Lewis et al., 2020, open paper)

They took open research, applied it to public data, and now want to patent the combination. That's not standing on the shoulders of giants. That's pulling the ladder up behind you.

The Innovation We Actually Need

You want to know what would be genuinely innovative?

Government providing structured APIs for public data.

USITC publishing the HTS schedule as a versioned, machine-readable database with a clean REST API. CBP exposing CROSS rulings as queryable datasets with proper search capabilities. Commerce Department providing tariff rates through standardized endpoints.

That would be innovation. That would be infrastructure.
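What would "versioned" mean in practice? At minimum, being able to ask what a code meant on a given date. The sketch below shows one way to do that in memory. It is an assumption-heavy illustration: `VersionedSchedule`, `publish`, and `lookup` are hypothetical names, and a real service would sit on a database behind an HTTP endpoint rather than on Python lists.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional


class VersionedSchedule:
    """Keep every published snapshot and answer 'as of' queries."""

    def __init__(self):
        self._effective = []  # effective dates, kept sorted
        self._snapshots = []  # snapshot dicts, parallel to _effective

    def publish(self, effective: date, snapshot: dict[str, str]) -> None:
        """Record a snapshot (code -> description) taking effect on a date."""
        i = bisect_right(self._effective, effective)
        self._effective.insert(i, effective)
        self._snapshots.insert(i, dict(snapshot))

    def lookup(self, code: str, as_of: date) -> Optional[str]:
        """Return the description in force on `as_of`, or None if there is none."""
        i = bisect_right(self._effective, as_of)
        if i == 0:
            return None  # nothing had been published yet
        return self._snapshots[i - 1].get(code)
```

An importer disputing a classification from eight months ago needs the schedule as it stood then, not as it stands today. That is an afternoon of work for an agency that holds the source data and a permanent reconstruction project for everyone else.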

Instead, we have PDFs and HTML tables. And companies racing to patent the basic techniques required to make that data usable.

The real innovation deficit isn't in machine learning. It's in recognizing that public data requires public infrastructure.

How This Should Work

Let me be clear about what "data as a public utility" actually means.

It's not about everything being free. It's not about eliminating businesses. It's about layers.

The infrastructure layer (structured databases, clean APIs, searchable formats) should be public. Just like roads, or electrical grids, or weather data.

The value-added layer (professional tools, expert systems, hosted services, custom integrations) can absolutely be commercial. Should be commercial.

This is how healthy technical ecosystems work:

  • Linux is free. Red Hat sells enterprise support and features.
  • Postgres is free. Companies sell hosting, tooling, and managed services.
  • Python is free. Companies sell professional IDEs and platforms.
  • Weather data is free. Dozens of companies build forecasting apps and services.

None of these companies needed patents on "methods for using Linux" or "systems for querying Postgres." They compete on execution: better performance, better support, better user experience.

Infrastructure should be free. Services can charge.

The "But What About Patents?" Question

Let me address the objection head-on: "But don't companies need patents to raise capital? Without IP protection, who funds infrastructure?"

Fair question. Let's think it through carefully.

First claim: Infrastructure needs patents to be funded.

Counterexample: the entire internet. Linux. Postgres. Python. Git. Docker. Kubernetes. React. The foundational infrastructure of modern computing has zero patents on core functionality.

These aren't small projects. These are billion-dollar ecosystems. Linux powers AWS, Google Cloud, and Azure. Postgres runs inside countless enterprises. React powers Facebook, Netflix, and thousands of applications.

None of them needed patents. They got funded anyway.

Second claim: Companies need patent defensibility to justify investment.

Look at the actual capital markets:

  • Elastic. Open source search, $8B+ market cap, raised hundreds of millions.
  • HashiCorp. Open source infrastructure tools, multi-billion dollar valuation.
  • Supabase. Open source Postgres platform, $80M+ ARR.
  • Hugging Face. Open source ML models, $4.5B valuation.
  • Confluent. Open source Kafka, multi-billion dollar company.

None of these companies have patents on their core techniques. Investors bet hundreds of millions on them anyway.

Why? Because real defensibility comes from execution:

  • Better product experience
  • Faster innovation cycles
  • Deeper domain expertise
  • Superior customer support
  • Network effects and community

A patent on "semantic search for public data" isn't defensibility. It's rent-seeking.

The reality is that companies don't patent to fund infrastructure. They patent to create moats around public data.

The real question isn't "how do businesses survive without patents?"

The real question is: "Why should accessing public data require licensing basic ML techniques from private companies?"

That's the question patent defenders don't want to answer.

Supabase doesn't have patents on "systems for accessing Postgres." They have excellent developer experience, managed hosting, and real-time features. That's their moat.

The companies that succeed on open infrastructure compete on:

  • Performance. Faster, more reliable, better optimized.
  • User experience. Easier to use, better documentation, thoughtful design.
  • Domain expertise. Deep understanding of customer needs.
  • Support quality. Responsive help when things break.
  • Integration ecosystem. Works seamlessly with existing tools.

None of that requires patents. In fact, patents often signal the opposite: "Our execution isn't strong enough, so we need legal protection."

The counterargument fails on its own terms. If the underlying technique is actually novel and non-obvious, then replicating it should be hard even without patent protection. If it's easy to replicate, then it probably shouldn't have been patentable in the first place.

Infrastructure built on public data should compete on execution, not on who hired the best patent attorney.

What Gets Built on Public Infrastructure

When infrastructure is public, competition happens where it should: on execution quality.

Companies don't compete on who has the best parser for government PDFs. They compete on who builds the best user experience, the most reliable service, the most valuable features.

This is better for everyone:

  • Developers spend time on innovation, not duplicating basic infrastructure
  • Small businesses can afford to build or buy tools
  • Established companies compete on quality, not legal moats
  • Users get better products from actual competition

The question isn't whether businesses can thrive on public infrastructure. The question is why we accept that they can't.

What Success Looks Like

Success isn't a single company dominating the market. Success is infrastructure that enables an ecosystem.

Near term:

  • Open infrastructure becomes available (however it gets built)
  • Multiple companies build commercial products on it
  • Competition shifts from data access to product quality
  • Patents on basic techniques become irrelevant (prior art documented, techniques standardized)

Long term:

  • Government recognizes the need and provides official APIs
  • USITC publishes structured HTS data with version control
  • CBP exposes CROSS rulings as queryable datasets
  • Industry norm becomes: public data = public infrastructure

The ultimate success: infrastructure becomes so standard that we stop thinking about it as special.

Just like we don't debate whether weather data should be open anymore. Just like we don't patent "methods for reading road signs." Some things should just be infrastructure.

The Policy Question

The real solution isn't technical. It's policy.

USITC should provide structured APIs for HTS data. Not PDFs. Not HTML tables. Actual REST APIs with versioning, documentation, and reliability guarantees.

CBP should expose CROSS rulings as machine-readable, queryable datasets. Make them searchable, linkable, and programmatically accessible.

Like NOAA did for weather data, make trade data truly public. Not just "technically available" but actually usable.

This isn't a radical proposal. It's how public infrastructure should work.

Until that happens, the gap will be filled by whoever has the resources to build infrastructure and the patents to defend it. That's not a market failure. That's an infrastructure failure.

The Broader Pattern

This isn't unique to trade data. Look across domains:

  • Legal databases. Court decisions are public records. Accessing them requires expensive Westlaw or LexisNexis subscriptions.
  • Regulatory data. FDA approvals, EPA rules, SEC filings: all public, all gatekept by intermediaries.
  • Scientific papers. Often taxpayer-funded research. Locked behind journal paywalls.
  • Geographic data. Government mapping data. Monetized by private companies despite public funding.

The pattern repeats:

  • Government creates valuable data with public funds
  • Publishes in formats that are technically "public" but practically unusable
  • Private companies structure it and charge for access
  • Patents get filed on basic techniques
  • Infrastructure gets duplicated instead of shared

The challenge: fix the infrastructure, not just the symptoms.

Government can do this. Policy makers can push for it. Technologists can build it (with or without government). But someone needs to recognize that this is an infrastructure problem, not a market problem.

Who This Is For

This is for people who believe:

  • Developers should be able to access public data without expensive subscriptions
  • Small businesses shouldn't be priced out of basic compliance tools
  • Researchers should be able to study trade patterns without gatekeepers
  • Competition should be on execution quality, not legal moats
  • Public data deserves public infrastructure

If you're a customs broker tired of vendor lock-in, this is for you.

If you're a freight forwarder building internal tools, this is for you.

If you're a policymaker who thinks accessible information matters, this is for you.

If you're a developer who's scraped one too many government websites, this is for you.

The Real Question

Here's what I'm asking:

Why do we accept that accessing public data should require expensive subscriptions and patent licenses?

Not "how do we build a better product?" That's easy. Every data scientist knows how to build semantic search. Every engineer knows how to parse government data.

The hard question: why are we solving the same infrastructure problem dozens of times instead of once?

This isn't about any single company or project. It's about recognizing that public data needs public infrastructure.

The question isn't whether this is a problem. The question is what we're going to do about it.

What You Can Do

If you're a developer working in this space:

  • Publish your schemas. Document your parsers. Make your work forkable.
  • Establish prior art through working code, not patent applications.
  • Build in public. Share what you learn.
  • Let's create a commons of trade data infrastructure.

If you're funding work in this space:

  • Reward open infrastructure, not patent portfolios.
  • Back teams that build on public standards rather than those trying to patent public techniques.

If you're a small business dealing with classification:

  • Push back on expensive subscriptions. Demand transparency.
  • Share your pain points publicly. The more visible this problem is, the harder it is to ignore.

If you're a policymaker or government employee:

  • Recognize that "technically public" isn't "actually accessible."
  • Budget for infrastructure, not just publication.
  • Look at what NOAA did with weather data. That's the model.

If you're a researcher studying trade, economics, or policy:

  • Make noise about data accessibility issues in your field.
  • Cite the infrastructure gap. Advocate for programmatic access.

The tools exist. The techniques are well-understood. The data is already public.

What's missing is the will to recognize this as an infrastructure problem, not a market opportunity.

Back to First Principles

I started looking at this as a data scientist trying to solve a technical problem.

I realized it's not a technical problem. It's an infrastructure problem.

The machine learning isn't the innovation. The semantic search isn't the innovation. Those are applications of well-understood techniques.

The innovation is recognizing that this should be infrastructure, not a product.

The core issue: we're patenting basic techniques and charging for access to public data because we've failed to build proper infrastructure.

Weather data used to work this way. Then we fixed it.

Trade data works this way now. We can fix it too.

The data is ours. The infrastructure should be too.


These are my personal views on public data infrastructure. If you're working on similar problems, in trade, legal, regulatory, or any other domain, I'd love to hear your perspective.