ReferenceQPS v1.0

Operations reference

How to operate QPS in production: choosing a checksum method, interpreting drift types, deciding when to replay, setting retention, and the security posture the spec assumes you take.

Choosing a checksum method

The checksum method controls how sensitively drift gets detected. Stronger methods catch more, at the cost of compute. Four options are documented:

MethodWhat it catchesCostUse when
row_content_hashAny row addition, removal, or content change. Strongest.High. SHA256 over sorted, serialized rows.Default for procurement and any data where row-level accuracy matters.
row_count_onlyRow additions and removals. Misses content changes.Trivial.Append-only sources where you trust the source to never update rows in place.
column_stats_hashRow count changes plus aggregate shifts (min, max, sum per column). Misses content changes that preserve aggregates.Medium. One pass per column, no row serialization.Very large telemetry sets where row_content_hash is too expensive but row_count_only is too weak.
noneNothing. No checksum computed.Zero.Development or one-shot exports where you do not intend to detect drift later.

row_content_hash, in pseudocode

function row_content_hash(rows):
    sorted = stable_sort(rows, key=primary_sort_keys)
    serialized = canonical_json(sorted)
    return sha256(serialized.encode("utf-8"))

Sort stably on the primary sort keys before serializing. Unstable sort breaks replay (the same rows hash differently). Use canonical JSON (sorted keys, no whitespace) so the same content always serializes the same way.

column_stats_hash, in pseudocode

function column_stats_hash(rows):
    stats = {}
    for col in numeric_columns(rows):
        stats[col] = {
            "min": min(row[col] for row in rows),
            "max": max(row[col] for row in rows),
            "sum": sum(row[col] for row in rows),
            "count_non_null": count_non_null(row[col] for row in rows),
        }
    return sha256(canonical_json(stats).encode("utf-8"))

Cheaper than row_content_hash because you do not serialize rows. Catches the common drift cases (late-arriving rows, mass updates, backfills) but misses surgical changes that preserve aggregates. Worth pairing with row_count tracking for defense in depth.

Drift type taxonomy

When drift_detected is true, the execution entry must set a drift_type. Five values are documented. Each has a typical cause and a recommended response.

row_count_increase

More rows came back than at generation. Typical causes:

  • Late-arriving transactions (POs entered after the generation snapshot but with purchase dates inside the window).
  • Batch backfill from a source system, common at month-end or after a data fix.
  • Time-window query against a non-frozen window (your query says “last 24 months” rather than a fixed end date).

Response:compute the impact on the manifold's L0 numbers. If the delta is within rounding (HHI moves by <0.01, total spend by <1%), record the assessment in drift_note and continue using the original manifold. If the delta meaningfully changes a decision, rebuild the manifold.

row_count_decrease

Fewer rows came back. Typical causes:

  • Source data was deleted or corrected after generation (often fraud reversal or data-quality cleanup).
  • A predicate behaves differently (filter values were changed upstream, e.g., a supplier was deactivated).
  • Source dataset was rolled back to an earlier snapshot.

Response: investigate before acting. Decreases are more often suspicious than increases. Capture which rows are missing (the source-side audit log usually answers why) and decide whether the original manifold or the new replay reflects reality.

content_change

Same row count, different content. Typical causes:

  • In-place updates to existing rows: a supplier name correction, a unit_price fix, a re-mapped commodity classification.
  • Schema migration that preserves row identity but rewrites representation (currency conversion, UoM standardization).
  • Replay-time joins picking up newer reference data (newer supplier-mapping rows changing how the silver lines roll up).

Response: the diff matters more than the fact of the diff. Capture which columns changed in drift_delta. If the changes are corrections (e.g., supplier_name fixes), the new content is usually right. If they are restatements (commodity remapping), the old content was correct at the time the agent reasoned over it.

schema_change

The columns or types differ. Typical causes:

  • A column was added, removed, or renamed in the source dataset.
  • A type was widened or narrowed (int to bigint, text to varchar).
  • The view definition changed and now exposes a different shape.

Response: the QPS entry is now stale at the shape level. Either update the query template to match the new schema (preserving param signatures where possible) or mark the entry as superseded and emit a new one. Do not silently coerce the new shape to the old one; that hides a real upstream change.

query_failure

The query no longer executes at all. Typical causes:

  • Source dataset was renamed, dropped, or moved to a new catalog.
  • A reference view or function the query depends on is gone.
  • Permissions changed and the executing role can no longer read the source.

Response: the entry can no longer be replayed from source. If the original manifold is still in cache, you can still consume it (with the caveat that it cannot be verified). Migrate the QPS entry to point at the new source, or supersede it with a fresh entry.

Replay policy

When should you call qps_reconstruct rather than consume the cached manifold? The default heuristic:

  • Read-only L0 + L1 use: trust the manifold. The summary statistics and aggregated geometry do not need to round-trip to source on every read.
  • Drill into L2 evidence: replay. The whole point of L2 is to back a claim with raw rows; pulling those at read time gives the consumer a fresh view and writes an execution log entry.
  • Human audit ahead of a decision: replay. The cost of being wrong is high; the cost of one more SQL execution is trivial.
  • Routine agent drill-down: replay if the manifold is older than the staleness threshold (default 90 days for procurement) and consume cached otherwise.
  • Reconstructing what an agent saw at time T: DO NOT replay. The point is to reproduce the historical read, which means trusting the row_count and checksum captured in generation. Replaying tells you what the data looks like now, not what it looked like at T.

Retention and TTL

Default retention for procurement entries: two full audit cycles (24 months). Rationale: most procurement disputes surface within one audit cycle; a second cycle gives the consumer the prior year's entry to compare against during reconciliation.

Tighten or extend the default based on the regulatory posture:

  • SOX-relevant categories: 7 years from generation. Set retention_basis: "sox_seven_year".
  • FDA / pharma supplier records: typically 10 years from product end-of-life. Set retention_basis: "fda_pharma_10y".
  • R&D or exploratory analytics: 90 days is fine. Cheaper and less compliance surface.

Use delete_strategy: "soft_delete" for any entry with related_manifolds that might still be consumed. The execution log is the audit trail, and you want the trail readable even after the entry is expired.

Security expansion

The spec page covers security in four bullets; in practice these unpack into the following posture decisions.

Query template exposure

The query.template field reveals schema names, column structure, and join semantics. For some categories that is fine (most procurement views are well-known internally). For others (PII tables, fraud detection schemas), exposing the template is a real risk.

When exposure is unacceptable, use the dialect: "mcp_tool" variant. The QPS entry stores only a tool name and args; the underlying query lives inside the MCP tool implementation, behind whatever permissions you put on the tool registry.

Execution log as access pattern signal

The executions[] log reveals who ran what and when. For audit cases this is the value proposition. For threat-model cases (insider threat, suspicious access patterns), the log itself becomes sensitive: an attacker who reads it learns what other people have looked at.

Treat the QPS store as a Tier-2 sensitive system. Don't expose execution logs through the same surfaces you expose manifolds through; query and audit them through a more restricted role.

Parameter values vs. parameter templates

query.params contains the literal values used at generation. For a supplier_id or part_id this is fine. For an employee_id, an email, or any direct identifier, you have just pinned PII into a long-lived audit record.

Rule: never store direct PII as a QPS param. Hash, tokenize, or replace with an opaque ID at the source. If a query genuinely needs PII to run, use the mcp_tool dialect and bind the PII inside the tool, not in the QPS entry.

Replay as a side effect

qps_reconstruct executes against the source database. On large telemetry sets, an unconstrained replay endpoint can become a DOS vector against the warehouse. Rate-limit replays at the tool surface, prefer cached manifolds when the answer does not require fresh rows, and record replay execution time in executions[].execution_time_ms so cost outliers get visibility.