# Calculations reference

Math that the main SKILL.md procedure refers to. Self-contained so the
SKILL body can stay lean.

## Sample size and reliability

### Sample size classes

`sample_size_class` is a banded label on `observation_count`
(the number of PO lines in the window):

```text
if observation_count < 30:  "sparse"
elif observation_count < 100: "adequate"
else: "robust"
```

Why the bands matter:

- `sparse`: stats are unstable, CIs are wide, outlier detection is
  unreliable. Triggers `quality_flags.low_sample_size`. Use IQR for
  outlier flagging instead of z-score (see Outlier detection below).
- `adequate`: stats are usable but not rock-solid. Z-score is
  acceptable but compute the CI against a t-distribution rather than a
  normal.
- `robust`: n is large enough for the Central Limit Theorem's normal
  approximation to hold well; z-distribution CI and z-score outlier
  detection are both fine.

### Confidence interval for the mean

Always emit `confidence_in_mean` at the 95% level. Formula depends on
sample size class:

```text
sample stddev s = sqrt( sum( (x_i - mean)**2 ) / (n - 1) )
standard error  = s / sqrt(n)

if class == "robust":
    margin_of_error = 1.96 * standard_error
else:
    margin_of_error = t_critical(0.025, df = n - 1) * standard_error

interval = [mean - margin_of_error, mean + margin_of_error]
```

In the t-distribution branch, the t-critical at df = 5 (n = 6) is
~2.57; at df = 29 (n = 30) it is ~2.05, against 1.96 for the normal.
The CI gets visibly wider as n shrinks: that is a feature, not a bug.
Use sample stddev (divide by `n - 1`), not population stddev.
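
As a minimal sketch of the banding and the interval together, the Python
below computes both from a list of unit prices. The function names and the
`t_critical` callable are assumptions for illustration. The standard
library has no t-distribution, so the caller supplies the critical value
(for example from a lookup table, or `scipy.stats.t.ppf(0.975, df)` if
SciPy is available).

```python
import math
import statistics


def sample_size_class(observation_count):
    """Band observation_count into sparse / adequate / robust."""
    if observation_count < 30:
        return "sparse"
    if observation_count < 100:
        return "adequate"
    return "robust"


def confidence_in_mean(values, t_critical):
    """Return the 95% CI for the mean as (low, high).

    t_critical is a caller-supplied function df -> two-sided 95% critical
    value (a hypothetical hook, not part of the spec).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations for a sample stddev")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample stddev, divides by n - 1
    standard_error = s / math.sqrt(n)
    if sample_size_class(n) == "robust":
        critical = 1.96
    else:
        critical = t_critical(n - 1)
    margin_of_error = critical * standard_error
    return (mean - margin_of_error, mean + margin_of_error)


# Six observations -> df = 5, where the t-critical is ~2.571.
prices = [10, 12, 11, 13, 9, 11]
low, high = confidence_in_mean(prices, t_critical=lambda df: {5: 2.571}[df])
```

With these six prices the mean is 11 and the interval comes out at roughly
9.52 to 12.48.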

## Population and spend share

### population_size and population_rank

For a manifold of kind `K` over time_window `W`:

```text
population       = distinct entities of kind K observed in the
                   source table during window W
population_size  = count(population)
population_rank  = rank of this entity by total_spend_usd within
                   population, descending (1 = largest spender)
```

Do not extrapolate beyond W. If the window is 24 months,
`population_size` is "active entities in the last 24 months", not
"all entities the company has ever transacted with".

### pct_of_total_spend

```text
pct_of_total_spend = this_entity.total_spend_usd
                   / sum(e.total_spend_usd for e in population)
```

The denominator is total spend across the entire population of kind K
in the window. It is NOT total company procurement spend across all
kinds; that is a different metric and TMS does not cross kinds.

For a `supplier_category` manifold, denominator = sum across all
supplier_categories in the window. For a `supplier` manifold,
denominator = sum across all suppliers in the window. And so on.
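
A small sketch of the three population metrics, assuming the window filter
has already been applied and spend is available as a mapping from entity to
`total_spend_usd`. The function name and the tie rule (entities with equal
spend share a rank) are assumptions; the spec above does not say how ties
are ranked.

```python
def population_metrics(spend_by_entity, entity):
    """Compute population_size, population_rank and pct_of_total_spend.

    spend_by_entity: {entity_id: total_spend_usd} for one kind K, already
    restricted to window W.
    """
    total = sum(spend_by_entity.values())
    this_spend = spend_by_entity[entity]
    # Rank 1 = largest spender; ties share the better rank (assumption).
    rank = 1 + sum(1 for s in spend_by_entity.values() if s > this_spend)
    return {
        "population_size": len(spend_by_entity),
        "population_rank": rank,
        "pct_of_total_spend": this_spend / total if total else 0.0,
    }


suppliers = {"A": 500.0, "B": 300.0, "C": 200.0}
metrics = population_metrics(suppliers, "B")
```

Supplier B is second of three and holds 30% of the supplier population's
spend in the window.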

## Data quality

### data_quality_score

```text
data_quality_score = 1 - (imputed_or_null_count / total_row_count)
```

Range 0 to 1. Anything below 0.90 means a meaningful fraction of rows
needed imputation or had missing critical fields. When below 0.90,
emit `data_quality_notes` explaining what was filled and why.
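
A worked instance of the formula, as a sketch. The function name and the
zero-row guard are assumptions; the row counts are made up for
illustration.

```python
def data_quality_score(imputed_or_null_count, total_row_count):
    """Share of rows that needed no imputation and had no missing fields."""
    if total_row_count <= 0:
        raise ValueError("total_row_count must be positive")
    return 1 - (imputed_or_null_count / total_row_count)


# 120 of 1,000 rows were imputed or had nulls.
score = data_quality_score(120, 1000)
needs_notes = score < 0.90  # below threshold, so emit data_quality_notes
```

Here the score is 0.88, which falls below 0.90 and so calls for
`data_quality_notes`.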

### What counts as imputed

An "imputed" value is anything derived rather than directly observed.
Common cases:

- `unit_price` computed from `spend / qty` when the source row has
  spend but no unit_price.
- `qty` computed from `spend / unit_price` when the source row has
  spend and unit_price but no qty.
- `site_code` defaulted (e.g., to `"UNKNOWN"`) when the source row
  lacks site attribution.
- Currency converted via FX (see Currency normalization).
- Date defaulted to month-end or quarter-end when the source row
  carries only a coarse period.

Imputed values count against `data_quality_score` AND trigger
`quality_flags.imputation_applied = true`. Each imputation rule
applied must appear in `lineage.transformations` so an auditor can
reproduce.

### Staleness threshold

```text
days_since_last = today - staleness.last_observation
is_stale = days_since_last > stale_threshold_days
```

Recommended `stale_threshold_days`:

- `90` for routine PO-line procurement data (the default).
- `365` for slow-cadence categories like capital equipment (long
  replenishment cycles; sparse observations are expected).
- `30` for real-time-ish flows (e.g., daily transactional categories
  in CPG or hospitality).

If you override the default, record it in
`reliability.staleness.stale_threshold_days` and call out the choice
in `data_quality_notes`.
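
The check can be sketched with `datetime.date` arithmetic. The function
name and return shape are assumptions; the dates are illustrative.

```python
from datetime import date


def staleness(last_observation, today, stale_threshold_days=90):
    """Return (days_since_last, is_stale) for a given threshold."""
    days_since_last = (today - last_observation).days
    return days_since_last, days_since_last > stale_threshold_days


last_seen = date(2024, 1, 1)
as_of = date(2024, 4, 15)

routine = staleness(last_seen, as_of)                              # default 90
capital = staleness(last_seen, as_of, stale_threshold_days=365)    # override
```

The same 105-day gap is stale under the routine default but not under the
slow-cadence override, which is why the override has to be recorded.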

## Currency normalization

TMS assumes a single reporting currency in the manifold (typically
USD). If the source table carries multiple currencies, normalize to
one BEFORE building the manifold, not after.

Process:

1. Pick a reporting currency. Record it on the relevant financial
   block if you need to be explicit (e.g., `currency: "USD"` on the
   subject or financial_summary).
2. Pick an FX policy:
   - **Monthly average FX**: simple, suitable for spend analytics
     where intra-month volatility is small.
   - **End-of-period FX**: matches accounting reconciliation cadence.
   - **Transaction-date FX**: most accurate but most expensive to
     compute. Use when individual lines matter (e.g., commodity_group
     analyses for hedged commodities).
3. Document the policy in `lineage.transformations` (one prose line)
   AND set `quality_flags.imputation_applied = true`. FX conversion is
   imputation.
4. Never mix currencies in distribution stats, HHI math, or rollups.
   The numbers come out meaningless: the CV blows up, the HHI bands
   stop mapping to the FTC scale, and rank ordering is wrong.
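
A minimal sketch of the monthly-average policy, applied before any stats
are computed. The row fields (`spend`, `currency`, `po_date`), the function
name, and the rate table are all hypothetical, and the EUR rate is a
placeholder value, not a real quote.

```python
def normalize_to_reporting(rows, monthly_avg_fx, reporting_currency="USD"):
    """Convert each row's spend using a monthly average FX table.

    monthly_avg_fx maps (currency, "YYYY-MM") to units of reporting
    currency per one unit of the source currency.
    Returns (normalized_rows, fx_applied).
    """
    normalized = []
    fx_applied = False
    for row in rows:
        if row["currency"] == reporting_currency:
            rate = 1.0
        else:
            month = row["po_date"][:7]  # "YYYY-MM" from an ISO date string
            rate = monthly_avg_fx[(row["currency"], month)]
            fx_applied = True
        normalized.append(
            {**row, "spend": row["spend"] * rate, "currency": reporting_currency}
        )
    return normalized, fx_applied


fx_table = {("EUR", "2024-01"): 1.10}  # placeholder rate for illustration
po_lines = [
    {"spend": 100.0, "currency": "EUR", "po_date": "2024-01-15"},
    {"spend": 50.0, "currency": "USD", "po_date": "2024-01-20"},
]
normalized, fx_applied = normalize_to_reporting(po_lines, fx_table)
```

When `fx_applied` is true, that is the signal to set
`quality_flags.imputation_applied` and add the policy line to
`lineage.transformations`.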

## Discipline rating bands

`discipline_rating` is a banded label on `weighted_avg_item_cv`:

```text
if weighted_avg_item_cv < 0.10: "Excellent"
elif weighted_avg_item_cv < 0.25: "Good"
elif weighted_avg_item_cv < 0.50: "Fair"
else: "Poor"
```

Always emit a `rating_basis` object alongside `discipline_rating`:

```json
"rating_basis": {
  "metric": "weighted_avg_item_cv",
  "threshold_excellent": 0.10,
  "threshold_good": 0.25,
  "threshold_fair": 0.50,
  "note": "Spend-weighted per-item CV. Cross-category comparisons are not meaningful."
}
```

This lets the consumer re-band against their own tolerance without
re-reading the spec.
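
As a sketch of what re-banding looks like on the consumer side, the
function below reads its thresholds from a `rating_basis`-shaped dict. The
function name and the stricter thresholds in the second call are
assumptions for illustration.

```python
DEFAULT_BASIS = {
    "metric": "weighted_avg_item_cv",
    "threshold_excellent": 0.10,
    "threshold_good": 0.25,
    "threshold_fair": 0.50,
}


def discipline_rating(weighted_avg_item_cv, basis=DEFAULT_BASIS):
    """Band a CV using the thresholds carried in a rating_basis object."""
    if weighted_avg_item_cv < basis["threshold_excellent"]:
        return "Excellent"
    if weighted_avg_item_cv < basis["threshold_good"]:
        return "Good"
    if weighted_avg_item_cv < basis["threshold_fair"]:
        return "Fair"
    return "Poor"


default_rating = discipline_rating(0.18)

# A consumer with a tighter tolerance re-bands the same CV.
strict_basis = {**DEFAULT_BASIS, "threshold_good": 0.15}
strict_rating = discipline_rating(0.18, strict_basis)
```

The same CV of 0.18 rates "Good" under the default thresholds and "Fair"
under the tighter ones.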

## HHI on a rollup output

Herfindahl-Hirschman Index measures concentration. Compute over **all
rows including the Pareto-truncated tail**:

```text
hhi = sum( (row.spend / total_spend) ** 2 for row in all_rows )
top_supplier_pct = max(row.spend for row in all_rows) / total_spend
```

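Bands (FTC convention: the 1,500 and 2,500 thresholds on the 0-10,000
scale, rescaled to 0-1):

- `hhi < 0.15` unconcentrated
- `0.15 <= hhi <= 0.25` moderately concentrated
- `hhi > 0.25` highly concentrated (sets `high_supplier_concentration` flag)
- `hhi == 1.0` monopoly (single source), the upper bound of the highly
  concentrated band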

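The tail belongs in the HHI math even when it is summarized away in the
rollup. Using only the truncated head distorts concentration: with shares
taken against `total_spend` it drops the tail's terms and under-states
HHI, and with shares renormalized to head spend it over-states HHI.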
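
A short numeric sketch of why the tail matters, using three made-up spend
figures. The function name and the optional `total_spend` argument are
assumptions for illustration.

```python
def hhi(spends, total_spend=None):
    """Sum of squared spend shares; shares default to the rows' own total."""
    total = sum(spends) if total_spend is None else total_spend
    if total <= 0:
        raise ValueError("total spend must be positive")
    return sum((s / total) ** 2 for s in spends)


all_rows = [50.0, 30.0, 20.0]   # full population, tail included
head = all_rows[:2]             # what survives truncation

hhi_full = hhi(all_rows)
hhi_head_vs_total = hhi(head, total_spend=sum(all_rows))
hhi_head_renormalized = hhi(head)
top_supplier_pct = max(all_rows) / sum(all_rows)
```

The full HHI is about 0.38. Dropping the tail while keeping the full
denominator gives about 0.34, and renormalizing the head to its own total
gives about 0.53, so neither head-only variant reproduces the true figure.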

## Weighted average per-item CV

Isolates pricing volatility from product-mix effects:

```text
items_eligible = [i for i in items if i.po_line_count >= 2]
total_eligible_spend = sum(i.spend for i in items_eligible)

weighted_avg_item_cv = sum(
  i.cv * (i.spend / total_eligible_spend)
  for i in items_eligible
)
```

Items with a single PO line have no sample stddev (`n - 1 = 0`) and are
excluded. Track the eligibility share separately:

```text
pricing_data_coverage_pct = total_eligible_spend / total_entity_spend
```

If `pricing_data_coverage_pct < 0.80`, add a note in
`price_stability.rating_basis.note` explaining the gap. The
`discipline_rating` is only as trustworthy as its coverage.
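
The two formulas can be sketched together. Items are plain dicts here and
the function name is an assumption; the spend and CV values are made up for
illustration.

```python
def weighted_cv_and_coverage(items):
    """Return (weighted_avg_item_cv, pricing_data_coverage_pct).

    Each item is a dict with spend, cv and po_line_count. Items with fewer
    than 2 PO lines are excluded from the CV but count toward total spend.
    """
    eligible = [i for i in items if i["po_line_count"] >= 2]
    total_eligible_spend = sum(i["spend"] for i in eligible)
    total_entity_spend = sum(i["spend"] for i in items)
    if total_eligible_spend == 0:
        return None, 0.0  # no item has enough lines to measure volatility
    weighted = sum(
        i["cv"] * (i["spend"] / total_eligible_spend) for i in eligible
    )
    return weighted, total_eligible_spend / total_entity_spend


items = [
    {"item": "A", "spend": 600.0, "cv": 0.10, "po_line_count": 5},
    {"item": "B", "spend": 200.0, "cv": 0.30, "po_line_count": 3},
    {"item": "C", "spend": 200.0, "cv": None, "po_line_count": 1},
]
weighted_avg_item_cv, pricing_data_coverage_pct = weighted_cv_and_coverage(items)
```

Item C is excluded, so the weights are 0.75 and 0.25, the weighted CV is
0.15, and coverage sits exactly at 0.80.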

## Pareto truncation algorithm

```text
function pareto_truncate(rows, target_coverage, min_rows, max_rows,
                         rank_metric="spend"):
    sorted_rows = sort(rows, key=rank_metric, descending=True)
    total = sum(r[rank_metric] for r in sorted_rows)

    emitted = []
    cumulative = 0.0
    for row in sorted_rows:
        if len(emitted) >= max_rows:
            break
        if (len(emitted) >= min_rows and
            cumulative / total >= target_coverage):
            # tie_extended_by_spend_v2: include this row only if it
            # adds non-trivial incremental spend at the boundary
            if row[rank_metric] / total >= 0.005:
                emitted.append(row)
                cumulative += row[rank_metric]
            break
        emitted.append(row)
        cumulative += row[rank_metric]

    tail = sorted_rows[len(emitted):]
    tail_summary = summarize(tail)

    return {
        "rows": emitted,
        "rows_truncated": len(tail),
        "tie_break": "tie_extended_by_spend_v2",
        "tail_summary": tail_summary,
    }
```

Default `target_coverage`:

- `supplier_rollup`: 0.80
- `item_rollup`: 0.80
- `industry_rollup`: 0.95
- `sub_category_rollup`: 0.95
- `commodity_rollup`: 0.80

Default `min_rows = 5`, `max_rows = 50` (supplier/item) or
`max_rows = 20` (industry/sub_category).
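
For reference, a runnable Python sketch of the same loop on toy data. The
zero-total guard and the reduced `tail_summary` (spend fields only) are
assumptions made to keep the example self-contained; the supplier rows are
made up.

```python
def pareto_truncate(rows, target_coverage=0.80, min_rows=5, max_rows=50,
                    rank_metric="spend"):
    """Keep the head rows that reach target_coverage; summarize the rest."""
    sorted_rows = sorted(rows, key=lambda r: r[rank_metric], reverse=True)
    total = sum(r[rank_metric] for r in sorted_rows)

    emitted = []
    cumulative = 0.0
    for row in sorted_rows:
        if len(emitted) >= max_rows:
            break
        covered = total > 0 and cumulative / total >= target_coverage
        if len(emitted) >= min_rows and covered:
            # Boundary row: keep it only if it adds >= 0.5% of total.
            if row[rank_metric] / total >= 0.005:
                emitted.append(row)
                cumulative += row[rank_metric]
            break
        emitted.append(row)
        cumulative += row[rank_metric]

    tail = sorted_rows[len(emitted):]
    tail_spend = sum(r[rank_metric] for r in tail)
    return {
        "rows": emitted,
        "rows_truncated": len(tail),
        "tie_break": "tie_extended_by_spend_v2",
        # Spend-based fields only; po_count and po_line_count are omitted
        # because these toy rows do not carry them.
        "tail_summary": {
            "rows_truncated": len(tail),
            "spend": tail_spend,
            "pct_of_spend": tail_spend / total if total > 0 else 0.0,
        },
    }


suppliers = [{"supplier": name, "spend": spend} for name, spend in
             [("A", 40.0), ("B", 30.0), ("C", 15.0), ("D", 10.0),
              ("E", 3.0), ("F", 1.0), ("G", 1.0)]]
result = pareto_truncate(suppliers, target_coverage=0.80, min_rows=2)
```

Here A, B, and C reach 85% coverage, D is added by the boundary rule
because it carries 10% of total spend, and E, F, and G fall into the tail
with 5% of spend.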

## tail_summary structure

Minimum fields for any tail_summary:

```json
{
  "rows_truncated": 0,
  "spend": 0,
  "pct_of_spend": 0,
  "po_count": 0,
  "po_line_count": 0
}
```

Type-specific extras:

- supplier_rollup tail → add `item_count` and `tail_industry_mix` (top 5
  industries by tail spend, each `{industry, spend, pct_of_tail}`)
- item_rollup tail → add `distinct_commodity_groups` and
  `tail_commodity_mix` (top 5 commodity groups)
- commodity_rollup tail → add `distinct_sub_categories`
- industry_rollup tail → add `distinct_industries` (typically equals
  `rows_truncated` since each row is already an industry)

## Outlier detection

Used to populate `quality_flags.suspected_outliers` (boolean) and to
flag individual rows in `level_2_telemetry.inline_rows`.

### Z-score (default for adequate and robust samples)

```text
z_score = (row.unit_price - entity.mean) / entity.stddev
```

A row is a z-score outlier when `|z_score| >= 3`; emit it with `flag` set.
Flag values used in the TechnoFlex examples (the last two mark rows worth
surfacing that are not outliers by this rule):

- `"price_outlier_high"` for `z_score >= 3`
- `"price_outlier_low"` for `z_score <= -3`
- `"uom_likely_misencoded"` for `z_score >= 8`, a subset of the
  `price_outlier_high` range (typically a per-railcar price mistakenly
  entered per-unit; verify the spend math to confirm)
- `"price_peak_period"` for `1.5 <= z_score < 3`
- `"high_volume_lane"` for normal-priced rows with qty in the top
  decile (not an outlier per se, but useful evidence)
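
A sketch of the price-based flags as a single function. The ranges for
`z_score >= 8` and `z_score >= 3` overlap and no precedence is stated
above, so this sketch assumes the most specific flag wins. It also skips
rows when stddev is zero, and leaves out `"high_volume_lane"` because that
flag depends on qty rather than price. The function name is an assumption.

```python
def price_flag(unit_price, mean, stddev):
    """Return a z-score based flag for one row, or None."""
    if stddev == 0:
        return None  # z-score undefined when every price is identical
    z_score = (unit_price - mean) / stddev
    if z_score >= 8:          # assumed to take precedence over >= 3
        return "uom_likely_misencoded"
    if z_score >= 3:
        return "price_outlier_high"
    if z_score <= -3:
        return "price_outlier_low"
    if z_score >= 1.5:
        return "price_peak_period"
    return None


# Entity with mean unit price 10 and stddev 2.
flags = {price: price_flag(price, mean=10.0, stddev=2.0)
         for price in (3.0, 10.0, 14.0, 17.0, 30.0)}
suspected_outliers = any(
    f in ("price_outlier_high", "price_outlier_low", "uom_likely_misencoded")
    for f in flags.values()
)
```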

### IQR fallback (for sparse samples)

When `sample_size_class == "sparse"` (n < 30), z-score breaks down
because stddev is unreliable. Use IQR instead:

```text
Q1 = 25th percentile of unit_price
Q3 = 75th percentile of unit_price
IQR = Q3 - Q1

flag row as outlier if:
    row.unit_price < Q1 - 1.5 * IQR     (low)
    or row.unit_price > Q3 + 1.5 * IQR  (high)
```

Flag vocabulary for IQR-detected outliers:

- `"iqr_outlier_high"` for the upper tail
- `"iqr_outlier_low"` for the lower tail

Treat IQR and z-score as parallel detectors with the same downstream
behavior. `quality_flags.suspected_outliers` is true if EITHER
detector fires on any row. Record which detector was used in
`level_2_telemetry.retrieval` notes or in `data_quality_notes`.
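
A minimal sketch of the IQR detector using the standard library. The
percentile method is not specified above; this example assumes
`statistics.quantiles` with `method="inclusive"`, and other methods can
give slightly different fences on small samples. The function name and
prices are made up for illustration.

```python
import statistics


def iqr_flags(unit_prices):
    """Return {index: flag} for rows outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(unit_prices, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flags = {}
    for index, price in enumerate(unit_prices):
        if price < low_fence:
            flags[index] = "iqr_outlier_low"
        elif price > high_fence:
            flags[index] = "iqr_outlier_high"
    return flags


prices = [10, 11, 12, 13, 14, 15, 40]   # sparse sample, n = 7
outlier_flags = iqr_flags(prices)
suspected_outliers = bool(outlier_flags)
```

With these prices Q1 is 11.5 and Q3 is 14.5, so the fences are 7 and 19
and only the last row (40) is flagged.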
