--- name: create-tms-manifold description: Build a TMS v1.2 entity_profile manifold from PO-line data. Generates the L0 summary, L1 geometry, L2 telemetry preview, and lineage block in one envelope. --- # create-tms-manifold Build a Tabular Manifold Spec (TMS) v1.2 entity_profile manifold for one of: `item`, `supplier`, `commodity_group`, `supplier_category`. Output is a single JSON document that an agent can consume progressively (L0 first, L1 on demand, L2 for forensics). Full TMS spec: https://canonical.agency/spec/tms Companion QPS spec (provenance): https://canonical.agency/spec/qps Reference examples (TechnoFlex): https://canonical.agency/spec/tms#reference-examples ## When to use this skill Use this skill when the user wants to ship a manifold for procurement, spend-intelligence, or supplier-risk workflows and the source is a transactional PO-line table (silver-layer, one row per PO line) plus optional supplier-consolidation and item-categorization joins. ## What you produce A single JSON object with these top-level keys, in this order: ```json { "artifact_type": "tabular_manifold", "artifact_version": "1.2", "manifold_kind": "entity_profile", "subject": { }, "time_window": { }, "level_0_summary": { }, "level_1_geometry": { }, "level_2_telemetry": { }, "token_budget": { }, "lineage": { } } ``` L0 must include `reliability` and `quality_flags`. L1 is optional but recommended for time-windowed entities. L2 must declare a `strategy` and a `retrieval` block even when the preview is empty. ## Inputs you need from the user Ask up front, in one round-trip if possible: 1. **Entity type and ID.** One of `item` (ca_item_number), `supplier` (parent_supplier_id), `commodity_group` (commodity_group_label), `supplier_category` (sub_commodity_label). 2. **Source table** and dialect (Databricks SQL / Snowflake / DuckDB / Postgres). At minimum, a PO-line silver table; ideally also the supplier-mapping and item-categorization views. 3. **Time window.** Default to the last 24 months ending on the most recent observation. 4. **Granularity** for L1. Default to `month` for windows ≥ 6 months, `day` for shorter windows. If anything is missing and you can infer a sensible default, state the default and proceed. Do not stall on optional parameters. ## Process ### 1. Confirm scope and pull headline numbers Run a single aggregate query to confirm the entity exists in the window and to size the work: ```sql SELECT COUNT(*) AS po_line_count, COUNT(DISTINCT po_num) AS po_count, MIN(purchase_order_date) AS first_purchase, MAX(purchase_order_date) AS last_purchase, SUM(spend) AS total_spend_usd FROM WHERE AND purchase_order_date BETWEEN :start AND :end; ``` If `po_line_count < 30`, mark `sample_size_class: "sparse"` and warn the user that L0 stats will be unreliable. Do not refuse; emit the manifold anyway with the flag set. ### 2. Build `level_0_summary` Required sub-blocks: - **`observation_count`** = po_line_count from step 1. - **`time_coverage`** = first/last purchase + window_days. - **`distribution`** = min, max, mean, median, stddev, cv on `unit_price`. - **`reliability`** = sample_size_class, confidence_in_mean (95%), data_quality_score (1 minus the share of rows with imputed or null unit_price), staleness block. - **`quality_flags`** = booleans. Trigger rules: - `low_sample_size`: sample_size_class == "sparse" - `data_staleness`: staleness.is_stale == true - `high_variance`: distribution.cv > 0.3 (entity-level only; skip for supplier and supplier_category where the global CV is mechanically large) - `suspected_outliers`: any row with |z_score| > 3 - `low_supplier_diversity`: (item only) supplier_count == 1 - `high_supplier_concentration`: HHI > 0.25 - `high_item_concentration`: top_item_pct > 0.5 - `level_1_*_truncated`: set when the Pareto truncation in §3 dropped rows - **`financial_summary`** = total_spend_usd, total_qty_received, po_count, po_line_count, avg_spend_per_po_line, pct_of_total_spend, population_rank, population_size. - **Concentration blocks** (per entity type): - item → `supplier_concentration` (HHI, top_supplier_pct, top_3) - supplier → `item_concentration` + `commodity_concentration` - commodity_group → `supplier_concentration` + `item_diversity` + `sub_category_breakdown` - supplier_category → `supplier_concentration` + `item_diversity` + `industry_breakdown` + `direct_indirect_mix` - **`price_stability`** = global cv, weighted_avg_item_cv (spend-weighted per-item CV over items with ≥2 PO lines), discipline_rating from the bands `{Excellent: <0.10, Good: <0.25, Fair: <0.50, Poor: ≥0.50}`, and a `rating_basis` object documenting thresholds and method. - **`relationship`** = first/last purchase, relationship_days, avg_monthly_po_frequency. - **`interpretation_hints`** = 3-4 sentences naming the dominant signal, one quantitative anchor each. Lead with the actionable finding, not the metric definition. ### 3. Build `level_1_geometry` `monthly_timeseries`: one row per month in the window. Columns: `period`, `n`, `spend`, `qty`, `mean_unit_price`, optional `stddev_unit_price`. Include `missing_periods: []` if the window is contiguous; otherwise list the gaps. Then add one or two rollups (per entity type): - item → `supplier_rollup` - supplier → `item_rollup` + `commodity_rollup` - commodity_group → `supplier_rollup` + `item_rollup` + `sub_category_rollup` - supplier_category → `supplier_rollup` + `item_rollup` + `industry_rollup` Each rollup uses **Pareto truncation**: ``` Sort rows DESC by spend. Emit rows until cumulative pct_of_spend >= target_coverage, with rows_emitted bounded by [min_rows, max_rows]. For ties at the boundary, use tie_break = "tie_extended_by_spend_v2" (include the tied rows if they add unique spend). Summarize remainder as tail_summary { rows_truncated, spend, pct_of_spend, po_count, po_line_count, plus type-specific extras }. ``` Coverage targets: 0.80 for supplier/item rollups, 0.95 for industry/sub_category rollups (long-tail signal matters more there). ### 4. Build `level_2_telemetry` Set `row_count_total` to the total PO lines in the window. Choose a `strategy`: - `preview_outliers` if any |z_score| > 2 rows exist (recommended default) - `preview_recent` for stable entities with no outliers - `preview_concentration_evidence` for high-concentration commodities - `full_inline` only if row_count_total < 50 - `none` if downstream agents will always retrieve Cap inline_rows at the limit you'll record in `token_budget`. Each row should include: rn, po_num, date, site, supplier_name, ca_item_number (or part_description), qty, qty_uom, unit_price, spend, z_score, flag. Add a `retrieval` block. For agents, prefer `method: "mcp_tool"` with tool_name and tool_args. For human auditors, also include `sql_fallback` with the verbatim query_template and query_params used to populate the inline preview. ### 5. Write the `token_budget` block Estimate L0 / L1 / L2 token costs (counts of UTF-8 chars / 4 is a usable approximation). Always emit `compression_ratio_l1_vs_l2` and `compression_ratio_l0_vs_l2`. Write a one-sentence `recommended_strategy` describing what an agent should read first. ### 6. Write the `lineage` block `manifold_id` format: `mfld___`. Include `qps_entry_id` (a parallel ID for the Query Provenance Store companion spec). List each input dataset with `dataset_id`, `version` (or as_of_timestamp), `row_count`, and the `as_of_timestamp` used. List `filters_applied` (each as `{field, operator, value}`) and `transformations` (one prose line per transformation). Compute a `checksum` over the sorted, serialized telemetry rows (`method: "sha256_row_hash"`). ## Validation checklist Before returning the manifold, verify: - [ ] `artifact_type == "tabular_manifold"` and `artifact_version` set. - [ ] `manifold_kind == "entity_profile"`. - [ ] `level_0_summary.reliability` and `level_0_summary.quality_flags` both present. - [ ] HHI math: sum of `(spend_i / total_spend)^2` across the concentration grain. Sanity: HHI ≤ 1.0. - [ ] All top_3 / top_5 rows are sorted DESC by spend. - [ ] Percentages in any rollup (`pct_of_spend`) plus the tail_summary percentage sum to ~1.0 (allow rounding). - [ ] Every monthly_timeseries period falls inside `time_window`. - [ ] `level_2_telemetry.strategy` is one of the enum values. - [ ] `lineage.inputs[]` is non-empty. - [ ] `quality_flags.level_1_*_truncated` matches whether any rollup actually truncated. - [ ] No null values in required fields. Use empty strings or 0 only where the spec allows. ## Templates ### Empty envelope skeleton ```json { "artifact_type": "tabular_manifold", "artifact_version": "1.2", "manifold_kind": "entity_profile", "subject": { "entity_type": "" }, "time_window": { "start": "YYYY-MM-DD", "end": "YYYY-MM-DD", "granularity": "month" }, "level_0_summary": { "observation_count": 0, "time_coverage": { }, "distribution": { }, "reliability": { }, "quality_flags": { }, "financial_summary": { }, "price_stability": { }, "relationship": { }, "interpretation_hints": [] }, "level_1_geometry": { "monthly_timeseries": { "format": "row_objects_v1", "granularity": "month", "rows": [] } }, "level_2_telemetry": { "row_count_total": 0, "strategy": "preview_outliers", "inline_rows": { "format": "row_objects_v1", "rows": [] }, "retrieval": { "method": "mcp_tool", "tool_name": "", "tool_args": { } } }, "token_budget": { "level_0_tokens_approx": 0, "level_1_tokens_approx": 0, "level_2_inline_tokens_approx": 0, "row_count_total": 0, "inline_row_limit": 10, "compression_ratio_l1_vs_l2": 0, "compression_ratio_l0_vs_l2": 0, "recommended_strategy": "" }, "lineage": { "manifold_id": "", "computed_at": "", "computed_by": "", "qps_entry_id": "", "inputs": [], "filters_applied": [], "transformations": [], "checksum": { "method": "sha256_row_hash", "value": "" } } } ``` ### Discipline-rating band logic ```text if weighted_avg_item_cv < 0.10: "Excellent" elif weighted_avg_item_cv < 0.25: "Good" elif weighted_avg_item_cv < 0.50: "Fair" else: "Poor" ``` ### HHI on rollup output ```text hhi = sum( (row.spend / total_spend) ** 2 for row in all_rows_including_tail ) top_supplier_pct = max(row.spend for row in all_rows) / total_spend ``` Use ALL rows for HHI, not just the truncated head. The tail belongs in the math even when it is summarized away in the rollup. ## What not to do - Do not collapse the global CV and the weighted_avg_item_cv into a single number. They measure different things and the user will read them differently. Always emit both with `rating_basis.note` explaining the distinction. - Do not emit `quality_flags.high_variance` for supplier and supplier_category manifolds based on the global CV. Their product mix makes the global metric mechanically large; use the weighted per-item metric instead, or omit the flag. - Do not invent values for fields the source can't support. Set the relevant `quality_flag` and let the agent decide. - Do not interpolate parameters into the SQL `query_template`. Use parameterized templates with `:name` placeholders and provide `query_params` separately. The QPS spec rejects interpolated SQL on purpose. - Do not include PII in `level_2_telemetry.inline_rows`. The preview is meant for outlier evidence, not full record dumps. ## Pointers - The four TechnoFlex reference manifolds at https://canonical.agency/spec/tms#reference-examples are the authoritative shape examples. When in doubt, match their structure. - The glossary on each viewer page (e.g. https://canonical.agency/manifolds/technoflex/supplier ) explains fields like HHI, weighted_avg_item_cv, and Pareto tie-breaking in plain language. - The companion QPS spec (https://canonical.agency/spec/qps ) covers the lineage / replay side. If the consumer will need to verify drift later, write the QPS entry at the same time as the manifold.