Demo · Calibration foundations

Three places upstream

where calibration fails before the agent sees it

Trust scores in production are weighted by feel. Five signals get combined, the composite ships, the agent acts on it. Most architectures audit the time gap. Almost none audit the weighting function. In each place below, the same readings can produce a different verdict. Stability is not the same as truthfulness.

01

The weighting function

Signals combined by feel, not by predictive power.

Almost none cover this

The weighting function decides how much each signal contributes to the composite trust score. Most weights are set by whoever drew the schema first. They survive every quarterly review because nobody audits them against realized outcomes. The signal that always reads high and still claims weight is the overweighted one. The signal that varies and predicts is the underweighted one.

Same readings, two weightings

Weights drawn when the schema was first sketched. Stable signals get the most weight because they look most reliable.

Lineage: 95 × 0.30
Freshness: 92 × 0.25
Ownership: 90 × 0.20
Usage (predicts): 32 × 0.10
Discoverability: 88 × 0.15

Composite

86

Act
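
The composite above is a plain weighted sum. A minimal sketch of that arithmetic, using the readings and weights shown in the panel; the names `SIGNALS` and `composite` are illustrative and not taken from any real system:

```python
# Readings and weights copied from the panel above: signal -> (reading, weight).
SIGNALS = {
    "lineage": (95, 0.30),
    "freshness": (92, 0.25),
    "ownership": (90, 0.20),
    "usage": (32, 0.10),
    "discoverability": (88, 0.15),
}


def composite(signals):
    """Weighted sum of readings; assumes the weights sum to 1."""
    return sum(reading * weight for reading, weight in signals.values())


score = composite(SIGNALS)
print(round(score))  # 86
```

Usage, the one signal marked as predictive, contributes only 32 × 0.10 = 3.2 of those 86 points, so its low reading barely moves the verdict.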

Klarna AI customer service, 2024 to 2025. The capability score weighted cost and throughput. Quality was not in the composite. By May 2025 customers were complaining and the CEO reversed course, citing "cost as a too predominant evaluation factor." The capability number was honest. The weighting was not.

Audit checklist

  1. List every signal that contributes to the composite score.
  2. For each signal, find the last time its weight was changed.
  3. For each weight, find the outcome it was calibrated against. If the answer is "feel," that signal is overweighted or underweighted by default.
  4. Pull realized outcomes from the last quarter. Compare predicted score to observed correctness. Rebalance.

02

The aggregation

A composite hides a load-bearing weak signal.

Rarely audited

When the composite ships as a single number, the consumer cannot see which signal is doing the work. A high score can be carried by four healthy signals while the fifth, the load-bearing one, is silently failing. The aggregation hides exactly the failure the score was supposed to surface.

The average-depth problem

"Never cross a river if it is on average four feet deep."
Nassim Taleb

Average depth: 3 ft. Actual: 10 ft (you drown here).

Same shape, different surface. Four signals at 92 and one at 35 average to 80.6, which displays as 81. The drowning happens at the 35.

Trust score

81

Average view

Act

This is the number the consumer sees. The pipeline broke upstream of one of the five signals 14 hours ago.

Pipeline fails upstream. Data is 14 hours stale. Freshness check still passes because it monitors the cube, not the source. All five signals look green in the composite. Score says 81. Reality is 35. A leader acts. Nobody catches it for days.

Audit checklist

  1. Show the composite alongside each contributing signal, not as a single number.
  2. For each query, surface which signal is the lowest in the bundle.
  3. If any contributing signal is below a floor, hold the composite back even when the average looks healthy.
  4. Track decision-time consumption of the score against the floor breach rate.
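
Steps 1 to 3 of this checklist can be sketched as a floor gate. The threshold value, the function name `gated_score`, and the choice of which signal carries the 35 are assumptions made for illustration; the readings themselves are the four 92s and one 35 from the example above.

```python
from statistics import mean

FLOOR = 50  # hypothetical floor; choose one from your own outcome data


def gated_score(signals, floor=FLOOR):
    """Return the composite with its breakdown, held back on a floor breach."""
    average = mean(signals.values())
    weakest = min(signals, key=signals.get)  # name of the lowest signal
    breached = signals[weakest] < floor
    return {
        "composite": None if breached else round(average),
        "average": average,
        "weakest": weakest,
        "signals": dict(signals),  # always ship the per-signal readings
        "verdict": "hold" if breached else "act",
    }


# Four signals at 92 and one at 35; assigning the 35 to freshness is an assumption.
readings = {
    "lineage": 92,
    "freshness": 35,
    "ownership": 92,
    "usage": 92,
    "discoverability": 92,
}

result = gated_score(readings)
print(result["verdict"], result["weakest"])  # hold freshness
```

The average is still computed and returned, so the consumer can see that a healthy-looking 80.6 was withheld and which signal caused it.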

03

The time gap

The score was honest at construction. The world moved.

Most architectures cover this

The time gap is the place most architectures already cover. Freshness flags expire stale signals. Decay curves discount older readings. Re-pull cadences refresh upstream sources. Solved enough that the conversation has moved on. If your composite is missing time-gap coverage in 2026, that is a separate problem.

Example

Definition drift after a metric owner change

A metric owner changes the formula on a Tuesday afternoon. Six months of behavioral signal becomes irrelevant in a single commit. Decay curves and a severity classifier catch this. Most modern data platforms have an answer here.

Audit checklist

  1. Confirm every signal has a freshness flag tied to its actual source, not its render layer.
  2. Confirm decay curves discount older interaction data.
  3. Confirm a severity classifier exists for definition changes and propagates a confidence drop.
  4. If this audit is clean, move attention back to places one and two.
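
Steps 1 and 2 can be sketched as below. The freshness budget, the half-life, and both function names are hypothetical. The 14-hour staleness comes from the example in place two, while the 5-minute render refresh is invented for contrast. Exponential decay is one common choice of decay curve, assumed here rather than taken from the source.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=6)    # hypothetical freshness budget
HALF_LIFE = timedelta(days=30)  # hypothetical decay half-life


def is_fresh(updated_at, now, max_age=MAX_AGE):
    """True if the timestamp is within the freshness budget."""
    return now - updated_at <= max_age


def decay_weight(observed_at, now, half_life=HALF_LIFE):
    """Exponential decay: a reading loses half its weight every half_life."""
    age = now - observed_at
    return 0.5 ** (age / half_life)  # timedelta / timedelta yields a float


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
source_updated_at = now - timedelta(hours=14)     # the upstream source
render_refreshed_at = now - timedelta(minutes=5)  # the cube or dashboard

# The same check gives opposite answers depending on which timestamp it reads.
print(is_fresh(render_refreshed_at, now))  # True
print(is_fresh(source_updated_at, now))    # False
```

The check is only as good as the timestamp passed to it, which is why step 1 asks for the flag to be tied to the source rather than the render layer.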

The fix

Structural, not parametric.

Tie signal weights to outcomes. Recalibrate against what happened, not what felt important when the schema was drawn. A trust score nobody audits against outcomes is a check engine light that has been on for two years. Decoration, not signal.

Part of: Stage 03 · Grading Itself

A contract that cannot grade itself is decoration. The architecture audits its own trust scores against what actually happened, because a score nobody checks against outcomes is a check engine light that has been on for two years.