Calibration
GENYS measures how closely probabilities align with reality. Calibration is not a feature — it is the product.
What GENYS Measures
Brier Score
Mean squared error across all resolved predictions. The single most important accuracy metric. 0.0 = perfect, 1.0 = worst possible.
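For binary outcomes, the Brier score reduces to a few lines of Python. This is a minimal sketch, not GENYS's actual implementation; the function name and signature are illustrative:

```python
def brier_score(preds, outcomes):
    """Mean squared error between predicted probabilities (0.0-1.0)
    and resolved binary outcomes (0 or 1).
    Lower is better: 0.0 is perfect, 1.0 is maximally wrong."""
    assert preds and len(preds) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)
```

Note that the score is averaged only over resolved predictions; unresolved decisions contribute nothing until they settle.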
Mean Absolute Error
Average distance between predicted probability and actual outcome. Intuitive: "on average, you are off by X%."
Directional Bias
Signed average error. Positive means you overestimate (overconfident). Negative means you underestimate (underconfident).
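Both metrics above are one-line aggregates over the same (probability, outcome) pairs. A minimal sketch, with illustrative names rather than a published GENYS interface:

```python
def mean_absolute_error(preds, outcomes):
    # "On average, you are off by X" -- in probability units.
    return sum(abs(p - o) for p, o in zip(preds, outcomes)) / len(preds)

def directional_bias(preds, outcomes):
    # Signed mean error: positive means predictions run above outcomes
    # (overconfident), negative means they run below (underconfident).
    return sum(p - o for p, o in zip(preds, outcomes)) / len(preds)
```

The two are complementary: a large absolute error with near-zero bias means your misses cancel out in both directions, while matching error and bias means you consistently miss the same way.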
Confidence Bucket Accuracy
10 buckets (0–10%, 10–20%, … 90–100%). For each range, what percentage of outcomes actually occurred? Requires ≥10 decisions per bucket to report.
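The bucketing rule can be sketched as follows. The ten equal-width buckets and the ≥10 threshold come from the text; the function name and return shape are assumptions:

```python
def bucket_accuracy(preds, outcomes, n_buckets=10, min_per_bucket=10):
    # Assign each prediction to an equal-width probability bucket, then
    # report the observed hit rate per bucket. Buckets with fewer than
    # `min_per_bucket` decisions return None instead of an unreliable rate.
    buckets = [[] for _ in range(n_buckets)]
    for p, o in zip(preds, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)  # p = 1.0 lands in the top bucket
        buckets[i].append(o)
    return [sum(b) / len(b) if len(b) >= min_per_bucket else None
            for b in buckets]
```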
Category Calibration
Separate Brier, bias, and error per decision type (crypto, pricing, channel, political, etc.). Requires ≥5 decisions per category.
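Per-category reporting is the same arithmetic grouped by decision type. A sketch under assumed record and report shapes; only the ≥5 threshold is from the text:

```python
from collections import defaultdict

def category_calibration(records, min_per_category=5):
    # records: iterable of (category, predicted_probability, outcome) triples.
    groups = defaultdict(list)
    for cat, p, o in records:
        groups[cat].append((p, o))
    report = {}
    for cat, pairs in groups.items():
        if len(pairs) < min_per_category:
            continue  # too few decisions to report, per the threshold above
        n = len(pairs)
        report[cat] = {
            "brier": sum((p - o) ** 2 for p, o in pairs) / n,
            "bias": sum(p - o for p, o in pairs) / n,
            "error": sum(abs(p - o) for p, o in pairs) / n,
        }
    return report
```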
Rolling vs Lifetime
Rolling calibration over the last 50 decisions vs all-time. Shows whether your accuracy is improving, stable, or degrading.
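The rolling-versus-lifetime comparison can be sketched as below; the 50-decision window is from the text, the trend labels are assumptions:

```python
def rolling_vs_lifetime(preds, outcomes, window=50):
    # Brier over the most recent `window` decisions vs. all-time.
    def brier(pairs):
        return sum((p - o) ** 2 for p, o in pairs) / len(pairs)
    pairs = list(zip(preds, outcomes))
    lifetime = brier(pairs)
    rolling = brier(pairs[-window:])  # last `window` (or fewer) decisions
    if rolling < lifetime:
        trend = "improving"
    elif rolling > lifetime:
        trend = "degrading"
    else:
        trend = "stable"
    return rolling, lifetime, trend
```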
Multi-Agent Comparison
When multiple probability sources exist on the same decision, GENYS computes error and calibration for each independently, so you can see which source is most accurate on any given decision and across the track record.
Market Comparison (when Polymarket is linked)
Your Brier Score: 0.214
Market Brier Score: 0.182
Your Brier score is 0.032 higher (worse) than the market baseline on decisions with linked Polymarket data.
Worked Example
Will ETH break $4,000 by March 31?
You predicted 65% and ETH did not break $4,000, so the outcome resolved to 0. Your error: |0.65 − 0| = 0.65. Brier contribution: 0.65² = 0.4225.
GPT-4o error: 0.72. Claude error: 0.61. Market error: 0.68.
Claude was the most accurate source on this decision.
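The worked example can be checked directly. This sketch compares each source's absolute error on the single resolved decision (outcome = 0, since ETH did not break $4,000):

```python
# Probabilities from the worked example; the decision resolved "no" (0).
outcome = 0
sources = {"you": 0.65, "gpt-4o": 0.72, "claude": 0.61, "market": 0.68}

errors = {name: abs(p - outcome) for name, p in sources.items()}
best = min(errors, key=errors.get)          # smallest error wins: "claude"
your_brier_contribution = (0.65 - outcome) ** 2   # 0.65 squared, i.e. 0.4225
```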
Calibration Drift
A single prediction tells you nothing about accuracy. But after 20, 50, 100 decisions, patterns emerge. GENYS detects these patterns automatically:
• “When you predict 60–70%, outcomes occur 52% of the time.” — You overestimate in this range by 13pp.
• “On crypto decisions, you overestimate by 18pp on average.” — Category-specific bias.
• “Your rolling Brier is 0.19 vs lifetime 0.24. Trend: improving.” — You are getting better.
GENYS surfaces this as a diagnostic, not a correction. It tells you where you are wrong. It does not tell you what to believe. You make the adjustment. The system measures whether it worked.
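A detector along the lines of the first bullet could look like this sketch, which flags buckets whose observed hit rate drifts from the bucket midpoint. The threshold, names, and input shape are all assumptions; it consumes per-bucket rates like those produced by confidence-bucket accuracy, with None for under-filled buckets:

```python
def flag_bias(bucket_rates, threshold=0.10):
    # bucket_rates: observed hit rate per equal-width bucket, or None
    # where too few decisions exist. Flags buckets whose rate deviates
    # from the bucket midpoint by at least `threshold`.
    flags = []
    for i, rate in enumerate(bucket_rates):
        if rate is None:
            continue
        midpoint = (i + 0.5) / len(bucket_rates)  # e.g. bucket 6 -> 0.65
        gap = midpoint - rate
        if abs(gap) >= threshold:
            direction = "overestimate" if gap > 0 else "underestimate"
            flags.append((i, direction, round(gap, 3)))
    return flags
```

Run on the first bullet's numbers (60–70% bucket, 52% observed), it reproduces the 13pp overestimate.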
Why This Matters
Without calibration, every estimate is just an opinion. With calibration, estimates become a measurable track record. Over time, well-calibrated forecasters earn trust because their probabilities mean what they say. Poorly calibrated sources can be identified and weighted accordingly.
This is the difference between “I think it will work” and “When I say 70%, it happens 68% of the time.”