Hybrid analogue matcher (Stream C2)
The base-rate matcher is the platform’s flagship analytical surface: “when markets looked like this in the past, what happened next.” Stream C2 (May 2026) restructured how the matcher finds analogues. Instead of distance-only ranking, the matcher now offers a tag pre-filter that selects “structurally identical” days before the Euclidean leg ranks them by similarity. This follows DOCTRINE §6 Q1 option C, P5 (hybrid > either extreme), and P8 (additive: the default stays soft; hybrid opts in).
The three modes
Every call to /api/v1/signals/base-rates accepts a mode= query parameter. The full catalog:
| Mode | Filter | Rank | When to use |
|---|---|---|---|
| hard | Bin equality across all dims | Match-count | Legacy. Kept for comparison. |
| soft | None | Weighted Euclidean over all dims | Default. The historical workhorse. |
| euclidean | None | Weighted Euclidean over all dims | Alias for soft. Same behaviour, clearer name. |
| tag | Coarse-tag overlap | Recency (no Euclidean) | When you want “every structurally similar day” without distance ranking. |
| hybrid | Coarse-tag overlap | Weighted Euclidean within the filtered set | Directional default. Picks up structurally identical days, then ranks by closeness. |
Default stays soft for backwards compatibility. hybrid is opt-in until the calibration loop validates it. Per DOCTRINE P8 — additive, not destructive.
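As a minimal client-side sketch of the mode= parameter described above: the helper below builds the request URL and rejects mode names that are not in the catalog. The host `https://platform.example` and the helper name `base_rates_url` are hypothetical, and validating the mode on the client is a choice of this sketch, not documented server behaviour.

```python
from urllib.parse import urlencode

# Mode names taken from the catalog above; "soft" is the documented default.
KNOWN_MODES = ("hard", "soft", "euclidean", "tag", "hybrid")
ROUTE = "/api/v1/signals/base-rates"


def base_rates_url(base_url: str, mode: str = "soft") -> str:
    """Build the request URL, rejecting modes that are not in the catalog."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {KNOWN_MODES}")
    return f"{base_url.rstrip('/')}{ROUTE}?{urlencode({'mode': mode})}"


# Opting in to hybrid is explicit; omitting the argument keeps the soft default.
# The host below is a placeholder, not a real deployment.
print(base_rates_url("https://platform.example", mode="hybrid"))
```

Keeping the default argument at soft mirrors the backwards-compatibility rule: a caller has to ask for hybrid by name.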
The coarse-tag filter
tag and hybrid both filter history by overlap with today’s five-tag coarse subset (declared as _HYBRID_FILTER_TAGS in app/signals/base_rates.py):
- gex: options dealer gamma exposure (sign + magnitude)
- energy_regime: energy-shock state (drives correlation regime)
- dix: institutional dark-pool buying
- vix_regime: volatility regime classification
- zero_dte_pcr: same-day options put/call ratio
Five tags, not the full 18 the signal_tags table emits per cycle. Per DOCTRINE §6 Q1 recommendation: filtering on all 18 fragments the analogue pool — five coarse components capture the “structurally identical day” signature without over-shrinking the sample.
A history row passes the filter iff for every tag in today’s coarse subset, the row’s recorded scale-level matches today’s exactly. “Exact overlap” semantics — no partial-match leniency at this stage. The C1 signal_tags table is the substrate; the matcher reads (component, scale_level) per trade_date from there.
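The exact-overlap rule above can be sketched as a single predicate. This is an illustration, not the code in app/signals/base_rates.py: `COARSE_TAGS` stands in for `_HYBRID_FILTER_TAGS`, the function name and the toy history are hypothetical, and treating a tag missing from a row as a mismatch is an assumption of this sketch.

```python
from typing import Mapping

# Stand-in for _HYBRID_FILTER_TAGS; the five components are the ones listed above.
COARSE_TAGS = ("gex", "energy_regime", "dix", "vix_regime", "zero_dte_pcr")


def passes_coarse_filter(today: Mapping[str, str], row: Mapping[str, str]) -> bool:
    """True iff the row's scale-level equals today's for every coarse tag.

    Assumption: a tag absent from the row counts as a mismatch.
    """
    return all(tag in row and row[tag] == today[tag] for tag in COARSE_TAGS)


# Toy data. The scale-level strings reuse values seen in the sample response;
# the full set of levels is not enumerated here.
today = {tag: "favorable" for tag in COARSE_TAGS} | {"dix": "leaning"}
history = {
    "2025-03-03": dict(today),                        # identical signature
    "2025-03-04": {**today, "dix": "favorable"},      # one tag differs
    "2025-03-05": {"gex": "favorable"},               # four tags missing
}

matches = [day for day, tags in history.items() if passes_coarse_filter(today, tags)]
print(matches)  # ['2025-03-03']
```

A single differing tag is enough to exclude a day, which is why the pool can shrink quickly on unusual setups.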
The fallback rule
The filter is exact-overlap, no leniency. On most days the historical pool with a coarse-tag overlap is comfortably above the statistical-power threshold; on outlier days (e.g. PANIC with deep-negative GEX + crisis energy regime) the overlap can collapse to single digits.
When the filter under-shoots _HYBRID_MIN_ANALOGUES (default 30 analogues), the matcher falls back to Euclidean over the unfiltered history and emits two response fields:
{
  "hybrid_fallback": true,
  "hybrid_fallback_reason": "12 analogues after tag filter; minimum 30"
}
Per DOCTRINE P0 — no silent degradation. A response with hybrid_fallback=true tells the consumer “you asked for hybrid; the filter produced too few analogues to be statistically useful; the matcher fell back to Euclidean over the full history.” Dashboards and agents should surface this state visibly. Per the working agreement, the user must always know when a hybrid call was downgraded.
The 30-analogue floor calibrates against the live history depth: ~2 years of trading days at 250/year leaves ~500 rows; five-tag exact-overlap typically retains 50-150 for an in-distribution day, dropping to single digits on the rare outlier setups where the pool legitimately should be thin. 30 preserves enough power for the bootstrap CI on the mean to be meaningful.
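The fallback rule can be sketched as a small pool-selection step. The function name `choose_pool` is hypothetical, and reading "under-shoots" as strictly fewer than the minimum is an assumption of this sketch. The reason string follows the format shown in the sample fields.

```python
from typing import Any, Dict, List, Sequence, Tuple

HYBRID_MIN_ANALOGUES = 30  # documented default for _HYBRID_MIN_ANALOGUES


def choose_pool(
    filtered: Sequence[str],
    full_history: Sequence[str],
    minimum: int = HYBRID_MIN_ANALOGUES,
) -> Tuple[List[str], Dict[str, Any]]:
    """Return the pool the Euclidean leg should rank, plus the response fields.

    Assumption: fallback fires when the filtered count is strictly below the minimum.
    """
    count = len(filtered)
    fell_back = count < minimum
    fields = {
        # Carries the count that triggered the fallback, not 0.
        "hybrid_filtered_count": count,
        "hybrid_fallback": fell_back,
        "hybrid_fallback_reason": (
            f"{count} analogues after tag filter; minimum {minimum}" if fell_back else None
        ),
    }
    return (list(full_history) if fell_back else list(filtered)), fields


full = [f"day_{i}" for i in range(500)]   # toy stand-in for the history table
pool, fields = choose_pool(full[:12], full)
print(len(pool), fields["hybrid_fallback_reason"])
```

Returning the fields alongside the pool keeps the downgrade visible to the caller instead of hiding it inside the matcher.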
Why not hierarchical fallback?
DOCTRINE §6 Q1 considered three options:
- Option A — Hierarchical drop: when the five-tag filter fragments, automatically relax to a four-tag filter, then three-tag, etc.
- Option B — Wider coarse filter (8-10 tags).
- Option C — Five-tag exact filter, fall back to Euclidean over full history when under-shot.
C ships. The Q1 narrative: hierarchical drop hides which tags the matcher relaxed, surfacing a smaller sample without explaining why; B fragments the sample too aggressively to start; C exposes the failure mode (fallback) loudly via the response field and lets the operator decide whether to widen the filter later. If exact-overlap proves too restrictive in practice (the fallback fires more than ~10% of cycles), C2’s follow-on can introduce option A — but not in this PR. Stay tight on the brief.
The matcher version
Every prediction the matcher emits carries MATCHER_VERSION:
2026.05.9-hybrid-15d-recency730
Decoded:
- 2026.05.9: date code; bumped from 2026.05.8 when C3 (wall-proximity dims) landed.
- hybrid: the dispatch surface widened to include the C2 modes (even though the default stays soft).
- 15d: _SOFT_NUMERIC_DIMS count. C3 (May 2026) added the two wall-proximity dims (spy_dist_to_call_pct + spy_dist_to_put_pct) once the schema-gap unblock landed; live capture writes both columns and the one-shot scripts/backfill_wall_proximity_daily.py filled history.
- recency730: default recency_half_life_days value.
Bump rules live in the # Bump the version when: comment block in app/signals/base_rates.py. The contract is binding: matcher logic changes mean a version bump, behaviour-identical refactors don’t. After every bump, python3 scripts/backfill_base_rate_predictions.py --force re-seeds the calibration substrate so the realised-vs-forecast curve isn’t polluted by stale-matcher predictions.
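For consumers that need to group predictions by matcher version, the string can be split into the four parts decoded above. This is a sketch: the regular expression is inferred from the single example string, and `MatcherVersion` and `parse_matcher_version` are hypothetical names, not part of app/signals/base_rates.py.

```python
import re
from typing import NamedTuple


class MatcherVersion(NamedTuple):
    date_code: str               # e.g. "2026.05.9"
    surface: str                 # e.g. "hybrid"
    numeric_dims: int            # the "15d" segment
    recency_half_life_days: int  # the "recency730" segment


# Pattern inferred from one example; other version shapes may exist.
_PATTERN = re.compile(
    r"^(?P<date>\d{4}\.\d{2}\.\d+)-(?P<surface>[a-z]+)"
    r"-(?P<dims>\d+)d-recency(?P<half_life>\d+)$"
)


def parse_matcher_version(value: str) -> MatcherVersion:
    """Split a MATCHER_VERSION string into its decoded components."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognised matcher version: {value!r}")
    return MatcherVersion(
        match["date"], match["surface"], int(match["dims"]), int(match["half_life"])
    )


parsed = parse_matcher_version("2026.05.9-hybrid-15d-recency730")
print(parsed)
```

Raising on an unrecognised shape is deliberate: a silently mis-parsed version would mix predictions from different matchers in the same calibration bucket.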
Reading the response
Calls with mode in {tag, hybrid} add four fields to the standard compute_base_rates() payload:
{
  "matcher_version": "2026.05.9-hybrid-15d-recency730",
  "mode": "hybrid",
  "sample_size": 87,
  "hybrid_filter_tags": {
    "gex": "favorable",
    "energy_regime": "favorable",
    "dix": "leaning",
    "vix_regime": "favorable",
    "zero_dte_pcr": "favorable"
  },
  "hybrid_filtered_count": 87,
  "hybrid_fallback": false,
  "hybrid_fallback_reason": null,
  "forward_1d": { /* ... */ },
  /* full standard payload */
}
- hybrid_filter_tags: the coarse tag set used for the filter (echoes today’s classifier output across the five components).
- hybrid_filtered_count: analogues after the filter, before any top-K cap. When fallback fired this carries the count that triggered the fallback (not 0).
- hybrid_fallback: bool. True iff the matcher fell back to Euclidean over unfiltered history.
- hybrid_fallback_reason: human-readable string when fallback fired; null otherwise.
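On the consumer side, a dashboard or agent might surface the downgrade along these lines. This is a sketch only: `fallback_banner` and the banner wording are hypothetical, and the payloads are trimmed to the fields described above.

```python
import json
from typing import Any, Mapping, Optional


def fallback_banner(payload: Mapping[str, Any]) -> Optional[str]:
    """Return a user-facing warning when a hybrid call was downgraded, else None."""
    if not payload.get("hybrid_fallback"):
        return None
    reason = payload.get("hybrid_fallback_reason") or "no reason supplied"
    return f"Hybrid matcher fell back to Euclidean over full history ({reason})."


# Trimmed example payloads; a real response carries the full standard payload.
downgraded = json.loads(
    '{"mode": "hybrid", "hybrid_filtered_count": 12, "hybrid_fallback": true,'
    ' "hybrid_fallback_reason": "12 analogues after tag filter; minimum 30"}'
)
normal = json.loads(
    '{"mode": "hybrid", "hybrid_filtered_count": 87, "hybrid_fallback": false,'
    ' "hybrid_fallback_reason": null}'
)

print(fallback_banner(downgraded))
print(fallback_banner(normal))  # None
```

Using `.get` means payloads from soft or hard calls, which do not carry the hybrid fields, produce no banner rather than an error.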
How to think about the trade-off
soft ranks every historical day by similarity and picks the top 100. hybrid first throws away every day whose coarse-tag signature differs from today, then ranks the survivors. The two answer different questions:
- soft: “Among all 1300 history days, what are the 100 closest by weighted Euclidean distance?” The 100 might include days from entirely different regimes if their numeric vectors happen to be close.
- hybrid: “Among the ~80 days whose structural signature matched today, what are the 50 closest by Euclidean distance?” The answer respects regime / squeeze fuel / dark-pool state as a hard filter, then ranks distance within that set.
Per DOCTRINE P5, hybrid > either extreme: pure tag matching is too coarse (no distance ranking); pure Euclidean is too leaky (a structurally different day can win on numeric similarity alone). The five-tag coarse filter encodes the “same kind of day” question; Euclidean then encodes “how close within that kind.”
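A toy illustration of the "leaky" case, under stated assumptions: one tag stands in for the five-tag signature, the vectors have two dimensions instead of fifteen, and the distance is taken as the square root of the weighted sum of squared differences, which may differ from the exact form the matcher uses. The fallback step is omitted because the pool here is deliberately tiny.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[str, str, Tuple[float, ...]]  # (day, coarse signature, numeric vector)


def weighted_distance(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Weighted Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2)."""
    return math.sqrt(sum(wi * (x - y) ** 2 for x, y, wi in zip(a, b, w)))


def rank(rows: Sequence[Row], today_vec: Sequence[float], w: Sequence[float]) -> List[str]:
    """Day names ordered from closest to farthest."""
    return [r[0] for r in sorted(rows, key=lambda r: weighted_distance(today_vec, r[2], w))]


today_signature, today_vec, weights = "favorable", (0.0, 0.0), (1.0, 1.0)
history: List[Row] = [
    ("day_a", "leaning", (0.1, 0.0)),    # different signature, numerically closest
    ("day_b", "favorable", (0.5, 0.5)),
    ("day_c", "favorable", (1.0, 0.0)),
]

soft = rank(history, today_vec, weights)
hybrid = rank([r for r in history if r[1] == today_signature], today_vec, weights)

print(soft)    # ['day_a', 'day_b', 'day_c']
print(hybrid)  # ['day_b', 'day_c']
```

In the soft ranking the structurally different day_a wins on numeric closeness alone; the hybrid ranking never considers it.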
See also
- Compare to history (MCP tool) — the MCP exposure of the matcher. Picks up the mode dispatch automatically.
- Base rates API — the HTTP route surface + query parameters.
- Health Score — the scoring system the coarse-tag filter dims also feed.
- Implementation: app/signals/base_rates.py (matcher), app/signals/signal_tags.py (substrate), app/signals/routes.py (route handlers).