Hybrid analogue matcher (Stream C2)

Last verified 2026-05-21

The base-rate matcher is the platform’s flagship analytical surface: “when markets looked like this in the past, what happened next.” Stream C2 (May 2026) restructured how the matcher finds analogues. Instead of distance-only ranking, the matcher now offers a tag pre-filter that selects “structurally identical” days before the Euclidean leg ranks them by similarity. The design follows DOCTRINE §6 Q1 option C, P5 (hybrid > either extreme), and P8 (additive: the default stays soft; hybrid opts in).

The three modes

Every call to /api/v1/signals/base-rates accepts a mode= query parameter. The full catalog:

Mode	Filter	Rank	When to use
`hard`	Bin equality across all dims	Match-count	Legacy. Kept for comparison.
`soft`	None	Weighted Euclidean over all dims	Default. The historical workhorse.
`euclidean`	None	Weighted Euclidean over all dims	Alias for `soft`. Same behaviour, clearer name.
`tag`	Coarse-tag overlap	Recency (no Euclidean)	When you want “every structurally similar day” without distance ranking.
`hybrid`	Coarse-tag overlap	Weighted Euclidean within the filtered set	Directional default. Picks up structurally identical days, then ranks by closeness.

The default stays soft for backwards compatibility; hybrid is opt-in until the calibration loop validates it. This follows DOCTRINE P8: additive, not destructive.
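As a minimal sketch of selecting a mode from client code, the helper below builds the request URL and rejects unknown modes before any network call. The helper name base_rates_url and the host are hypothetical; only the path and the mode= parameter come from this page.

```python
from urllib.parse import urlencode

# Modes from the catalog above; "euclidean" is an alias for "soft".
VALID_MODES = {"hard", "soft", "euclidean", "tag", "hybrid"}


def base_rates_url(base_url: str, mode: str = "soft") -> str:
    """Build a base-rates request URL (hypothetical client-side helper)."""
    if mode not in VALID_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}"
        )
    query = urlencode({"mode": mode})
    return f"{base_url.rstrip('/')}/api/v1/signals/base-rates?{query}"


# The host is a placeholder, not a real deployment.
print(base_rates_url("https://example.invalid", mode="hybrid"))
```

Defaulting the helper to soft mirrors the server-side default, so a caller has to opt in to hybrid explicitly.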

The coarse-tag filter

tag and hybrid both filter history by overlap with today’s five-tag coarse subset (declared as _HYBRID_FILTER_TAGS in app/signals/base_rates.py):

gex — options dealer gamma exposure (sign + magnitude)
energy_regime — energy-shock state (drives correlation regime)
dix — institutional dark-pool buying
vix_regime — volatility regime classification
zero_dte_pcr — same-day options put/call ratio

Five tags, not the full 18 the signal_tags table emits per cycle. Per DOCTRINE §6 Q1 recommendation: filtering on all 18 fragments the analogue pool — five coarse components capture the “structurally identical day” signature without over-shrinking the sample.

A history row passes the filter iff for every tag in today’s coarse subset, the row’s recorded scale-level matches today’s exactly. “Exact overlap” semantics — no partial-match leniency at this stage. The C1 signal_tags table is the substrate; the matcher reads (component, scale_level) per trade_date from there.
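The exact-overlap rule can be sketched as follows. This is an illustration under assumptions, not the code in app/signals/base_rates.py: the function names and the dict-of-dicts history shape are hypothetical, and the scale-level strings are borrowed from the sample response further down this page.

```python
from typing import Dict, List

# The five coarse components listed above (mirrors _HYBRID_FILTER_TAGS).
COARSE_TAGS = ("gex", "energy_regime", "dix", "vix_regime", "zero_dte_pcr")

Tags = Dict[str, str]  # component -> scale_level


def passes_filter(today: Tags, row: Tags) -> bool:
    """True iff the row matches today's scale level on every coarse tag today carries."""
    return all(
        row.get(tag) == today[tag]  # a missing tag on the row counts as a mismatch
        for tag in COARSE_TAGS
        if tag in today
    )


def filter_history(today: Tags, history: Dict[str, Tags]) -> List[str]:
    """Return the trade dates whose coarse-tag signature equals today's."""
    return [date for date, tags in history.items() if passes_filter(today, tags)]


today = {"gex": "favorable", "energy_regime": "favorable", "dix": "leaning",
         "vix_regime": "favorable", "zero_dte_pcr": "favorable"}
history = {
    "2025-03-03": dict(today),                        # identical signature
    "2025-03-04": {**today, "dix": "favorable"},      # one tag differs
    "2025-03-05": {k: v for k, v in today.items() if k != "gex"},  # tag missing
}
print(filter_history(today, history))
```

Only the first date survives: a single differing or missing tag is enough to drop a row, which is what "no partial-match leniency" means in practice.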

The fallback rule

Because the filter is exact-overlap with no leniency, the filtered pool can become very small. On most days the historical pool with a coarse-tag overlap is comfortably above the statistical-power threshold; on outlier days (e.g. PANIC with deep-negative GEX + crisis energy regime) the overlap can collapse to single digits.

When the filter under-shoots _HYBRID_MIN_ANALOGUES (default 30 analogues), the matcher falls back to Euclidean over the unfiltered history and emits two response fields:

{
  "hybrid_fallback": true,
  "hybrid_fallback_reason": "12 analogues after tag filter; minimum 30"
}

Per DOCTRINE P0 — no silent degradation. A response with hybrid_fallback=true tells the consumer “you asked for hybrid; the filter produced too few analogues to be statistically useful; the matcher fell back to Euclidean over the full history.” Dashboards and agents should surface this state visibly. Per the working agreement, the user must always know when a hybrid call was downgraded.
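The fallback decision can be sketched as below. This is a hedged illustration: hybrid_select and its rank callback are hypothetical names, and treating "under-shoots" as a strict less-than comparison against the floor is an assumption read from the "minimum 30" wording in the reason string.

```python
from typing import Callable, List, Sequence

HYBRID_MIN_ANALOGUES = 30  # default floor quoted above


def hybrid_select(
    filtered: Sequence[str],
    full_history: Sequence[str],
    rank: Callable[[Sequence[str]], List[str]],
    min_analogues: int = HYBRID_MIN_ANALOGUES,
    top_k: int = 100,
) -> dict:
    """Rank within the tag-filtered pool, or fall back to the full history."""
    count = len(filtered)
    fallback = count < min_analogues  # assumption: exactly 30 does not fall back
    pool = full_history if fallback else filtered
    return {
        "analogues": rank(pool)[:top_k],
        # Carries the count that triggered the fallback, not 0.
        "hybrid_filtered_count": count,
        "hybrid_fallback": fallback,
        "hybrid_fallback_reason": (
            f"{count} analogues after tag filter; minimum {min_analogues}"
            if fallback
            else None
        ),
    }


# Toy data: 12 filtered days out of 500, ranked by a stand-in (sorted).
full = [f"day-{i:03d}" for i in range(500)]
result = hybrid_select(full[:12], full, rank=sorted)
print(result["hybrid_fallback"], result["hybrid_fallback_reason"])
```

Returning the flag and the reason alongside the analogues keeps the downgrade visible to every consumer of the payload rather than leaving it in a server log.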

The 30-analogue floor is calibrated against the live history depth: ~2 years of trading days at 250/year leaves ~500 rows. Five-tag exact overlap typically retains 50-150 of those for an in-distribution day, dropping to single digits on the rare outlier setups where the pool legitimately should be thin. A floor of 30 preserves enough power for the bootstrap CI on the mean to be meaningful.

Why not hierarchical fallback?

DOCTRINE §6 Q1 considered three options:

Option A — Hierarchical drop: when the five-tag filter fragments, automatically relax to a four-tag filter, then three-tag, etc.
Option B — Wider coarse filter (8-10 tags).
Option C — Five-tag exact filter, fall back to Euclidean over full history when under-shot.

C ships. The Q1 narrative: hierarchical drop hides which tags the matcher relaxed, surfacing a smaller sample without explaining why; B fragments the sample too aggressively to start; C exposes the failure mode (fallback) loudly via the response field and lets the operator decide whether to widen the filter later. If exact-overlap proves too restrictive in practice (the fallback fires more than ~10% of cycles), C2’s follow-on can introduce option A — but not in this PR. Stay tight on the brief.

The matcher version

Every prediction the matcher emits carries MATCHER_VERSION:

2026.06.5-hybrid-15d-recency730-symmetric-rms-covgate-tradingday-topk100

Decoded:

2026.06.5 — date code; bumped when the served analogue-pool size (top_k) was pinned into the single served-config constant.
hybrid — the dispatch surface widened to include the tag-prefilter modes (even though the default stays soft).
15d — numeric distance-dimension count. The set reached 16 once the wall-proximity pair and the change-form credit dim were added, then dropped to 15 in June 2026 when an asymmetric stagflation dimension (present on recent rows only, ~7% historical coverage) was removed pending a historical backfill. The static dimension list is still 15 — the coverage gate below decides which of those 15 actually participate per call, but it doesn’t add or remove tuple members.
recency730 — default recency_half_life_days value.
symmetric — the June 2026 fix closed a live-vs-historical dimension asymmetry: two distance dimensions (the SPY/VIX rolling correlation and the VIX term-structure slope) were only ~5% covered on historical rows, so the matcher silently treated them as missing when comparing today’s live setup against the past. They were backfilled to ~95% from data the platform already stores plus a free public volatility index, so the analogue pool is drawn from a comparable population.
covgate — coverage gate (exclude-until-backfilled). A simulated-sparsity out-of-sample study found the old skip-if-null policy — where the distance loop silently skipped a dimension on any pair missing it — was the worst of four tested options, because a pair that happened to overlap on a thin live-only dimension got that dimension counted at full weight, so the most-NULL-but-coincidentally-overlapping rows looked nearest. The fix: a dimension participates only when its coverage across the history is at least 55%. On current data this keeps the change-form credit dimension (~60% covered) and drops the three live-only positioning dimensions that have no honest deep history (each below 10% covered). The excluded set is surfaced on the response so the exclusion is visible, never silent.
rms — root-mean-square normalization. The numeric squared-distance sum is divided by the number of dimensions actually used for each pair before the square root. This fixes the “most-NULL rows look nearest” bias on mixed-coverage pairs: under the old raw-Euclidean sum, a pair compared on fewer dimensions accumulated fewer terms and looked artificially close. Categorical (regime) penalties are added after the numeric average — a regime mismatch is an absolute penalty, not a per-dimension contribution that should shrink with dimension count.
tradingday — trading-day floor. Non-trading-day skeleton rows (holiday/weekend rows carrying only calendar-stamped macro series) are excluded from the analogue pool, so a top-K can no longer fill up with thin holiday rows.
topk100 — the served analogue-pool size (top 100 nearest) is part of the pinned served config: the live route, the calibration capture, and the reseed path all grade over the same pool size, so the published calibration curve grades the same estimator users see.
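The covgate and rms entries above can be illustrated together. The sketch below is not the production distance loop: the function names, the row shape, and the per-dimension weights argument are hypothetical, and adding the regime penalty after the square root is one reading of "added after the numeric average".

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

COVERAGE_FLOOR = 0.55  # the 55% coverage floor quoted above

Row = Dict[str, Optional[float]]


def coverage_gate(
    history: Sequence[Row], dims: Sequence[str], floor: float = COVERAGE_FLOOR
) -> Tuple[List[str], List[dict]]:
    """Split dims into those covered on >= floor of history rows and the rest."""
    kept, excluded = [], []
    n = len(history)
    for dim in dims:
        covered = sum(1 for row in history if row.get(dim) is not None)
        coverage = covered / n if n else 0.0
        if coverage >= floor:
            kept.append(dim)
        else:
            excluded.append({"field": dim, "coverage": coverage})
    return kept, excluded


def rms_distance(
    today: Row,
    row: Row,
    dims: Sequence[str],
    weights: Dict[str, float],
    regime_penalty: float = 0.0,
) -> float:
    """Weighted squared differences averaged over the dims actually used."""
    total, used = 0.0, 0
    for dim in dims:
        a, b = today.get(dim), row.get(dim)
        if a is None or b is None:
            continue  # this pair cannot be compared on this dim
        total += weights.get(dim, 1.0) * (a - b) ** 2
        used += 1
    if used == 0:
        return math.inf  # nothing comparable: never the nearest analogue
    # Assumption: the categorical penalty is absolute and added after the root.
    return math.sqrt(total / used) + regime_penalty


today = {"a": 0.0, "c": 0.0}
full_row = {"a": 3.0, "c": 4.0}
sparse_row = {"a": 3.0, "c": None}
print(rms_distance(today, full_row, ["a", "c"], {}))
print(rms_distance(today, sparse_row, ["a", "c"], {}))
```

Under a raw Euclidean sum the two rows would score 5.0 and 3.0, so the sparse row wins by a wide margin purely because it contributes fewer terms. Dividing by the number of dimensions used narrows that gap (about 3.54 versus 3.0 here), and the coverage gate removes thin dimensions before the comparison happens at all.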

Bump rules live in the # Bump the version when: comment block in app/signals/base_rates.py. The contract is binding: matcher logic changes mean a version bump, behaviour-identical refactors don’t. After every bump, python3 scripts/backfill_base_rate_predictions.py --force re-seeds the calibration substrate so the realised-vs-forecast curve isn’t polluted by stale-matcher predictions.
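A consumer that wants to check which matcher produced a stored prediction could split the version string along the lines of the decoded list above. This parser is a hypothetical convenience, not part of the platform; it assumes the tokens stay hyphen-separated and keep the 15d / recency730 / topk100 shapes shown here.

```python
MATCHER_VERSION = (
    "2026.06.5-hybrid-15d-recency730-symmetric-rms-covgate-tradingday-topk100"
)


def decode_version(version: str) -> dict:
    """Split a matcher version string into its date code, numbers, and flags."""
    date_code, *tokens = version.split("-")
    decoded = {"date_code": date_code, "flags": []}
    for token in tokens:
        if token.endswith("d") and token[:-1].isdigit():
            decoded["dims"] = int(token[:-1])  # e.g. "15d"
        elif token.startswith("recency") and token[7:].isdigit():
            decoded["recency_half_life_days"] = int(token[7:])
        elif token.startswith("topk") and token[4:].isdigit():
            decoded["top_k"] = int(token[4:])
        else:
            decoded["flags"].append(token)  # e.g. "rms", "covgate"
    return decoded


print(decode_version(MATCHER_VERSION))
```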

Reading the response

Calls with mode in {tag, hybrid} add four fields to the standard compute_base_rates() payload:

{
  "matcher_version": "2026.06.5-hybrid-15d-recency730-symmetric-rms-covgate-tradingday-topk100",
  "mode": "hybrid",
  "sample_size": 87,
  "hybrid_filter_tags": {
    "gex": "favorable",
    "energy_regime": "favorable",
    "dix": "leaning",
    "vix_regime": "favorable",
    "zero_dte_pcr": "favorable"
  },
  "hybrid_filtered_count": 87,
  "hybrid_fallback": false,
  "hybrid_fallback_reason": null,
  "forward_1d": { /* ... */ },
  /* full standard payload */
}

hybrid_filter_tags — the coarse tag set used for the filter (echoes today’s classifier output across the five components).
hybrid_filtered_count — analogues after the filter, before any top-K cap. When fallback fired this carries the count that triggered the fallback (not 0).
hybrid_fallback — bool. True iff the matcher fell back to Euclidean over unfiltered history.
hybrid_fallback_reason — human-readable string when fallback fired; null otherwise.

Every Euclidean-bearing mode (soft / euclidean / hybrid) additionally carries excluded_dims — the list of distance dimensions dropped by the coverage gate, each as {field, coverage}. It is [] when every dimension cleared the 55% coverage floor. On the live population it typically lists the three live-only positioning dimensions, so a reader can see exactly which dimensions did and didn’t condition the match.
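On the consumer side, a dashboard or agent could turn these fields into visible notices. The sketch below is illustrative only: describe_match is a hypothetical helper, the dimension name in the sample payload is made up, and coverage is assumed to be a fraction between 0 and 1.

```python
from typing import List


def describe_match(payload: dict) -> List[str]:
    """Collect human-readable notices a UI should show next to the result."""
    notes = []
    if payload.get("hybrid_fallback"):
        notes.append(
            "Hybrid downgraded to Euclidean over full history: "
            f"{payload.get('hybrid_fallback_reason')}"
        )
    for dim in payload.get("excluded_dims") or []:
        notes.append(
            f"Dimension {dim['field']} excluded by coverage gate "
            f"(coverage {dim['coverage']:.0%})"
        )
    return notes


sample = {
    "mode": "hybrid",
    "hybrid_fallback": True,
    "hybrid_fallback_reason": "12 analogues after tag filter; minimum 30",
    # "example_positioning_dim" is a placeholder, not a real field name.
    "excluded_dims": [{"field": "example_positioning_dim", "coverage": 0.07}],
}
for note in describe_match(sample):
    print(note)
```

An empty list means the call ran as requested with every dimension participating, so the UI has nothing extra to surface.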

How to think about the trade-off

soft ranks every historical day by similarity and picks the top 100. hybrid first throws away every day whose coarse-tag signature differs from today, then ranks the survivors. The two answer different questions:

soft: “Among all 1300 history days, what are the 100 closest by weighted Euclidean distance?” The 100 might include days from totally different regimes if their numeric vectors happen to be close.
hybrid: “Among the ~80 days whose structural signature matched today, what are the top 100 (or fewer, when the filtered set is smaller) closest by Euclidean distance?” The answer treats regime / squeeze fuel / dark-pool state as a hard filter, then ranks by distance within that set.

Per DOCTRINE P5, hybrid > either extreme: pure tag matching is too coarse (no distance ranking); pure Euclidean is too leaky (a structurally different day can win on numeric similarity alone). The five-tag coarse filter encodes the “same kind of day” question; Euclidean then encodes “how close within that kind.”