Source Health
Every fetch the platform runs appends a row to the source_runs table. The aggregate over a trailing window drives the single system-wide “are we healthy” reading on /api/v1/status and the dashboard’s StatusDot.
The source_runs row
record_source_run(source, ok, error, degraded) in app/signals/database.py writes a row with five columns:
| Column | Meaning |
|---|---|
| source | The source name (market, fred, darkpool, breadth, …) |
| ok | 1 for success, 0 for failure |
| degraded | 1 when the fetch failed but a cached fallback was served; counts as ok=1 |
| error | Truncated error message (≤500 chars) on failure, NULL on success |
| run_at | ET-aware ISO 8601 timestamp (Law 1) |
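As a rough illustration of the row shape, here is a minimal sketch using the standard-library sqlite3 module. The storage engine, the column types, the extra conn and run_at parameters, and the fixed-offset EST zone are assumptions made for this example. Only the function name, its four arguments, and the five columns come from this page.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumption: a fixed EST offset keeps the sketch dependency-free. Real code
# would use a proper America/New_York zone so daylight saving is handled.
EST = timezone(timedelta(hours=-5))

# Assumption: column types are illustrative, not the platform's actual DDL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS source_runs (
    source   TEXT    NOT NULL,
    ok       INTEGER NOT NULL,
    degraded INTEGER NOT NULL DEFAULT 0,
    error    TEXT,
    run_at   TEXT    NOT NULL
)
"""


def record_source_run(conn, source, ok, error=None, degraded=False, run_at=None):
    """Append one fetch outcome. A degraded run is stored with ok=1."""
    run_at = run_at or datetime.now(EST)
    ok_flag = 1 if (ok or degraded) else 0
    # Keep at most 500 characters of the error message.
    err = error[:500] if error else None
    conn.execute(
        "INSERT INTO source_runs (source, ok, degraded, error, run_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (source, ok_flag, int(bool(degraded)), err, run_at.isoformat()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_source_run(conn, "market", ok=True)
record_source_run(conn, "fred", ok=False, error="x" * 800)
record_source_run(conn, "breadth", ok=False, error="upstream 503", degraded=True)

rows = conn.execute(
    "SELECT source, ok, degraded, length(error) FROM source_runs ORDER BY rowid"
).fetchall()
```

Whether a degraded row keeps its error text is not stated on this page; the sketch keeps it so the underlying failure stays visible.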
Retention
source_runs is the one signal table with a default prune. Rows older than 30 days are deleted at the start of every cycle. Override via RETENTION_SOURCE_RUNS_DAYS (set to 0 to disable entirely). The /api/v1/status endpoint reads only a 7-day window, so the 30-day default is a comfortable buffer — anything beyond the retention horizon is dead weight.
Every other signal-history table is keep-forever by default; see the Retention policy in the repo CLAUDE.md for the full table.
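The prune described above could look roughly like the sketch below. The function name prune_source_runs, the env parameter, and the SQLite storage are hypothetical; only the RETENTION_SOURCE_RUNS_DAYS variable, the 30-day default, and the "0 disables" behavior are taken from this page.

```python
import os
import sqlite3
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))  # assumption: fixed offset for the example


def prune_source_runs(conn, now, env=os.environ):
    """Delete rows older than the retention horizon; return the number removed."""
    days = int(env.get("RETENTION_SOURCE_RUNS_DAYS", "30"))
    if days <= 0:
        return 0  # 0 disables pruning entirely
    cutoff = (now - timedelta(days=days)).isoformat()
    # Simplification: comparing ISO strings is only exact when every run_at
    # carries the same UTC offset, as it does in this example.
    cur = conn.execute("DELETE FROM source_runs WHERE run_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_runs (source TEXT, ok INTEGER, run_at TEXT)")
now = datetime(2025, 1, 31, 12, 0, tzinfo=EST)
for age_days in (1, 10, 29, 31, 45):
    conn.execute(
        "INSERT INTO source_runs VALUES (?, ?, ?)",
        ("market", 1, (now - timedelta(days=age_days)).isoformat()),
    )

skipped = prune_source_runs(conn, now, env={"RETENTION_SOURCE_RUNS_DAYS": "0"})
deleted = prune_source_runs(conn, now, env={})  # falls back to the 30-day default
remaining = conn.execute("SELECT COUNT(*) FROM source_runs").fetchone()[0]
```

With the default horizon, the 31-day-old and 45-day-old rows are removed and the three newer rows survive.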
The 95 / 80 escalation rule
/api/v1/status aggregates the trailing 7 days and computes sources_healthy_pct_7d. That percentage maps to a single overall status level:
Top-level status bands
The escalation is one-way per layer: any item in issues forces the top-level level to red; any item in warnings (with no issues) forces yellow; otherwise green. Other escalators feed the same logic — expiring Schwab tokens, the live SPY quote check, missing recent reports — so a 100% sources_healthy_pct_7d can still land yellow or red if something else is wrong.
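A minimal sketch of that escalation follows. It assumes the bands implied by the section heading: below 80% adds an issue, below 95% adds a warning. The exact boundaries, the function name overall_level, and the message strings are assumptions, not taken from app/main.py.

```python
def overall_level(healthy_pct, extra_issues=(), extra_warnings=()):
    """Map the 7-day percentage plus other escalators to green/yellow/red."""
    issues = list(extra_issues)
    warnings = list(extra_warnings)

    # Assumed thresholds, inferred from the "95 / 80" heading.
    if healthy_pct < 80:
        issues.append(f"sources healthy {healthy_pct:.1f}% (< 80%)")
    elif healthy_pct < 95:
        warnings.append(f"sources healthy {healthy_pct:.1f}% (< 95%)")

    # Any issue wins over any warning; warnings win over a clean slate.
    if issues:
        level = "red"
    elif warnings:
        level = "yellow"
    else:
        level = "green"
    return level, issues, warnings


level_clean, _, _ = overall_level(99.2)
level_dip, _, _ = overall_level(91.0)
level_bad, _, _ = overall_level(72.5)
# A perfect source percentage can still escalate through another check.
level_token, _, token_warnings = overall_level(
    100.0, extra_warnings=["schwab token expiring"]
)
```

Keeping issues and warnings as separate lists means each escalator only has to append to one of them; the final level is derived once at the end.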
What surfaces where
- GET /api/v1/status: top-level health reading. Returns level, schwab token/data status, sources_healthy_pct_7d, total_source_runs_7d, degraded_runs_7d, and a per-source source_failures_7d breakdown.
- GET /api/v1/source-health: per-source detail (today’s request count, failure count, remaining rate-limit budget where applicable). No source currently carries a daily budget cap; the AlphaVantage sentiment 25-req/day budget retired with the source (2026-06, DOCTRINE D22).
- GET /api/v1/source-health/trends: rolling per-source reliability series over a configurable window (1–180 days).
- Dashboard StatusDot: polls /api/v1/status and renders the same green/yellow/red band as a dot in the dashboard header. Hover for the most recent percentage.
Schwab is special
Schwab health is reported in two parts, because the token and the data path can each fail independently:
- token_status (values include unreadable): based on the on-disk refresh token state.
- data_status (values include unknown): based on a live SPY quote probe against the trading client at request time.
The combined schwab string reads healthy when data flows, the token state when the token is expired/missing, and token_ok_client_failing when the token file is fine but the live probe fails. That last state is deliberate: it means the trading Schwab client is wedged while the signals pipeline — which uses a separate Schwab client and keeps running — is unaffected. It is named precisely so it never reads as a platform-wide data outage.
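The combination rule can be sketched as a small function. The "ok" value for both inputs and the fall-through for other token states are assumptions; the page only names healthy, the expired/missing token states, and token_ok_client_failing.

```python
def combined_schwab_status(token_status, data_status):
    """Collapse token and data health into the single schwab string."""
    # Data flowing is the strongest signal, whatever the token file says.
    if data_status == "ok":  # assumption: "ok" marks a successful SPY probe
        return "healthy"
    # A bad token explains the failure, so surface the token state itself.
    if token_status in ("expired", "missing"):
        return token_status
    # Token file is fine but the live probe fails: the trading client is wedged.
    if token_status == "ok":  # assumption: "ok" marks a readable, valid token
        return "token_ok_client_failing"
    # Assumption: any other token state (for example unreadable) is passed through.
    return token_status


flowing = combined_schwab_status("ok", "ok")
expired = combined_schwab_status("expired", "failing")
wedged = combined_schwab_status("ok", "failing")
```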
The trading client self-heals. It tracks its own call health and, after a few consecutive failures or a stale window, rebuilds and re-reads the token file — so a token re-auth recovers without a restart. /status also returns a trading_client sub-object (initialized, consecutive_failures, seconds_since_success, refresh_failing, alarm). A sustained wedge the self-heal cannot clear (for example, the on-disk refresh token is itself dead) raises the alarm flag, a clearly-labeled warning, and a CRITICAL log — that is the signal to re-authenticate and, if it persists, restart the backend.
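One way to picture the self-heal loop is the sketch below. The class name, the thresholds (3 failures, 300 seconds, 2 rebuilds before alarm), and the rebuild callback are all hypothetical; the page only says "a few consecutive failures or a stale window" and does not give numbers.

```python
import time


class TradingClientHealth:
    """Tracks call health and triggers a client rebuild when it looks wedged."""

    def __init__(self, rebuild, max_failures=3, stale_seconds=300.0,
                 alarm_after_rebuilds=2, clock=time.monotonic):
        self._rebuild = rebuild          # callback that rebuilds the client
        self._clock = clock
        self.max_failures = max_failures
        self.stale_seconds = stale_seconds
        self.alarm_after_rebuilds = alarm_after_rebuilds
        self.consecutive_failures = 0
        self.rebuilds_since_success = 0
        self._last_success = clock()
        self._last_progress = self._last_success  # last success or rebuild

    def record_success(self):
        self.consecutive_failures = 0
        self.rebuilds_since_success = 0
        self._last_success = self._last_progress = self._clock()

    def record_failure(self):
        self.consecutive_failures += 1
        now = self._clock()
        hit_failure_limit = self.consecutive_failures % self.max_failures == 0
        gone_stale = now - self._last_progress > self.stale_seconds
        if hit_failure_limit or gone_stale:
            self._rebuild()  # the rebuilt client re-reads the token file
            self._last_progress = now
            self.rebuilds_since_success += 1

    @property
    def alarm(self):
        # Repeated rebuilds without a success mean self-heal is not working.
        return self.rebuilds_since_success >= self.alarm_after_rebuilds

    def snapshot(self):
        return {
            "consecutive_failures": self.consecutive_failures,
            "seconds_since_success": self._clock() - self._last_success,
            "alarm": self.alarm,
        }


fake_now = [0.0]
rebuilds = []
health = TradingClientHealth(
    rebuild=lambda: rebuilds.append("rebuilt"), clock=lambda: fake_now[0]
)
for _ in range(3):
    health.record_failure()
alarm_after_first_rebuild = health.alarm
for _ in range(3):
    health.record_failure()
alarm_after_second_rebuild = health.alarm
```

A single success clears both counters, which matches the described behavior where a token re-auth recovers without a restart.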
A token expiring within 12 hours escalates to a warnings entry. Schwab being completely down doesn’t take down the report cycle — the market source falls back to yfinance for everything except the option-chain-only signals (GEX, ZGL, PCR, gamma walls) which have no fallback path.
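The 12-hour rule can be sketched as a helper that contributes to the warnings list. The function name, the message text, and the inclusive boundary are assumptions; an already-expired token is left to the token_status path rather than reported here.

```python
from datetime import datetime, timedelta, timezone


def token_expiry_warnings(expires_at, now, threshold=timedelta(hours=12)):
    """Return a one-item warnings list when the token expires within threshold."""
    remaining = expires_at - now
    if timedelta(0) < remaining <= threshold:
        hours = remaining.total_seconds() / 3600
        return [f"Schwab token expires in {hours:.1f}h"]
    return []


now = datetime(2025, 3, 3, 9, 0, tzinfo=timezone.utc)
soon = token_expiry_warnings(now + timedelta(hours=6), now)
later = token_expiry_warnings(now + timedelta(days=3), now)
already_expired = token_expiry_warnings(now - timedelta(hours=1), now)
```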
Operating posture
A healthy system sits around 99% over the 7-day window — the occasional yfinance flake and rate-limited Alpha Vantage tick eat a few percent on a bad day. A persistent dip below 95% usually means one specific source is degraded:
- Check source_failures_7d: which source is the loudest contributor?
- Check /api/v1/source-health for that source: is the budget exhausted, the upstream down, or the token expired?
- Cross-reference against the source’s release calendar: a weekly source (cot, aaii) only fires a handful of times per week, so a single failure has outsized impact on its 7d percentage.
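The outsized impact of a single failure on a low-frequency source is easy to see with a little arithmetic. The run counts below are illustrative, not measured values, and healthy_pct is a hypothetical helper.

```python
def healthy_pct(ok_runs, total_runs):
    """Percentage of successful runs, or None when there were no runs."""
    if total_runs == 0:
        return None
    return round(100.0 * ok_runs / total_runs, 1)


# Illustrative 7-day counts: one failure each, very different run volumes.
weekly_source = healthy_pct(ok_runs=4, total_runs=5)        # e.g. a weekly feed
frequent_source = healthy_pct(ok_runs=499, total_runs=500)  # e.g. an intraday feed
system_wide = healthy_pct(ok_runs=4 + 499, total_runs=5 + 500)
```

Here the weekly source drops to 80.0% from one failure while the frequent source stays at 99.8%, and the system-wide figure (99.6%) barely moves. A per-source percentage should therefore be read alongside its run count.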
See also
- Lifecycle: where source_runs rows get written (Stage 1, fetch_all_sources).
- Data sources: the inventory each row tracks.
- Event feed: the unified /api/v1/events/feed includes source_failure as a kind alongside alerts, transitions, and news.
- Code: app/signals/database.py: record_source_run / get_source_health_7d / get_source_health_detailed. Escalation logic in app/main.py: status route.