Pythia v0.5.0: Built for Regime Change

The standard way to keep an intraday signal sharp is to retrain it often. Roll the training window forward, refit every few weeks or every quarter, and let each model specialize in the conditions it's most likely to face in the near term. It's a reasonable approach and it's what we did with prior versions of Pythia.

It has a structural weakness: every model in the chain is optimized for the regime that just ended. When the regime shifts — a vol spike, a rate surprise, a rotation across sectors — the model has to wait for the next refit to see it. And the refit itself can lock in features of a window that was already starting to fade.

Pythia v0.5.0 takes a different approach. One model, trained once on four years of limit-order-book data, never refit. This post is about what that change looks like, why we made it, and what we expect it to do for live performance.

What changed in the training process

Three concrete shifts.

Trained once on 4 years of data, validated out-of-sample on 15 months. The training window covers 2021 through 2024: the COVID-era recovery, the 2022 rate-hiking cycle, the 2023 banking stress, and the 2024 election cycle. Multiple regimes, one model. We then froze it and validated on 2025 and Q1 2026, fifteen months the model never saw during training. That out-of-sample window is what makes the rest of the numbers in this post meaningful.
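A minimal sketch of a chronological split along these lines, assuming sessions are keyed by calendar date. The function names and exact boundary dates are illustrative assumptions, not Pythia's actual pipeline.

```python
from datetime import date

# Boundaries read off the post: train on 2021-2024, validate on 2025 and Q1 2026.
# The exact first/last dates are assumptions made for illustration.
TRAIN_START, TRAIN_END = date(2021, 1, 1), date(2024, 12, 31)
VALID_START, VALID_END = date(2025, 1, 1), date(2026, 3, 31)

def split_sessions(session_dates):
    """Partition session dates into training and out-of-sample sets."""
    train = [d for d in session_dates if TRAIN_START <= d <= TRAIN_END]
    valid = [d for d in session_dates if VALID_START <= d <= VALID_END]
    return train, valid

def check_no_leakage(train, valid):
    """Raise if any training session falls on or after the first validation session."""
    if train and valid and max(train) >= min(valid):
        raise ValueError("training data overlaps the out-of-sample window")

sessions = [date(2020, 12, 31), date(2021, 1, 4), date(2024, 12, 31),
            date(2025, 1, 2), date(2026, 3, 31), date(2026, 4, 1)]
train, valid = split_sessions(sessions)
check_no_leakage(train, valid)
```

Dates outside both windows are simply dropped, so nothing after the freeze can reach the training set.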

Context window: 20 days, up from 3. A model with three days of context can recognize whether the last few sessions have been calm or jumpy. It can't easily tell whether the broader regime has been calm or jumpy. Twenty days gives it enough to distinguish the two. We tested several lengths; twenty was where the marginal benefit started to flatten out without making the model unwieldy.
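To make the calm-versus-jumpy distinction concrete, here is a toy illustration. The volatility readings are made-up numbers and the helper is hypothetical; it says nothing about how the model itself consumes its context.

```python
from statistics import mean

def context(daily_values, days):
    """Return the most recent `days` values: what a model with that much context sees."""
    if len(daily_values) < days:
        raise ValueError("not enough history for the requested context length")
    return daily_values[-days:]

# Hypothetical daily volatility readings: 17 calm sessions, then 3 jumpy ones.
daily_vol = [0.8] * 17 + [2.4] * 3

short_ctx = context(daily_vol, 3)    # only the jumpy sessions are visible
long_ctx = context(daily_vol, 20)    # the calm backdrop is visible too

recent_level = mean(short_ctx)
backdrop = mean(long_ctx[:-3])               # needs the longer context to compute
spike_ratio = recent_level / backdrop        # how unusual the recent sessions are
```

With three days of context, only `recent_level` is available. The comparison against `backdrop` is what the longer window adds.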

Optimization objective: day-level P&L, not per-minute accuracy. Previous Pythia versions optimized for the accuracy of every minute-by-minute prediction. The new version optimizes for cumulative profit across the full session — 9:30 AM to 4:00 PM ET. The change matters because that's how the signal is actually used. A model that's slightly less accurate per-minute but produces a stronger end-of-day P&L curve is the model customers want.

None of these are exotic ideas. The interesting part is what happens when you combine them with the decision to stop refitting.

The point isn't to beat the prior backtest

A model that's refit every quarter has a structural advantage on its own backtest: each version is tested in the window it was specialized for. That can produce strong numbers that don't reproduce in live trading, because the next regime isn't the one the model was tuned for.

The point of v0.5.0 isn't to beat the prior model in every cell of the backtest table. It's to make backtest performance translate to live performance — to compress the gap between what we see in evaluation and what customers see in production.

That changes how to read the numbers. Q1 2026 — the most recent out-of-sample window — looks like this:

Table 1: Q1 2026 Sharpe ratio — prior model vs Pythia v0.5.0

Signal	Prior model	Pythia v0.5.0
S&P 500	1.25	1.43
Nasdaq-100	0.97	0.43
Russell 2000	1.89	2.89
Dow Jones	−1.38	1.94

The Dow Jones line (−1.38 → 1.94) and the Russell 2000 line (1.89 → 2.89) show what happens when the out-of-sample regime is one the prior model wasn't tuned for. The Nasdaq-100 line (0.97 → 0.43) is the visible cost of giving up quarterly specialization. We're not arguing around it.

What we expect customers to care about, six months from now, is not any single Sharpe in that table. It's whether the live numbers look like the backtest numbers.
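For readers who want to run the same comparison on their own live results, here is a minimal sketch of an annualized Sharpe ratio computed from daily returns. The 252-day annualization and zero risk-free rate are common conventions assumed here; the post does not state which conventions were used for Table 1.

```python
import math
from statistics import mean, stdev

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over sample standard deviation of daily returns, scaled to a year.

    Assumes a zero risk-free rate and 252 trading days per year. Both are
    illustrative conventions, not necessarily the ones behind Table 1.
    """
    if len(daily_returns) < 2:
        raise ValueError("need at least two daily returns")
    sigma = stdev(daily_returns)
    if sigma == 0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return mean(daily_returns) / sigma * math.sqrt(periods_per_year)

def backtest_live_gap(backtest_returns, live_returns):
    """Difference between backtest and live Sharpe; smaller magnitude is better."""
    return annualized_sharpe(backtest_returns) - annualized_sharpe(live_returns)
```

A gap near zero is the outcome this section argues for, whatever the individual Sharpe values turn out to be.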

The cross-asset test

The strongest test of "built to generalize" isn't how the model does on the indices it was developed alongside. It's how it does on instruments it wasn't tuned for at all.

The Gold and Silver signals we released in April were trained with the same architecture. The architecture wasn't tuned to commodity microstructure; the same training process that produced the equity-index signals produced these. How they perform out-of-sample over the coming months is the most direct test of whether the approach generalizes.

Seed stability

A reproducible training process should land in the same place each time it's run. The prior approach didn't always: a sweep of runs that differed only in the random seed (same data, same architecture) could produce instances with materially different Sharpe ratios. v0.5.0 tightens that. Retraining with different seeds now produces instances whose Sharpe ratios cluster closely together. The model in production isn't a lucky pick from a noisy distribution; it's representative of any sibling instance trained the same way.
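One simple way to quantify how closely a seed sweep clusters is to summarize the spread of per-seed Sharpe ratios. This is a sketch with a hypothetical helper; the input values below are placeholders for illustration, not measured results.

```python
from statistics import mean, stdev

def seed_dispersion(sharpe_by_seed):
    """Summarize how tightly Sharpe ratios from seed-only reruns cluster."""
    values = list(sharpe_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to measure dispersion")
    return {
        "mean": mean(values),
        "stdev": stdev(values),               # sample standard deviation across seeds
        "range": max(values) - min(values),   # worst-case gap between sibling instances
    }

# Placeholder Sharpe ratios keyed by seed, purely illustrative.
example = seed_dispersion({0: 1.0, 1: 2.0, 2: 3.0})
```

A small `range` relative to `mean` is what "representative of any sibling instance" would look like in these terms.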

Cadence

One operational note worth surfacing. We ran the same v0.5.0 model at 1-minute, 5-minute, and 10-minute trading cadences across all four equity-index signals. For each signal, the Sharpe ratios at the three cadences came in within roughly 0.05 of each other. The signal isn't a high-frequency artifact that decays if you don't trade every minute, so customers can pick a cadence that fits their operations and their cost structure.
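One plausible reading of "trading at a cadence" is to refresh the position only every N minutes and hold it in between. The sketch below assumes that interpretation; the function names and the toy signal are hypothetical.

```python
def hold_at_cadence(minute_signals, cadence):
    """Refresh the position every `cadence` minutes and hold it in between."""
    if cadence < 1:
        raise ValueError("cadence must be at least 1 minute")
    positions = []
    current = 0  # flat before the first signal
    for i, signal in enumerate(minute_signals):
        if i % cadence == 0:
            current = signal
        positions.append(current)
    return positions

def trade_count(positions):
    """Number of position changes, a rough proxy for transaction costs."""
    previous = [0] + positions[:-1]
    return sum(1 for prev, cur in zip(previous, positions) if prev != cur)

# Toy per-minute signal: +1 long, 0 flat, -1 short.
signals = [1, 1, -1, -1, 1, 0, 0, 1, 1, 1]
every_minute = hold_at_cadence(signals, 1)
every_five = hold_at_cadence(signals, 5)
```

On this toy input the 5-minute cadence changes position far less often than the 1-minute one, which is the cost-side reason a slower cadence can be attractive when the Sharpe ratio holds up.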

What's next

v0.5.0 is the current state of Pythia, not the end state. A few research directions we're actively working on for upcoming releases:

Reinforcement-learning confidence and sizing. Today the signal is UP / STABLE / DOWN. We're moving to a continuous score from −10 to +10 that captures both direction and confidence, trained via reinforcement learning on full-session P&L. The existing categorical interface keeps working via simple thresholds.
Overnight and pre-market data as input. Today the model only sees the regular session. Extending the inputs to overnight and pre-market hours lets it pick up on what's happening before the open.
Cross-LOB inputs. Predictions for each instrument currently use only that instrument's order book. We're working on letting the model consume multiple order books simultaneously — e.g. feeding S&P 500, Nasdaq-100, Russell 2000, and Dow Jones order books together when predicting any one of them, to capture cross-instrument microstructure.

Each of these is an open research question, not a feature on a roadmap. We'll publish what we find.
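As an illustration of how a continuous score could still serve the existing categorical interface, here is a minimal thresholding sketch. The cutoff of 3.0 and the function name are assumptions chosen for the example; the post does not specify the thresholds.

```python
def to_category(score, threshold=3.0):
    """Map a continuous score in [-10, 10] to UP / STABLE / DOWN.

    The default threshold is a hypothetical value, not a published one.
    """
    if not -10.0 <= score <= 10.0:
        raise ValueError("score must lie in [-10, 10]")
    if score >= threshold:
        return "UP"
    if score <= -threshold:
        return "DOWN"
    return "STABLE"

labels = [to_category(s) for s in (7.5, 0.4, -9.0)]
```

Consumers of the categorical signal would keep receiving the same three labels, while consumers of the score could use its magnitude for sizing.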