Why detection compounds: the data flywheel.
The most common question we get from infrastructure investors: what stops a competitor with the same data from doing this in six months? The honest answer is nothing stops them in six months. What stops them in six years is something else.
Speed is a moat for six months
If the moat is speed — being first to market with an aviation oracle — the lead has a half-life. A well-funded competitor with a strong engineering team can replicate the data ingestion pipeline, the threshold-detection logic, and the API surface in roughly six months. We say this without bravado; we have built variants of this stack before, and we know the timeline.
Six months from now, multiple oracles will have the same first-order capability. Detect a delay above 15 minutes? Anyone can do that. Push a webhook on settlement? Anyone can do that. The interesting question is what happens at month seven and beyond.
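That first-order capability really is small. A minimal sketch of threshold detection plus a settlement webhook, with illustrative names throughout (`detect_delay`, `settle`, and the payload shape are assumptions for this sketch, not Aviax's API):

```python
# First-order oracle capability: a boolean threshold check plus a push
# to subscribers on settlement. All names and the payload are illustrative.
from typing import Callable

DELAY_THRESHOLD_MIN = 15

def detect_delay(delay_minutes: float,
                 threshold: float = DELAY_THRESHOLD_MIN) -> bool:
    """The easy part: check an observed delay against a fixed threshold."""
    return delay_minutes > threshold

def settle(flight_id: str, delay_minutes: float,
           notify: Callable[[dict], None]) -> bool:
    """Resolve a market and push the boolean outcome to a subscriber."""
    outcome = detect_delay(delay_minutes)
    notify({"flight": flight_id, "delayed": outcome})  # webhook on settlement
    return outcome
```

Nothing in this sketch is defensible; that is the point of the section that follows.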
Data is a different kind of moat
The asymmetry shows up in the second-order detection problems. Threshold-bound delays are easy. Cancellations versus very-late arrivals — that's harder, because the underlying process is identical until a specific operational decision is made. Diversions versus precautionary holds — also hard. Cascading delays where the aircraft is one input to a downstream flight — much harder, requires modeling the network, not the flight.
These second-order problems are where ML-based detection beats heuristics. And ML-based detection depends on labeled training data: specifically, labels for the exact outcomes the model is trying to predict.
Aviax's training corpus is eight years of historical flight pattern data, partitioned by airline, route, season, weather regime, airport pair, day of week, time of day. Every flight in that corpus has known outcomes — known delay tier, known cancellation status, known diversion. That is the seed dataset.
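One way to picture a row of that seed corpus: the partition dimensions named above plus the known outcomes. The field names here are illustrative (the text specifies the dimensions, not a schema):

```python
# A labeled row of the seed corpus: partition dimensions plus known outcomes.
# Field names are illustrative, not Aviax's actual schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class SeedRecord:
    airline: str
    airport_pair: str      # route, e.g. "JFK-LAX"
    season: str
    weather_regime: str
    day_of_week: int       # 0 = Monday
    hour_of_day: int
    delay_tier: int        # known outcome: bucketed delay severity
    cancelled: bool        # known outcome
    diverted: bool         # known outcome

def partition_key(r: SeedRecord) -> tuple:
    """Group flights by the dimensions the corpus is partitioned on."""
    return (r.airline, r.airport_pair, r.season, r.weather_regime,
            r.day_of_week, r.hour_of_day)
```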
The flywheel: every settlement labels training data for the next
The interesting moment is the first day Aviax goes live with markets. Day one, model accuracy is whatever the eight-year seed produced. Day two, the model has new data: yesterday's 100,000 flight outcomes, which the model predicted at some confidence, which then resolved.
Each resolved market is a labeled training example. The label is what actually happened (the outcome). The features are the conditions that existed at prediction time. The model retrains on yesterday's actuals to improve tomorrow's accuracy. This is not novel; it's standard supervised learning. What's novel is the rate.
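The daily step can be sketched in a few lines. `BaseRateModel` is a deliberately trivial stand-in for the real detector; the loop shape, not the model, is what the flywheel describes:

```python
# Daily flywheel step: yesterday's resolved markets are appended to the
# corpus as labeled examples, and the model refits. The model here is a
# trivial stand-in; any supervised learner slots into the same loop.
from dataclasses import dataclass

@dataclass
class ResolvedMarket:
    features: dict   # conditions that existed at prediction time
    outcome: bool    # what actually happened once the market settled

class BaseRateModel:
    """Predicts the historical positive rate; stands in for the real model."""
    rate: float = 0.0

    def fit(self, examples: list) -> None:
        self.rate = sum(m.outcome for m in examples) / len(examples)

def daily_retrain(model: BaseRateModel, corpus: list,
                  yesterday: list) -> None:
    corpus.extend(yesterday)   # each resolved market is one labeled example
    model.fit(corpus)          # retrain on yesterday's actuals
```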
That's 36.5M new training examples per year, on top of a seed of roughly 292M (eight years at the same 100K/day rate). By day 365, the corpus has grown by about 12.5% from its starting point. By year four, it has grown by half. By year eight, post-launch labels equal the entire historical seed, and from there on the corpus is dominated by data the flywheel generated itself.
Why the lead doesn't shrink — it grows
A naive analysis would say: a competitor that launches six months later catches up, because they get the same daily flow of labels. That's approximately true for relative corpus size, over a long enough horizon. It's not true for absolute size, and it's not true for accuracy.
The reason is non-stationarity. Aviation operations don't repeat the same patterns year over year — airlines change route networks, airports change capacity, weather patterns shift, regulatory regimes evolve. A model trained on eight years of historical data plus six months of recent flow has a richer view of regime shifts than a model trained on one year of recent flow alone.
The richer view matters for tail outcomes specifically. Common delays (15-minute thresholds in normal weather) are easy to predict from any reasonable corpus. Rare outcomes (cascading cancellations during weather events, mid-route diversions, network-wide ground stops) are visible only with enough history to have multiple instances of each rare regime.
The competitor who launches six months later has the same daily flow, but their training corpus at launch is roughly 182 days × 100K/day ≈ 18M labels. Aviax's corpus on the day of their launch is (8 years + 6 months) × 100K/day ≈ 310M labels. The competitor holds about 6% of Aviax's labeled corpus on day one. By Aviax's year three they hold roughly 23% (about 91M of 402M), and they do not reach even half of Aviax's corpus until around year nine. The absolute gap never closes at all: both sides gather labels at the same rate, so Aviax stays roughly 310M labels ahead while the competitor is still amortizing the eight-year head start.
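Under the stated rates (100K outcomes/day, an eight-year seed, a six-month launch lag), the two trajectories are easy to compute; this sketch just encodes those assumptions:

```python
# Corpus-size trajectory for Aviax vs. a competitor launching ~6 months
# later, under the document's stated rates. Both gather labels at the same
# daily rate after launch, so the absolute gap is constant.
DAILY_LABELS = 100_000
SEED_DAYS = 8 * 365      # Aviax's historical seed: eight years of outcomes
LAG_DAYS = 182           # competitor launches roughly six months later

def corpus_sizes(day: int) -> tuple[int, int]:
    """Labeled-corpus sizes (Aviax, competitor), `day` days after Aviax's launch."""
    aviax = (SEED_DAYS + day) * DAILY_LABELS
    competitor = max(0, day - LAG_DAYS) * DAILY_LABELS
    return aviax, competitor

for years in (1, 3, 9):
    a, c = corpus_sizes(years * 365)
    print(f"year {years}: Aviax {a/1e6:.0f}M, competitor {c/1e6:.0f}M "
          f"({100*c/a:.0f}% of Aviax), gap {(a-c)/1e6:.0f}M")
```

The printout shows the ratio creeping up (6% at year one, ~23% at year three, 50% only at year nine) while the absolute gap sits fixed at roughly 310M labels.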
Why airline operations data is the wrong substrate
One challenge to the moat argument: legacy flight data providers (Cirium, OAG, FlightAware) have been collecting flight outcome data for decades. Don't they already have everything Aviax has?
They have most of the raw data. They don't have it in the shape we need.
Legacy providers' schema is built around airline operations centers: crew rotation, gate management, fuel planning. Their resolution latency is measured in minutes (acceptable for ops decisions), not in seconds (required for settlement). And their outcome shape requires significant transformation before it can drive a settlement signal: "delay" is recorded as a continuous minute count, not a boolean threshold check, and airline-reported and signal-observed times conflict in non-trivial ways, with no single source of truth.
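The transformation gap can be made concrete. A hedged sketch of one such mapping, where the field names and the prefer-observed reconciliation rule are assumptions about how this might look, not a description of any provider's actual schema:

```python
# Reducing a legacy ops record (continuous minute count, two possibly
# conflicting time sources) to a settlement-grade boolean. Field names and
# the reconciliation rule are illustrative assumptions.
from typing import Optional

def settlement_outcome(airline_reported_min: Optional[float],
                       signal_observed_min: Optional[float],
                       threshold_min: float = 15) -> Optional[bool]:
    """Map a legacy delay record to a boolean settlement signal.

    When the sources conflict, prefer the signal-observed time (it is
    independently verifiable); fall back to the airline report; return
    None rather than guess when neither source is present.
    """
    delay = (signal_observed_min if signal_observed_min is not None
             else airline_reported_min)
    if delay is None:
        return None
    return delay > threshold_min
```

The point of the sketch is that every such rule (source precedence, missing-data behavior, threshold semantics) is a decision a competitor still has to make, test, and defend before their data is settlement-grade.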
The result: a competitor using legacy providers as input has cleaner historical data than Aviax does, but a longer integration path before the data drives a settlement-grade decision. They are six months behind on the model AND have to spend twelve months on schema transformation before the data is usable.
This is the second-order moat. Aviax's substrate is built ground-up for settlement, which is why six months of post-launch labels are worth more than two years of legacy operations data, in terms of what they can predict.
Speed for six months. Data for six years.
The honest pitch: do not invest in Aviax because of the six-month speed lead. That window closes. Invest in Aviax because of the eight-year corpus and the post-launch flow that compounds at 36.5M labels per year. That window doesn't close. It widens.
This is what the founders mean when we say "the lead compounds." It is not marketing speak. It is a specific structural property of supervised-learning systems whose own operation generates the labels they retrain on, in a non-stationary domain. Aviax was designed around this property.
Why this is now possible
The eight-year corpus exists because aircraft signal observations have been continuous and verifiable since roughly 2018. Before that, coverage was patchy enough that the historical record had gaps. Continuous coverage was the first prerequisite; the others, transformer architectures applied to time-series detection and retraining infrastructure that can ingest 100K labels/day without manual cleaning, reached production-readiness only in the last 18 months. The flywheel can spin now. It could not spin three years ago.