Familiarity Beats Cleverness
A custom format designed to save tokens consumed 738% more of them at scale. The lesson isn't about YAML — it's about what happens when you optimize against the grain of the substrate.
Someone built a better format.
They called it TOON — a custom syntax designed from scratch to be token-efficient. Fewer characters per schema definition. Tighter encoding. On paper, it was objectively superior to YAML for feeding database schemas to language models. A team could save 25% on tokens just by switching formats. The engineering was sound. The logic was clean. The optimization was real.
At 10,000 tables, TOON consumed 738% more tokens than YAML.[^1]
Not a marginal regression. Not a tradeoff. A catastrophic inversion: the optimized format consumed more than eight times as many tokens as the conventional one it was designed to replace.
The Grep Tax
Damon McMillan ran 9,649 experiments across 11 models, 4 file formats, and SQL schemas scaling from 10 to 10,000 tables. The goal was straightforward: find the best way to represent database schemas for AI agents. The results were not.
At small scale, TOON worked. Ten tables, fifty tables — the models handled the unfamiliar syntax adequately. But as schemas grew, something broke. The models couldn't parse TOON fluently. They'd attempt a query, fail, attempt a correction, fail again, enter a refinement loop — each failed attempt burning tokens. McMillan calls this the "grep tax": the cost of making a model search through a format it doesn't recognize.
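The arithmetic of that loop is worth making concrete. The sketch below is illustrative only: the per-attempt token counts, the failure rates, and the expected_tokens() helper are invented for the example, not figures from McMillan's study. It just shows how quickly retry probability swamps a per-attempt savings.

```python
# Illustrative arithmetic only: these token counts and failure rates are
# made up for the sketch, not measurements from McMillan's experiments.

def expected_tokens(tokens_per_attempt: float, failure_rate: float,
                    max_retries: int = 5) -> float:
    """Expected total tokens when each failed parse triggers another attempt."""
    total, p_still_failing = 0.0, 1.0
    for _ in range(max_retries + 1):
        total += p_still_failing * tokens_per_attempt  # cost of this attempt
        p_still_failing *= failure_rate                # chance we retry again
    return total

# Familiar format: slightly bigger per attempt, almost never misread.
familiar = expected_tokens(tokens_per_attempt=1000, failure_rate=0.02)

# A "25% smaller" custom format the model can't parse fluently.
compressed = expected_tokens(tokens_per_attempt=750, failure_rate=0.80)

print(f"familiar:   {familiar:,.0f} expected tokens")
print(f"compressed: {compressed:,.0f} expected tokens")
# With these made-up numbers the compressed format already costs ~2.7x more.
# Push the failure rate toward 1.0, as larger schemas do, and the gap explodes.
```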
YAML didn't have this problem. Not because YAML is elegant — anyone who's wrestled with whitespace sensitivity knows it isn't. YAML worked because every model in the study had ingested millions of YAML files during training. The syntax was already in the weights. Parsing it wasn't a task; it was a reflex.
The 738% gap isn't about file formats. It's about what happens when you build against the grain of the material.
Cleverness as Force
There's a builder's instinct that runs deep: optimize, compress, improve. You see waste, you engineer it out. You see a convention that's 25% larger than necessary, and you design a tighter alternative. This instinct is useful. It's built most of what works in the world.
But it has a blind spot. It assumes the substrate is neutral — that the tool doesn't care what format the input arrives in, only how much of it there is. With traditional software, that's mostly true. A parser handles whatever grammar you define. The machine doesn't have preferences.
Language models have preferences.
They have preferences because they have training distributions, and those distributions are wildly uneven. YAML appears in millions of repositories, configuration files, CI pipelines, Kubernetes manifests. A model encountering YAML is like a carpenter encountering wood — the material is familiar, the handling is automatic, the effort is minimal. A model encountering TOON is like that same carpenter handed a synthetic composite they've never touched. The material might be superior on a spec sheet, but the hands don't know it yet.
This is the insight the 738% gap encodes: optimizing a format the model can't read fluently doesn't save tokens — it multiplies them. Each failed parse attempt triggers a refinement cycle. Each refinement cycle burns tokens and context. The cleverness that was supposed to reduce cost becomes the primary source of cost.
Coherenceism has a principle for this: Alignment over Force.[^2] Don't push the river. Position yourself so the current carries the work forward. TOON was force — elegant, well-reasoned force, but force nonetheless. YAML was alignment — clumsy, verbose alignment with what the model already knew. The river won.
And the principle generalizes far beyond file formats. When you write prompts in compressed shorthand that departs from natural instruction patterns — you're paying the grep tax. When you organize a project in a novel directory structure that's unfamiliar to every model trained on src/, tests/, docs/ — you're paying the grep tax. When you invent a custom DSL instead of using Markdown that appears in every README on GitHub — you're paying the grep tax.
The tax is invisible at small scale. Ten tables, ten files, ten prompts — your clever format works fine. But the gap between familiar and unfamiliar doesn't scale linearly. It scales explosively. At small scale, a model compensates with a few extra inference steps. At large scale, those steps cascade into failure loops that compound. Conventional formats amplify what the model already knows; exotic formats introduce friction that grows with every additional table, file, and prompt.
The practical rule is blunt: choose the format that appears most often in training data. Not the most efficient. Not the most elegant. The most familiar.
- YAML over custom DSLs
- Markdown over proprietary markup
- Standard project layouts over novel ones
- Conventional API patterns over clever abstractions
- Natural language instructions over compressed shorthand
Convention isn't laziness. Convention is a shared language — and the model is a native speaker.
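As a concrete illustration of the tradeoff, here is a small Python sketch, assuming PyYAML is installed. The schema is hypothetical and the "compact" one-liner is a made-up compressed syntax, not actual TOON; the point is only that the shorter string is also the less familiar one.

```python
# The schema and the "compact" syntax below are invented for illustration;
# the compact form is not TOON, just a stand-in for any clever encoding.
import yaml  # PyYAML

schema = {
    "users": {
        "columns": {"id": "bigint", "email": "text", "created_at": "timestamp"},
        "primary_key": ["id"],
    }
}

# Familiar: the kind of YAML every model saw millions of times in training.
as_yaml = yaml.safe_dump(schema, sort_keys=False)
print(as_yaml)

# Clever: fewer characters, but a grammar that exists nowhere in the weights.
as_compact = "users(id:bigint!,email:text,created_at:timestamp)"
print(as_compact)

print(len(as_yaml), "chars vs", len(as_compact), "chars")
# The compact line wins on character count and loses on fluency: the model
# has to figure the grammar out instead of recognizing it on sight.
```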
Test the Substrate, Not the Spec Sheet
The model has opinions you didn't put there and can't remove. Its training distribution shaped preferences that are structural, not optional. You can't override them with a better design. You can only test for them — and most teams don't.
McMillan did. 9,649 experiments. Eleven models. Schemas from ten tables to ten thousand. The result isn't a format recommendation — it's a methodology: empirically test your infrastructure choices against the model's actual behavior at the scale you intend to operate. Not at prototype scale, where everything works. At production scale, where the grep tax comes due.
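A minimal harness for that kind of test might look like the sketch below. This is not McMillan's actual setup: the format list, the scale points, and the run_agent_task() helper are placeholders you would replace with your own formats, schemas, and agent call.

```python
# A sketch of the methodology, not McMillan's harness. FORMATS, SCALES, and
# run_agent_task() are placeholders to be swapped for your own stack.
from statistics import mean

FORMATS = ["yaml", "json", "markdown", "custom"]
SCALES = [10, 100, 1_000, 10_000]  # tables per schema

def run_agent_task(fmt: str, n_tables: int) -> dict:
    """Render an n_tables schema in `fmt`, hand it to the agent, run the task,
    and return {"tokens": <int>, "success": <bool>}. Left for you to fill in."""
    raise NotImplementedError

def benchmark(trials: int = 25) -> None:
    for n_tables in SCALES:
        for fmt in FORMATS:
            runs = [run_agent_task(fmt, n_tables) for _ in range(trials)]
            avg_tokens = mean(r["tokens"] for r in runs)
            success_rate = sum(r["success"] for r in runs) / trials
            print(f"{n_tables:>6} tables | {fmt:<8} | "
                  f"avg tokens {avg_tokens:>10,.0f} | success {success_rate:.0%}")
```

Tracking success rate alongside token count matters here, because the grep tax shows up as retries, not just as longer prompts.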
Don't build clever. Build familiar. The model will do the rest.
Sources: Damon McMillan — 'Structured Context Engineering for File-Native Agentic Systems' via Simon Willison (February 9, 2026)