IONIAN Blog — AI Engineering

48 Hours With Claude Fable 5 to Build a Language Just for AI

A 48-hour sprint with Claude Fable 5 produced a complete compiler, a deterministic VM, a formal grammar, and a self-verifying simulator — and a tokenizer audit that refuted our own headline token-savings claim. Here's what survived, and where it sits next to the grammar-constrained-decoding literature.

Claude Fable 5 went live on June 9, 2026. On its second day we opened a window, sat down with it, and didn't really get up until the window closed roughly 48 hours later. In that sprint we took an idea from nothing to a tagged v0.9.1: a complete compiler, a deterministic virtual machine, a formal grammar, a self-verifying simulator, byte-exact parity gates — and an honest tokenizer audit that disproved the entire reason we started.

One attribution note up front, because the commit history will be public: Claude Fable 5 built the language — the compiler, the VM, the engine, v0.1 through v0.9. The verification layer this post is actually about — the formal grammar, the self-verifying simulator and assertions, and the tokenizer audit that disproved our own headline — was built the next day with Claude Opus 4.8. Same sprint, two models; we credit each for what it did.

That last part is the point of this post. The sprint was fast and complete and self-critical, all at once. We're not writing it up because it worked. We're writing it up because the most useful thing we produced was the measurement that killed our own headline — and a smaller idea that survived the cut and turns out to sit on a live research question.

If you only take one thing: fast and rigorous are not opposites, and "our pitch was wrong" is a finding, not a failure.

The challenge

The idea was a language called Diell: a dense, machine-first notation that an AI writes and a tiny deterministic engine runs. Not a language for humans — humans would never read it — but a generation target for models. The bet was about cost. LLMs spend a lot of output tokens emitting verbose UI code (React, HTML, the lot). If a model could emit a compact bytecode-like form instead, you'd cut the tokens — and the energy — it takes to build a screen.

So we built the whole stack with Fable 5: an assembler from dense text to a binary container, a 4-phase transactive VM (64 integer registers, a forward-only program counter, a latched fail-state), layout, themes, a flagship tic-tac-toe that fits in about a kilobyte with zero host logic. It runs. It's deterministic. It replays bit-for-bit. You can try it in the browser right now — editor, engine surface, register file, dirty-mask, hex stream, fault injection.
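To make the three VM properties named above concrete, here is a minimal sketch of a register machine with 64 integer registers, a forward-only program counter, and a latched fail-state. The opcodes (`SET`, `ADD`, `JMPZ`), the tuple encoding, and the class name `ToyVM` are invented for illustration and are not Diell's instruction set; the four phases and the binary container are not modelled at all.

```python
from dataclasses import dataclass, field

NUM_REGS = 64  # register count from the post; everything else here is assumed


@dataclass
class ToyVM:
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)
    pc: int = 0
    failed: bool = False  # latched fail-state: once True, the run stops

    def run(self, program):
        """Run a list of (op, a, b) tuples. The ops are hypothetical."""
        while self.pc < len(program) and not self.failed:
            op, a, b = program[self.pc]
            next_pc = self.pc + 1
            if op == "SET" and 0 <= a < NUM_REGS:
                self.regs[a] = b                  # regs[a] = literal b
            elif op == "ADD" and 0 <= a < NUM_REGS and 0 <= b < NUM_REGS:
                self.regs[a] += self.regs[b]      # regs[a] += regs[b]
            elif op == "JMPZ" and 0 <= a < NUM_REGS:
                if self.regs[a] == 0:
                    next_pc = b                   # jump to instruction index b
            else:
                self.failed = True                # bad opcode or register index
            if next_pc <= self.pc:
                self.failed = True                # forward-only: no backward jumps
            if self.failed:
                break
            self.pc = next_pc
        return self


program = [
    ("SET", 1, 5), ("SET", 2, 7), ("ADD", 1, 2),
    ("JMPZ", 0, 5),   # regs[0] is 0, so skip the next instruction
    ("SET", 3, 99),
    ("SET", 4, 1),
]
first = ToyVM().run(program)
second = ToyVM().run(program)
backward = ToyVM().run([("SET", 1, 1), ("JMPZ", 0, 0)])
```

Because a run depends only on the program and the initial registers, `first.regs` and `second.regs` are identical, which is the replay property in miniature. The `backward` program tries to jump to index 0 from index 1 and latches the fault instead of looping.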

The honest finding (we are not burying this)

Our headline number was a lie we told ourselves with arithmetic.

Early on we estimated token cost the lazy way: characters ÷ 3.75. By that estimate the flagship program was about 593 tokens — comfortably under budget, story checks out, ship it.

Then we did the audit properly. We ran every keyword, sigil, and the whole flagship through the real byte-pair tokenizers models actually use — cl100k_base (GPT-4-class) and o200k_base (GPT-4o-class). Measured:

  • Flagship program: 1,230 tokens (cl100k) / 1,244 tokens (o200k).
  • Our estimate undercounted by about 52%.

Diell's source is sigil-dense — $, !, @, ALL-CAPS mnemonics — and that is precisely the kind of text BPE tokenizers fragment rather than compress. The real density is ~1.8 characters per token, not 3.75. And our own benchmark file already had the inconvenient note buried in it: for tic-tac-toe, idiomatic React's source was actually terser than Diell's in one of the comparisons.
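The arithmetic behind those figures can be reproduced in a few lines. This is a sketch under one stated assumption: the post does not give the flagship's character count, so it is backed out here from the 593-token estimate and the 3.75 divisor. The `count_tokens` helper shows how a real count could be taken with the `tiktoken` package; it is not called below and is not necessarily how the audit itself was run.

```python
CHARS_PER_TOKEN_GUESS = 3.75
ESTIMATED_TOKENS = 593
MEASURED = {"cl100k_base": 1230, "o200k_base": 1244}

# Assumption: source size recovered from the estimate, not stated in the post.
approx_chars = ESTIMATED_TOKENS * CHARS_PER_TOKEN_GUESS


def undercount(estimated, measured):
    """Fraction of the real token count that the estimate missed."""
    return (measured - estimated) / measured


def chars_per_token(chars, tokens):
    return chars / tokens


def count_tokens(text, encoding_name="cl100k_base"):
    """Real BPE count; needs the third-party tiktoken package installed."""
    import tiktoken
    return len(tiktoken.get_encoding(encoding_name).encode(text))


for name, measured in MEASURED.items():
    print(
        name,
        f"undercount={undercount(ESTIMATED_TOKENS, measured):.1%}",
        f"chars/token={chars_per_token(approx_chars, measured):.2f}",
    )
```

Both encodings land at roughly a 52% undercount and about 1.8 characters per token, matching the numbers quoted above.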

Put that next to where models are in June 2026 — Fable 5 and its peers generate real, working UI code cheaply and well — and the conclusion is unavoidable: the token-and-energy thesis is refuted by our own instrumentation. We are not going to sell you on tokens saved or watts saved. The measurement says don't.

That could have been the end. It wasn't, because of what was left standing after we took the headline away.

What survived

Strip out the cost story and one property remains that we still find genuinely interesting:

Diell is small enough to be fully grammar-constrained and fully simulated at the same time.

  • Fully grammar-constrained. We wrote a complete formal grammar for the language (grammar/diell.gbnf, in the GBNF format llama.cpp and XGrammar consume). Under grammar-constrained decoding (GCD), a local model is masked at each step to only the tokens the grammar permits — so a syntactically invalid program is impossible to emit by construction. Not unlikely. Impossible — under GCD, by construction. (Honest scope: we built and measured the grammar; wiring it to a local model to actually generate under constraint is done but not something we ran at scale this sprint — GCD needs a local open model, since proprietary chat APIs don't expose the per-token logits you mask. See the ledger.)
  • Fully simulated. The same 64-register state vector that makes the language small enough to constrain also makes it small enough to simulate exhaustively. We built a headless simulator and an assertion record: you write ? $4 == 2164 AFTER input $1 1999 right in the source, and the toolchain runs the program in the real engine and checks it. Behavior is verified before anything renders.
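The masking step in the first bullet can be sketched with a toy decoder. Everything here is hypothetical: the six-token vocabulary, the tiny state-machine grammar (real GBNF grammars are context-free and need a stack), and the `fake_logits` stand-in for a model. The only point is that once disallowed tokens are set to negative infinity, the decoder cannot pick them, however much the model prefers them.

```python
import math

VOCAB = ["SET", "$1", "$2", "7", ";", "junk"]

# Hypothetical toy grammar as a state machine: state -> {token: next_state}.
GRAMMAR = {
    "stmt": {"SET": "reg"},
    "reg": {"$1": "value", "$2": "value"},
    "value": {"7": "end"},
    "end": {";": "done"},
}


def fake_logits(step):
    """Stand-in for a model that always scores 'junk' highest."""
    return [1.0, 0.5, 0.2, 0.3, 0.1, 5.0]


def constrained_decode(logits_fn, max_steps=10):
    state, out = "stmt", []
    for step in range(max_steps):
        if state == "done":
            break
        allowed = GRAMMAR[state]
        logits = logits_fn(step)
        # Mask: tokens the grammar does not permit here get -inf.
        masked = [
            score if tok in allowed else -math.inf
            for tok, score in zip(VOCAB, logits)
        ]
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        out.append(VOCAB[best])
        state = allowed[VOCAB[best]]
    return out


unconstrained_pick = VOCAB[max(range(len(VOCAB)), key=lambda i: fake_logits(0)[i])]
decoded = constrained_decode(fake_logits)
```

Left alone, this fake model would emit `junk` at every step; under the mask it can only produce `SET $1 7 ;`.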

Together that's a loop where an AI can self-verify before it emits: generate under the grammar (guaranteed to parse), simulate against assertions (guaranteed to behave), and only then hand a human a running app they never have to read.
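The simulate-against-assertions half of that loop might look roughly like this. It is a sketch, not the toolchain: the regular expression covers only the one assertion shape quoted above, and `toy_program` is a made-up stand-in for the real engine, chosen so that the quoted assertion happens to hold.

```python
import re

# Assumption: only the single assertion shape shown in the post is handled.
ASSERT_RE = re.compile(
    r"^\?\s*\$(\d+)\s*==\s*(-?\d+)\s+AFTER\s+input\s+\$(\d+)\s+(-?\d+)$"
)


def parse_assertion(line):
    match = ASSERT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised assertion: {line!r}")
    out_reg, expected, in_reg, in_val = map(int, match.groups())
    return {"out_reg": out_reg, "expected": expected,
            "in_reg": in_reg, "in_val": in_val}


def toy_program(regs):
    """Hypothetical program standing in for the engine: $4 = $1 + 165."""
    regs[4] = regs[1] + 165


def check(line, program, num_regs=64):
    """Set the input register, run the program, compare the output register."""
    a = parse_assertion(line)
    regs = [0] * num_regs
    regs[a["in_reg"]] = a["in_val"]
    program(regs)
    return regs[a["out_reg"]] == a["expected"]


ok = check("? $4 == 2164 AFTER input $1 1999", toy_program)
wrong = check("? $4 == 2000 AFTER input $1 1999", toy_program)
```

The design point is that the check runs the program rather than inspecting its text, so it catches behaviour a grammar cannot see.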

Here's the part we actually measured, and the part that matters. We have a suite of 27 malformed programs — including the 7 real mistakes a model made when we first had one translate apps into Diell. We ran the grammar against all 27:

  • 18 of 27 are structural errors a context-free grammar can reject. GCD kills this entire class outright — including all 7 of the observed real-world model mistakes. Under the mask, the model can't even type them.
  • 9 of 27 are a semantic residue the grammar is blind to: writing a protected register, an out-of-range integer, a duplicate or undeclared handler id, a jump to a label that's behind it or doesn't exist, a range that crosses a security boundary. Each needs something a grammar by construction does not have — a symbol table, integer arithmetic, direction-sensitive resolution.

So: GCD makes it valid. The simulator makes it correct. Neither alone is enough. That clean 18/9 seam — between what a grammar can decide and what it can't — is the whole result. It's also, it turns out, a small data point on a question other people are actively working on.
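To show why the semantic residue needs more than a grammar, here is a small checker over a hypothetical instruction list. The tuple encoding, the opcode names, the protected register set, and the integer range are all assumptions made for the sketch; the post does not specify them. Each check relies on state a context-free grammar does not carry: a table of labels and handler ids, a numeric comparison, or the relative position of a jump and its target.

```python
PROTECTED_REGS = {0}           # assumed; the post does not list protected registers
INT_MIN, INT_MAX = 0, 65535    # assumed range, for illustration only


def semantic_errors(program):
    """Return (index, message) pairs for a list of hypothetical instruction tuples."""
    errors = []
    # Symbol tables: where each label sits, and which handler ids exist.
    label_pos = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "LABEL"}
    declared = {ins[1] for ins in program if ins[0] == "HANDLER"}
    seen_handlers = set()
    for i, ins in enumerate(program):
        op = ins[0]
        if op == "SET":
            _, reg, value = ins
            if reg in PROTECTED_REGS:
                errors.append((i, "write to protected register"))
            if not INT_MIN <= value <= INT_MAX:
                errors.append((i, "integer out of range"))
        elif op == "HANDLER":
            if ins[1] in seen_handlers:
                errors.append((i, "duplicate handler id"))
            seen_handlers.add(ins[1])
        elif op == "CALL":
            if ins[1] not in declared:
                errors.append((i, "undeclared handler id"))
        elif op == "JMP":
            target = label_pos.get(ins[1])
            if target is None:
                errors.append((i, "jump to missing label"))
            elif target <= i:
                errors.append((i, "jump to label behind"))
    return errors


bad = [
    ("LABEL", "top"), ("SET", 0, 1), ("SET", 5, 70000),
    ("HANDLER", 1), ("HANDLER", 1), ("CALL", 9),
    ("JMP", "top"), ("JMP", "nowhere"),
]
good = [
    ("HANDLER", 1), ("SET", 5, 3), ("JMP", "end"),
    ("SET", 6, 1), ("LABEL", "end"), ("CALL", 1),
]
bad_report = semantic_errors(bad)
good_report = semantic_errors(good)
```

Every instruction in `bad` is well-formed on its own, so a purely syntactic check would accept the whole list; the checker flags six problems in it and none in `good`.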

How this sits next to the literature

This is the section a skeptic should attack first, so let us be precise, and let us agree with the prior work rather than claim to beat it. The field has largely already established our central point.

  • Grammar-constrained decoding genuinely helps on narrow, well-specified targets. Raspanti, Ozcelebi, and Holenderski ("Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers," ACL 2025, Industry Track) show GCD improves both syntactic correctness and semantic accuracy on logical parsing, and can even substitute for in-context examples — most useful for smaller models. Theirs is a logical-parsing result, not application code generation — a real domain gap we're naming, not papering over — but it's still evidence for the "make the target small and constrain it" half of our bet, not against it.
  • But constraint distorts the model. Park, Wang, Berg-Kirkpatrick, Polikarpova, and D'Antoni ("Grammar-Aligned Decoding," NeurIPS 2024) show plainly that GCD "can distort the LLM's distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM." A masked model can be herded into a valid-but-worse program. We did not discover this caveat; they named it, and it's exactly the one we flag in our own notes.
  • And syntax was never the hard part. A very recent empirical study (Song, Rajput, Sun, Ezzini, Bissyandé, and Klein, "Empirical Study for Structured Output Control in LLMs for Software Engineering," arXiv:2606.09395, June 2026) finds that even strict template-driven control nearly eliminates syntax errors yet "substantial structural and semantic errors persist," concluding that structure-enforcing tools are "necessary but insufficient." Their taxonomy is three buckets (syntax / structural / semantic) to our two, but it rhymes with our 18/9 split — from someone else's data.
  • Which is why a whole line of work pushes semantics into the decoder. Monitor-Guided Decoding (Agrawal, Kanade, Goyal, Lahiri, Rajamani, NeurIPS 2023) uses static analysis to guide generation; SemGuard ("Real-Time Semantic Evaluator for Correcting LLM-Generated Code," ASE 2025, arXiv:2509.24507) goes further and reports that semantic errors — code that compiles but behaves wrong — are the majority of faults in generated code, and supervises them line-by-line during decoding.
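The distortion caveat from Park et al. can be seen in a two-token toy example. The probabilities below are invented for illustration and are not from their paper. The model usually starts with `A`, but almost every continuation of `A` is ungrammatical. Step-by-step masking still lets `A` through at the first step, because it has one valid continuation, and then forces that continuation, so the masked sampler ends up favouring a sequence the model itself considers unlikely among the valid ones.

```python
# Invented toy model: P(first token) and P(second token | first token).
P_FIRST = {"A": 0.9, "D": 0.1}
P_SECOND = {"A": {"B": 0.99, "C": 0.01}, "D": {"C": 1.0}}
VALID = {("A", "C"), ("D", "C")}   # sequences the toy grammar accepts


def lm_prob(seq):
    return P_FIRST[seq[0]] * P_SECOND[seq[0]][seq[1]]


def true_conditional():
    """The model's distribution restricted to valid sequences, renormalised."""
    z = sum(lm_prob(s) for s in VALID)
    return {s: lm_prob(s) / z for s in VALID}


def gcd_distribution():
    """What per-step masking plus local renormalisation actually samples."""
    out = {}
    firsts = {
        f: p for f, p in P_FIRST.items()
        if any((f, s) in VALID for s in P_SECOND[f])
    }
    z1 = sum(firsts.values())
    for f, p in firsts.items():
        seconds = {s: q for s, q in P_SECOND[f].items() if (f, s) in VALID}
        z2 = sum(seconds.values())
        for s, q in seconds.items():
            out[(f, s)] = (p / z1) * (q / z2)
    return out


target = true_conditional()
masked = gcd_distribution()
```

In this toy, the model's own conditional puts about 8% on `A C` and about 92% on `D C`, while masked decoding produces `A C` 90% of the time. Both outputs are grammatical; only the proportions are wrong, which is the effect the quoted sentence describes.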

Our narrow, defensible contribution — and we claim nothing larger — is this: most of that work either names the syntax/semantics gap or attacks it inside the decoder. We took a language small enough to be fully GCD-constrained — a whole application language, not a JSON schema — paired it with a full behavioral simulator outside the decoder, and then measured the exact seam: 18 structural failure modes the grammar erases, 9 semantic ones it can't touch (each shown to require a symbol table, integer math, or direction resolution no context-free grammar can express), plus behavior, which no grammar checks at all. It's a small empirical point on an open question, not a result that beats anyone. If you want to check it, the comparison write-up shows its work and the suite is one command to re-run.

Try it — and where it's actually useful

Be clear-eyed about the limits, because they're real and they're the reason the consumer-app pitch died: 64 integer registers, no floats, no strings in registers, forward-only control flow (no loops), about 256 bytes of state. That gets you counters, calculators, tic-tac-toe, simple forms, tiny dashboards. General-purpose apps are a wall, and pretending otherwise would undo the only thing this post is for.

Where a verifiable, fully-constrainable substrate might genuinely earn its keep is the opposite of general: narrow, high-stakes, naturally-small domains where "it parses" is nowhere near "it's correct" and a custom verified pipeline pays for itself — verified state machines, financial and tax rule sets, protocol / config / policy DSLs, deterministic game logic for a single engine. If that's a problem you have, we'd genuinely like to see what you do with it.

Things you can poke at today:

  • The playground — write Diell, watch the engine run it, inject faults: ionian.co/projectdiell_0.8.8/playground.
  • The whole language in one file: diell.txt is the single document an AI loads to learn Diell in one shot.
  • The honest claims ledger and the GCD write-up: the Research page.
  • Run it yourself from the source: npm run v09:parity (the full byte-exact gate) and npm run channel:demo (the verify → render stages, end to end on a verified program; the generation stage is stubbed where no local model is present).

Claims ledger

Every number in this post, tagged. Owning these is the only reason it's worth reading.

| Claim | Status | The honest version |
| --- | --- | --- |
| "~5–8× fewer output tokens" | abandoned | A design target, never measured. The real audit refutes it. |
| Flagship ≈ 593 tokens | estimated, wrong | A chars/3.75 guess that undercounted by ~52%. |
| Flagship = 1,230 / 1,244 tokens | measured | Real cl100k_base / o200k_base, June 2026. |
| React's tic-tac-toe is terser in source | measured (proxy) | By our source-token / chars÷3.75 proxy, not real BPE; noted in our own benchmark file before we admitted it out loud. |
| Token/energy savings | abandoned | Not selling it. Frontier models write real UI code cheaply now. |
| Grammar rejects the structural failure modes | measured (grammar) | The grammar structurally rejects 18/27 fixtures (incl. all 7 observed LLM modes); under GCD those become impossible to emit by construction. The 9 semantic modes pass the grammar; the simulator catches them. |
| 18/27 structural vs 9/27 semantic | measured | One 27-fixture suite; representative, not exhaustive. |
| GCD distorts toward valid-but-worse | cited, not ours | Park et al., NeurIPS 2024. We credit it; we didn't find it. |
| The language builds real general apps | false / abandoned | 64 int registers, ~256B state. Counters to tiny dashboards, full stop. |
| Useful as a verifiable substrate for narrow domains | open / unproven | A hypothesis we'd like tested, not a result. |

Built in ~48 hours, from Claude Fable 5's second day until our access window closed: the language with Fable 5, the verification layer with Claude Opus 4.8. The fast part and the self-critical part were the same sprint. We think that combination, not the language, is the thing worth copying.

Project Diell is kept online and interactive on purpose. If you build something narrow and verifiable with it, or you can break the 18/9 claim, tell us: ionian.co/contact.