AI Coding Milestones I'm Looking Forward To
I’m trying to imagine future “Opus 4.5 moments” that will make my head explode. Here I’ll keep a running tally of what present-day models can few-shot.
Spreadsheet -> App
Most business processes are encoded in spreadsheets, and much of the software work in the last three decades was about “scaling” beyond them. Since building software was expensive, we got band-aids like ODBC connectors and Airtable, but none have truly replaced the inimitable spreadsheet. I expect this calculus to change significantly as frontier AI models improve.
One-shotting a spreadsheet into an app is far more complicated than analyzing Excel formulas. The hard part is reconstructing the business process hiding behind the workbook: who enters data, who reviews it, which cells encode policy, which tabs are dashboards, and what workflow the spreadsheet is secretly coordinating.
Generally, getting the “context” correct requires going beyond the spreadsheet itself and into things like emails, chat, access logs, etc.
Many companies are run on spreadsheets. In a way, this assesses a model’s ability to understand a business, holistically, and relate that to each cell, each row, each sheet.
Cross Platform, Open Source Email Client
It’s 2026 and we still don’t have an open source email client that’s better than free web apps like Gmail. People have tried, but they often blame the complexity hell of email standards + provider-specific features in Gmail/Outlook that consumers have grown accustomed to. The final boss is then dealing with every platform change (macOS, Windows, iOS, Android, Linux) and you quickly see that this is/was untenable outside of for-profit companies.
The test for AI agents is not just understanding each operating system, each email provider, and email standards, but also autonomously keeping up with changes, such as:
- After a new macOS version is released and docs are available, a PR is opened with CI mostly passing and is ready for review.
- Google has an API Discovery Service, so an agent can regularly poll it for any changes to the Gmail API.
- Support for up-and-coming email providers is trivially added.
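To make the polling idea a bit more concrete, here is a minimal sketch of the "what changed?" step: diffing two snapshots of an API description. It assumes the snapshots follow the general shape of a Google API discovery document (nested `resources`, each holding `methods` with an `id`). The helper names `collect_methods` and `diff_methods` are hypothetical, and fetching the document over HTTP is left out on purpose.

```python
import json
from typing import Any


def collect_methods(doc: dict[str, Any]) -> dict[str, str]:
    """Map each method id to a canonical JSON dump of its definition."""
    found: dict[str, str] = {}

    def walk(node: dict[str, Any]) -> None:
        # Methods can live at the top level or under any nested resource.
        for name, method in node.get("methods", {}).items():
            method_id = method.get("id", name)
            found[method_id] = json.dumps(method, sort_keys=True)
        for child in node.get("resources", {}).values():
            walk(child)

    walk(doc)
    return found


def diff_methods(old: dict[str, Any], new: dict[str, Any]) -> dict[str, list[str]]:
    """Report method ids that were added, removed, or redefined between snapshots."""
    before, after = collect_methods(old), collect_methods(new)
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }
```

An agent could store yesterday's snapshot, call `diff_methods` against today's, and only open a PR when any of the three lists is non-empty. The hard part is everything after the diff: deciding what each change means for the client.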
My hope is that email clients are treated with the same quality as browser clients. One can dream…
AI Jepsen Tests & Analysis
I am positive at least one person has already tried “Kyle Kingsbury as an AI”. The challenge is understanding the intricacies of distributed systems and how they can fail, and then provably showing it through a reproducible test & analysis.
(of course, then an AI agent can be spun up to make the test pass…)
This would be a fantastic display of AI intelligence, since it requires high-level distributed systems thinking paired with coding capabilities to improve existing systems. My hunch is this is doable for a large portion of distributed systems, as they’ve really converged over time.
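For a flavor of what the "analysis" half involves, here is a toy checker loosely inspired by the set-style workloads seen in Jepsen-style tests: clients add elements, then a final read shows which ones survived. It is only a sketch. The `Op` record and `check_set` function are hypothetical names, not Jepsen's actual API, and a real analysis would also need fault injection and far more careful handling of indeterminate operations.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Op:
    kind: str   # "invoke", "ok", "fail", or "info" (outcome unknown)
    f: str      # "add" or "read"
    value: Any  # an element for "add", a collection of elements for an ok "read"


def check_set(history: list[Op]) -> dict[str, Any]:
    """Flag acknowledged writes missing from the final read, and elements nobody wrote."""
    attempted = {op.value for op in history if op.kind == "invoke" and op.f == "add"}
    acknowledged = {op.value for op in history if op.kind == "ok" and op.f == "add"}
    reads = [op for op in history if op.kind == "ok" and op.f == "read"]
    if not reads:
        # Without a completed read there is nothing to compare against.
        return {"valid": "unknown", "lost": [], "unexpected": []}

    final = set(reads[-1].value)
    lost = sorted(acknowledged - final)      # the system said "ok" but the data is gone
    unexpected = sorted(final - attempted)   # data appeared that no client tried to add
    return {"valid": not lost and not unexpected, "lost": lost, "unexpected": unexpected}
```

Note that an add whose outcome is unknown (`info`) may legitimately show up in the final read or not. Only acknowledged adds count as lost.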
Rewrite complex_app in a different language
The goal isn’t to rewrite everything in Rust, but the ability to do so would be an authoritative display of AI capabilities. Even though LLMs are getting very good at coding and understanding language syntax and features, consistently rewriting a program of high complexity is much harder.
By “high complexity” I’m referring to things like FFmpeg, Linux, Postgres, Chromium, AOSP, PyTorch, etc.
Successfully rewriting a program from one language to another requires at least the following capabilities:
- Understanding the tradeoffs between source and target programming languages.
- Knowing when a translation will result in feature regression vs. increasing features. For example, moving Rails from Ruby -> Elixir could result in increased reliability.
- Accurately evaluating the output in a way that’s legible to humans.
These would be incredibly difficult for a team or org of people focused on migrating code, let alone one person.
I like this because, in some cases, verification is very doable, like when Ladybird ported LibJS from C++ to Rust using Claude Code and Codex. It was really clever to ensure identical bytecode between the C++ and Rust compilers.
If a project has thorough tests, perhaps that’s a good first step to compare before & after for a benchmark.
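A before & after comparison can be sketched as a small differential test: feed the same inputs to the original and the port, and report every place they disagree, including disagreements about whether to raise. Everything here is illustrative. `differential_test`, `legacy_parse`, and `ported_parse` are made-up names, and a real port would compare richer artifacts (bytecode, files, query results) than return values.

```python
from typing import Any, Callable, Iterable


def run(fn: Callable[[Any], Any], arg: Any) -> tuple[str, Any]:
    """Capture either the return value or the exception type, so both can be compared."""
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_test(
    reference: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list[tuple[Any, tuple[str, Any], tuple[str, Any]]]:
    """Return (input, reference outcome, candidate outcome) for every disagreement."""
    mismatches = []
    for arg in inputs:
        expected, actual = run(reference, arg), run(candidate, arg)
        if expected != actual:
            mismatches.append((arg, expected, actual))
    return mismatches


def legacy_parse(text: str) -> int:
    """Stand-in for the original implementation."""
    return int(text)


def ported_parse(text: str) -> int:
    """Stand-in for a port that forgot about a leading '+' sign."""
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("-")
    if not digits.isdigit():
        raise ValueError(text)
    return sign * int(digits)
```

Running `differential_test(legacy_parse, ported_parse, ["42", "-7", "+3", "abc"])` flags only `"+3"`: both versions agree on the valid numbers and both reject `"abc"`, but the port wrongly rejects the explicit plus sign.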
Port feature from one language to another
Sometimes I dream about questions like these:
- What if Erlang-style “supervisors” were a core feature in every programming language?
- Could we get type systems as expressive as OCaml or Haskell in any statically typed language? What about optional static typing like TypeScript, but everywhere, and just as good?
- Why doesn’t every language have immutable data structures, optional chaining, a decent REPL, good error messages, usable concurrency, etc.?
In general, programming languages have been converging over time, but I want it to happen much faster. I’d love to see a feature in one programming language, and then one-shot it into the language I’m using today.
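As a toy illustration of the gap, here is roughly what a hand-rolled "supervisor" looks like in Python: run a worker in its own thread and restart it when it crashes, up to a limit. This is only a sketch with a hypothetical `Supervisor` class. It mimics a crude one-for-one restart and ignores most of what makes the Erlang version good, such as restart intensity over a time window, supervision trees, and process isolation.

```python
import threading
from typing import Callable


class Supervisor:
    """Restart a crashing worker, up to max_restarts times."""

    def __init__(self, worker: Callable[[], None], max_restarts: int = 3) -> None:
        self.worker = worker
        self.max_restarts = max_restarts
        self.restarts = 0
        self.crashes: list[str] = []

    def _run_once(self) -> bool:
        """Run the worker in its own thread; return True if it exited cleanly."""
        failure: list[BaseException] = []

        def target() -> None:
            try:
                self.worker()
            except Exception as exc:  # the crash is recorded, not propagated
                failure.append(exc)

        thread = threading.Thread(target=target)
        thread.start()
        thread.join()
        if failure:
            self.crashes.append(repr(failure[0]))
            return False
        return True

    def run(self) -> bool:
        """Return True on a clean exit, False once the restart budget is spent."""
        while True:
            if self._run_once():
                return True
            if self.restarts >= self.max_restarts:
                return False
            self.restarts += 1
```

The point of the milestone is that nobody should have to write this by hand. A model that can truly port the feature would bring the semantics along, not just the restart loop.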
This milestone is similar to the last one in spirit, but in practice will look very different. Evaluating this will be much more subjective given how different programming languages can be.
We’re starting to see language designers use AI much more, like Matz prototyping an AOT compiler for Ruby with Claude. I want to see more!!