Meditations on Vibe Coding
I recently joined an AI agent company, and it has me meditating on the way I do software engineering work.
A couple of years ago, not long after the release of GPT-4 and the rise of Cursor, I made the decision to stop writing code by hand. Occasionally I still had to, but I made it a point to get the AI to write the code for me through prompting, no matter what. Admittedly, this once led me to accidentally ship critical production-breaking code, courtesy of GPT-3.5.
This was a bold move at the time; it was a big bet on the inevitability that AI models would get much better very soon and be able to handle much more.
Until a couple of months ago, I used only gpt-5-high in Cursor’s agent mode for all of my coding. This shifted in late August, when OpenAI massively improved the Codex CLI and gpt-5-high could suddenly one-shot large, complex chunks of codebase work from a single prompt.
This shifted once more when gpt-5-codex came out. The Codex model was a gpt-5 model fine-tuned for agentic software engineering work inside the Codex CLI harness.
By this point these words probably sound like gibberish, so to make it simpler:
Vibe Coding is when you tell the AI to write a whole bunch of code for you autonomously. When you already know how to write software, it’s a lot easier because you can give specific direction and context to the AI and usually get what you want.
The Codex CLI and Cursor are both tools that act as agent harnesses, which is a fancy term for an environment an AI model lives in, using tools in a loop to do things inside a computer.
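If “agent harness” still sounds abstract, here is a rough sketch of the loop such a harness runs. The model call and the single tool are hypothetical stubs of my own; this is the shape of the idea, not how Cursor or the Codex CLI actually implement it.

```python
# A minimal sketch of an agent harness: call the model in a loop, let it
# invoke tools, feed the tool output back in, and stop when it says it is
# done. The model client and the tool set here are hypothetical stand-ins,
# not any real product's internals.
import subprocess

def run_shell(command: str) -> str:
    """One example tool: run a shell command and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"run_shell": run_shell}

def call_model(messages: list[dict]) -> dict:
    """Stub for a real model API. Imagine it returns either
    {"tool": "run_shell", "args": "pytest"} or {"content": "all done"}."""
    raise NotImplementedError("wire this up to your model provider of choice")

def agent_loop(task: str, max_steps: int = 30) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool" in reply:                          # the model wants to act
            output = TOOLS[reply["tool"]](reply["args"])
            messages.append({"role": "tool", "content": output})
        else:                                        # the model thinks it's finished
            return reply["content"]
    return "stopped: step limit reached"
```

Real harnesses layer much more on top of this (sandboxing, approvals, diff review), but a loop like this is the core.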
gpt-5-high is a smart little guy. gpt-5-codex-high is a really smart little guy who loves coding inside your computer until the job is done and he gets all the little details right. His logic is better, but often harder to read.
Oh, and he really takes his sweet time to get things done.
That was something I really liked about the Codex model and CLI. I could get really specific about the overall behavioral changes I wanted made to an app, and it would figure out all the logical changes perfectly. I had to really let it do what it wanted, though. That meant spending roughly 3x as much time perfecting the prompt as I used to, and then letting Codex go off and work on its own for up to half an hour. It felt, at that point, like being a passenger jet pilot letting go of the controls while the plane flew on autopilot.
Usually at this point I’d go for a walk and come back to a perfected codebase. It was wonderfully sweet, especially for specific narrow SaaS work. It was also included in the $200 a month ChatGPT Pro plan and I never hit any limits.
At the AI agent company, the expectation is that we work at the frontier of AI coding. I am allocated a generous budget for Cursor, often using the very expensive (but fast) Claude models.
Where Codex was a Leonardo da Vinci I had to be very patient with, Claude is a different beast: a wild mustang I sometimes have an unsteady relationship with. If Codex works like a careful artisan, Claude is more like a printing press. It can comprehend and churn out enormous amounts of code very quickly, but it can confidently go down wrong paths and fall into logical traps it struggles to find its way out of. Sometimes it even needs Codex’s handiwork as a hint to find its way to a solution after it has dug itself into a hole.
The dichotomy of it all is fascinating. When you direct Claude and get very specific, it can perform incredibly well (and it is often the better frontend designer). But when you want to operate at a more abstract, conceptual level and make deep logical changes, Codex has been the operator I am most fond of.
It also gets tricky to cut through the noise and figure out what everyone else is talking about in the AI coding space, as everyone has completely different coding preferences, AI names overlap, and agent tooling varies. You can hear 10 people say “Codex is great!” followed by 10 people who say “Codex sucks! Claude is great!”.
When someone says “Claude”, you never know if they mean Claude in Claude Code, Claude in Cursor, Claude in GitHub Copilot, or Claude in some completely different agent CLI like Amp, OpenCode, or FactoryAI.
It’s the same when someone says “Codex”: do they mean the Codex CLI, where one can use either the gpt-5 series or the gpt-5-codex models, or do they mean gpt-5-codex in Cursor? And do they mean gpt-5-codex-low, gpt-5-codex-medium, or gpt-5-codex-high (all different reasoning effort levels)?
And OpenAI used to have four separate products named Codex, so you could never figure out which one people were talking about when they said Codex was good or bad, or even which one they had actually tried.
All of this can make one’s head spin. But the head-spinning itself has drawn my focus. There’s only so much humans can do; our capacity to review has become the bottleneck. The differences in behavior between models and agent coding styles lead me to believe we are at a fork in the road for how humans work with coding agents.
Either we will:
A) Get better at running multiple agents in parallel on separate tasks, using better git tooling so they don’t step on each other’s toes (a rough sketch of this follows the list)
or
B) Receive much smarter high-level orchestrator agents, capable of managing and running subordinate agents in parallel so that big tasks get done in greater depth and more quickly
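To make option A concrete, here is a hedged sketch of what the git side of that could look like today. The worktree commands are real git; the task names and the agent CLI invocation are placeholders of my own.

```python
# Sketch of option A: isolate each parallel task in its own git worktree and
# branch so agents cannot trample each other's files. `git worktree add` is
# real git; "my-agent-cli" is a placeholder for whichever agent CLI you use.
import subprocess
from pathlib import Path

TASKS = {
    "fix-login-timeout": "Fix the session timeout bug on the login page.",
    "add-csv-export": "Add a CSV export button to the reports view.",
}

def launch_agent(task_name: str, prompt: str) -> subprocess.Popen:
    worktree = Path("..") / "worktrees" / task_name
    # One working directory and one branch per task keeps the agents isolated.
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{task_name}", str(worktree)],
        check=True,
    )
    # Placeholder invocation; substitute the agent CLI and flags you actually use.
    return subprocess.Popen(["my-agent-cli", "--prompt", prompt], cwd=worktree)

if __name__ == "__main__":
    agents = [launch_agent(name, prompt) for name, prompt in TASKS.items()]
    for agent in agents:
        agent.wait()  # each branch still needs a human to review and merge it
```

Every finished branch still has to be reviewed and merged by a human, which is exactly where the context-switching cost shows up.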
The mind of the human engineer is one of the most important aspects of this dilemma. It will be the guiding principle for figuring out where things go, because the future can only evolve around the cognitive architecture of the human.
As agents get ever more intelligent, the complexity and volume of work they can produce will grow significantly. In fact, it already feels like we are at a point where it is difficult for most engineers to keep up with and comprehend every single change their agents make.
Running multiple large, unrelated tasks in parallel means context-switching between them. As the depth and necessary complexity of each task grows, giving full attention to every detail while testing and switching between each agent’s results will get progressively harder for humans to handle.
However, when humans stay fixated on a single task, even one that is enormously large, complicated, and deep, they are able to get much more done the longer they work on it. Sometimes an engineer’s best work is done several hours into the day, after stuffing all of the context and understanding about the problem into their head to the point where they can account for every detail that matters.
So when I think of the future of work, I think of what will scale for the human worker: the individual who ultimately bears accountability for the outputs of the AI, aka the person who faces personal responsibility if a system fails.
For the individual human, it seems we very much scale vertically. We can go supremely deep on a complicated task, feature, or bug. We suck at multitasking. We always have, and we probably always will. Those who believe they are incredible multitaskers tend to operate only on shallower tasks, and often aren’t as productive as they think.
Therefore, it seems likely that the future of coding, and of human-AI agentic work in general, will be about going deeper on problems.
Building a new feature? Forget what the MVP is; build what would typically be considered a fourth or fifth iteration.
Fixing a bug? Add guardrails for every possibility imaginable.
Adding tests? Add full-fledged end-to-end tests for every happy-path user story related to what you’re working on.
But go outside that depth onto a completely unrelated task, and suddenly you’re juggling and swapping between completely different sets of context. That is much, much harder for the human mind to handle.
Frankly, none of this would matter if human taste didn’t matter. But it still does, because much of what is being built revolves around the human-interaction element of a system, even when that system is both built by AI and meant to control AI itself.
And so diving deep on how you as a human would prefer to use the system, and drilling away at making it perfect with as much depth and context as possible, becomes the job.
However, if we can somehow perfectly automate taste, and get AI to intuitively figure out all the necessary business logic and how the interface should look for any given application, things will change.
At the point that AI can recursively dive deep and manage all context on its own as it fleshes out the details of any given problem, the human will be freed from deeply managing the context of their AI subordinates, and can begin moving quickly and horizontally across different tasks, tackling wider ranges of problems at once.
That said, I think we are still far from that level of intelligence being widely available for many software developers.
Additionally, we should consider how AI agents are currently used by developers within the labs themselves. Nearly everyone inside OpenAI is using Codex for all of their code, and nearly 100% of pull requests within OpenAI are now also reviewed by Codex. I should add that the Codex review tool is very good; it focuses tightly on the critical logical impacts of a change rather than code-style nitpicks.
Similarly, Anthropic is heavily dogfooding Claude Code. An interesting capability of Claude Code is its subagent feature, which lets Claude act as a top-level commander that spins off other agents to work on isolated parts of a task. I believe this approach has the most promise moving forward, but of course the models themselves have to be tuned to leverage it appropriately.
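As a rough illustration of that commander-and-subagents shape (my own sketch, not Claude Code’s actual API), the pattern looks something like this: decompose, fan out, synthesize.

```python
# Conceptual sketch of the commander/subagent pattern: a top-level agent
# splits one big task into isolated pieces, subagents handle the pieces in
# parallel with narrow context, and the commander stitches the results back
# together. run_agent() is a hypothetical stand-in, not Claude Code's API.
from concurrent.futures import ThreadPoolExecutor

def run_agent(role: str, prompt: str) -> str:
    """Stub: imagine this spins up a fresh agent thread with its own context."""
    raise NotImplementedError("wire this up to a real agent runtime")

def orchestrate(task: str) -> str:
    # 1. The commander breaks the task into independent subtasks.
    plan = run_agent("planner", f"Split this into independent subtasks:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Subagents work the pieces in parallel, each seeing only its own slice.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda sub: run_agent("worker", sub), subtasks))

    # 3. The commander reviews and combines the results.
    return run_agent("reviewer", "Combine and sanity-check these results:\n" + "\n".join(results))
```

The plumbing is the easy part; the promise depends on the models being tuned to plan, delegate, and review well.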
Anyway, I think some interesting metrics for any knowledge worker to track in their career right now are:
How many agent threads (including subagents) are you spinning up and managing on a daily basis?
How many high-level tasks are you completing each week?
The other metrics (tokens generated, amount spent, lines of code, etc.) may not matter as much. After all, if a super-genius AI can solve all your problems by generating the perfect set of just a few thousand tokens, instead of a dumber AI generating hundreds of thousands of useful-but-not-quite-what’s-needed tokens, the super-genius tokens are usually preferable.
Therefore, at a higher level these metrics track:
The worker’s impact on the business
The amount of mental context the worker is managing to achieve that impact
I think these will be useful to track moving forward, because they give the individual the ability to see how much more capacity they can take on (i.e., are agents improving at spinning up and managing subagents autonomously? If so, is the amount of context the user needs to manage in a top-level discussion decreasing?). They give direct insight into how much leverage the individual is acquiring and whether the power of their toolset is growing, and they also highlight genuine limits on how much further they can go.
For instance, let’s say AI improvement stops here: autonomous subagent handling never improves, and the context lengths the models themselves can understand stop getting much better. There remains a genuine limit to what a human can manage in producing, using, and validating AI outputs. Finding that human limit will be a process of learning and discovery for each person.