Against Vibes: When Is a Generative Model Useful?
Let’s suppose I wanted to answer a question: is tool X useful for task Y? If I were scientific about this, I would analyze the properties of tool X and develop a model of it, analyze task Y and its requirements and develop a model of those, and then use my models to predict the behaviour of tool X in the context of task Y. “Can I use timber instead of stainless steel as a support beam for this structure?” “Will this acid be an appropriate solvent for this reaction?” “Will this programming language provide these real-time guarantees?”
The discourse on generative models is not like this. Instead, you get claims like “software engineering is dead”, and attempts to shove generative models into literally everything without a thought. Search? Generative models. Code completion? Generative models. Summarization? Generative models. Voice to text? Generative models. Stock images? Generative models.
Any attempt to criticize this tends to go in circles and/or have people arguing past each other. Is a generative model useful for internet search? Well, look, it produces text that is plausibly related to the input prompt, so. So… what? That doesn’t answer the question. But the newest models are so much better! Better at… what?
I was upset about this when it was being called “prompt engineering”, and I found no sign of engineering, but instead a series of vibes about how to phrase a prompt in a particular version of a particular model, which sometimes produced output that was plausibly related to the input prompt and therefore plausibly close to what you might have intended. I’m upset now when people make claims that agents are so useful, but can’t tell me when or why or how they’re useful beyond vibes about feeling more productive (vibes that have been refuted by real science contrasting objective measures of productivity with subjective reports), or examples of having produced a lot of plausible output.
(Okay, there are some researchers doing actual science and writing papers; I’m talking about the arguments being made as this stuff is integrated into schools, workplaces, etc.)
I want to know when generative models are useful. I don’t want to feel like they’re useful; that’s just a vibe. I’ve been a generative model skeptic basically from the beginning. I could not convince myself that generative models were useful. But I was also skeptical of my own subjective experience. I could imagine that a model capable of producing code from natural language would be useful, in some use cases that I had not found. I imagined there must be a model of when a generative model X is useful for a task Y.
In this post, I’m not addressing ethical, political, or social questions. Those questions are important, and I want to address them separately from what the technology is capable of. Just for context: I think the widespread deployment of this technology is deeply problematic and irresponsible. I think further investment in it at its current scale is an almost criminal level of fiduciary negligence and will cause economic harm. I think the ethics of all of this are deeply troubling.
But for now, I just want to know what they’re technically capable of.
A model of generative model utility
I think the usefulness of a generative model is a function of three things:
- What is the cost of encoding a generative task in a prompt vs. directly producing the artifact? This is a function of the task, the model, and the user.
- What is the cost of verifying the generated artifact meets requirements vs. a directly produced artifact? This is mostly a function of the task and the user, but also the generative model.
- How much is the task dependent on the artifact vs. the process? This is a function of the task.
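The three factors above can be sketched as a rough cost comparison. This is my own formalization, not one from the post, and it assumes all costs can be expressed in a single unit (say, hours of user effort):

```python
# A minimal sketch of the three-factor utility model, assuming all
# costs are measured in one unit (e.g. hours of user effort).

def model_is_useful(encoding_cost, verification_cost,
                    direct_cost, process_matters):
    """True when generating plus checking beats producing directly."""
    if process_matters:
        # A black-box generator cannot supply the process itself.
        return False
    return encoding_cost + verification_cost < direct_cost

# A task that is trivial to prompt and to check, but tedious by hand:
print(model_is_useful(0.05, 0.05, 1.0, False))   # True
```

Real costs are of course neither scalar nor known in advance; the point is only that a claim of usefulness has to compare all three quantities, not just one.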
Each of these touches on things many others have said, but I think considering all three simultaneously is important. Together they make it possible to be scientific in an argument about the use of generative models.
If you want to claim a new model is “more useful”, you must specify all of these variables. You must specify a class of tasks, and demonstrate that, for some set of users, the cost of encoding is lower than the cost of directly producing the artifacts; or perhaps that encoding costs more, but verifying design requirements costs less.
More importantly, if I want to predict whether a generative model will be useful, I have a model to work with.
My model predicts that the usefulness of a generative model may decrease as task complexity increases. Generative models are probabilistic: the output will be less likely to satisfy complex requirements, particularly if those requirements differ from common patterns in the training data, or worse, differ subtly from common patterns. Verifying complex requirements is also hard, and harder than having a human follow good engineering processes that lead to more easily verified outputs.
On the other hand, generative models should be useful when directly creating the artifact is hard for the user, but verifying the artifact is trivial. This could be the case for artifacts that require cross-referencing extremely specific information: time-consuming for a user to gather, but trivial to check once produced. It could also be the case for generative models integrated into formal verification systems with extremely reliable and highly automated verification, where no knowledge of the artifact being generated is necessary. But in general, it is unlikely to be the case for a novice in some domain trying to generate a complex artifact, since the user will not have the expertise to ensure the output meets requirements. This predicts there will still be a need for users of generative models to have domain expertise.
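A toy illustration of this “hard to produce, trivial to verify” regime (my example, not one from the post): finding the prime factors of a number takes search, but checking a proposed factorization is a single multiplication.

```python
import math

def verify_factorization(n, factors):
    # Cheap check of an expensively produced artifact, regardless of
    # how (or by what) the factors were proposed.
    return math.prod(factors) == n and all(f > 1 for f in factors)

# Whichever untrusted process proposed these factors, the check is trivial:
print(verify_factorization(391, [17, 23]))   # True
print(verify_factorization(391, [13, 30]))   # False
```

Tasks with this asymmetry are the best case for generation; tasks where checking is as hard as producing are the worst.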
The model also predicts that generative models are essentially useless for tasks that are highly process-dependent, since all a generative model can do is produce an artifact by a black-box process.
1. Relative Encoding Cost
A lot of arguments in favour of the usefulness of generative models make arguments about, in essence, the relative encoding cost. For a generative model to be useful, the total encoding cost must be lower than the total cost of directly producing the artifact.
The total encoding cost includes all the work that goes into writing a prompt, and all of the compute required to run the prompt. If the task is simple to express in a prompt, the total encoding cost is low. If the task is both simple to express in a prompt and tedious or difficult to produce directly, the relative encoding cost is low. As models get more capable, more complex prompts can be easily expressed: more semantically dense prompts can be used, referencing more information from the training data. An agent capable of refining or retrying a task after an initial prompt might succeed at a complex task after a single simple prompt. However, both of these also increase the compute cost of the prompt, sometimes substantially, driving up the total encoding cost. More “capable” models may have a higher probability of producing correct output, reducing the cost of reprompting with more information (“prompt engineering”), and possibly reducing verification costs.
Moreover, as a user gets more capable, they may be able to complete the tasks directly much faster than they can prompt a model to do it, driving up the relative encoding cost.
One may argue that the newest models are “more powerful” or “actually intelligent” or whatever. But these are unscientific claims.
The scientific version of these claims is “the total encoding cost (for some class of tasks) is lower than previous models”. Phrased this way, it’s clear this still doesn’t mean the new models are useful.
For most of my tasks, I think the relative encoding cost has been high. Many of my software engineering tasks are constructing small, semantically dense programs, with very specific design requirements, in a language much more concise than English that I can write fluently. I can systematically design and implement such software faster than I can encode the specification into a prompt for a generative model.
As one example, I tried using Claude Opus 4.6 to generate a program that would interpret a custom DSL I use for typesetting grammars, and generate Haskell type definitions. After 8 hours of prompting and several million tokens, the code it generated was still absolutely useless. It passed the tests I had prompted it on, but just looking at the code, one could easily identify type errors and logic that tried to special-case specific identifiers from the tests. The logic for sanitizing identifiers was a mess, and would occasionally generate empty strings. A correct implementation would take me 300-400 lines of code, which I can certainly write in less than 8 hours.
However, not all tasks are writing semantically dense code with very tight design requirements. For example, I was recently trying to install a package whose name I forgot. I prompted the model to “install that x11 fake gui thing”, a trivial prompt. Actually completing the task myself would have required a lot of tedious work, with lots of accidental complexity. I would have needed to search the internet to identify the name of this software, cross-reference that with the distribution of the operating system I was running and the name used by its package manager, possibly cross-reference the installation command for this particular package manager, and then write and execute a shell script to perform the install. I was able to use the agent to do all of this with an extremely easy-to-write prompt. This task had a very low relative encoding cost.
2. Relative Verification Cost
Some arguments about generative models focus on verification: “formal verification will become more important as more code is generated”.
I think these arguments are also unscientific. Verification is not magic either. Software designed one way may be easier to verify than software designed another way. A user who carefully designed and implemented the software may be able to verify it more easily than a user dropped into a fully generated code base.
All of this depends on the task, the user, and what the model is capable of generating. For my prompt to install a package, verification is trivial. I will recognize the right command when I see it, and it’s one line long.
Relative verification cost also goes up with the size of the task. If you’re generating a one-line script, no problem. If you’re trying to generate a very large artifact, you’re going to get bored validating every command, every edit. You’re not going to be able to check every line of code that was generated. You’re going to need some other approach to verifying the output, increasing cost.
Relative verification cost also depends on the user. If I’m prompting a model to produce Racket, a language I am very fluent in, I can quickly evaluate the design and implementation of the generated code. If I tried to prompt a model to produce C, I’d be far better off just writing the C myself, following a systematic approach that would result in safe C. And then running it in a sandbox. After running some sanitizers on it.
Relative verification cost somewhat depends on the capabilities of the model, too. Some of the early models I experimented with produced trash code. Not merely bad code with bad design, but code with errors so basic I wouldn’t think to look for them: it would produce Racket with mismatched parentheses, references to functions that didn’t exist, etc. Those are easy enough to detect by running the compiler, but what about the ones that aren’t so easy to detect?
One key part of this relative verification cost is that generative models produce plausible output. It’s not accurate to say a model produces “correct” or “incorrect” output, or “makes mistakes”. It does exactly what it’s designed to do: produce output that is statistically related to the input prompt, in some way. That doesn’t mean “statistically correct”, just “statistically related”. All output is correct, in the sense that all it’s supposed to be is a point in the distribution of things related to the prompt. Maybe the model produces C code with memory errors most of the time, but most C code has memory errors. Maybe it mostly produces correct bash scripts for installing packages, because most bash scripts for installing packages on the internet are correct.
The plausibility of generative model output greatly increases the relative verification cost, since the output is essentially optimized to be close to correct. I’d predict that relative verification cost could go up as the models get more complex. The class of errors we’re likely to find in generated code will be very different from the class of errors we’re used to looking for in human-written code: generated code will have subtle errors. As the models get more capable, you might be more likely to trust the output, and less likely to spot these subtle errors. This cost can be reduced by formal methods, but formal methods aren’t necessarily cheap. You might be better off with an engineer following a design process.
For some tasks, verifying the output may be impossible, or at least impossible without redoing exactly the work you were trying to use a generative model for. I think internet search is a good example of this. A model-generated response to a search query you did not know the answer to is essentially unverifiable, unless you go and search for credible sources to verify the summary. At which point, the generative model’s work was entirely wasted.
3. Artifact vs. Process
Some tasks aren’t about the output. Or maybe they aren’t just about the output, and require the output be created following a specific process.
Easy examples of this are in education. I don’t need students to implement factorial for the one billionth time because I need an implementation of factorial. They implement factorial because going through the process creates knowledge in their head. Writing code is a fundamentally different process than reading code, in the same way that writing this blog post is a fundamentally different process than reading it.
This blog post is an example of a process-driven task. I’m writing this post. My hands are typing the words that appear in this post. They are not merely typing prompts that cause a generative model to generate plausibly-related words. That’s because I’m not trying to create a blog post. I’m trying to create knowledge, within myself and then within others. Writing this post is me thinking through all the details.
Process-driven tasks also come up in engineering. Some engineering requires that specific processes be followed, because if the processes are followed, the end result will satisfy certain properties that may be difficult or impossible to verify just by looking at the artifact. A large fruit company, for example, might forbid engineers from contributing to open source projects as part of the process by which they engineer software, in order to mitigate intellectual property risks. There is no way, just looking at the software the engineer writes, to guarantee freedom from those risks.
The process argument against generative models comes up a lot. The argument goes something like “task Y is about human communication, or creativity, so generative models cannot be used to do Y”. And I really sympathize with this argument, because I think far too much is produced by ignoring the process.
But there are certainly tasks for which I really only* care about the output. My shell script example is one: I don’t care how the package gets installed; I care that it’s installed. (* Well, assuming the output wasn’t produced through some truly problematic process, which, well… but that’s a future post.)
The same can be true of writing. I am writing this post manually, because the process matters, but some writing is functional. For example, I used a generative model to draft a policy document. Policy documents don’t have much creative structure to them; they express a set of rules. I’m still reading and redrafting the document, since I need the particulars to better suit me, but it was useful to start from a generated draft, in the same way I might start from a template.
But even when the output is boring and easily verified, the process may be important. Junior engineers might write lots of boring, easily verified code. It might be extremely cheap to replace them with agents. But junior engineers writing that code are going through a process by which they gain experience, knowledge, and skills. Generative models can replace their output, but nothing can replace that process.
For almost all software I write, I do care about the process. I’m typically designing software as part of research, and me doing the design and implementation work creates knowledge that I will then share. The software isn’t the important output, or not the only important output. I think this is another big reason I haven’t found these things useful, and why it’s been such a struggle to figure out how they could possibly be useful.
We just need people who can make a computer produce useful work
So when is a generative model useful? Precisely when (1) the relative cost of encoding the work in a prompt is low (compared to doing the work some other way); and/or (2) the relative cost of verifying that the output satisfies requirements is low; and (3) the process used to complete the work doesn’t matter. To judge all of this accurately, the user of the model needs to know quite a lot about the work being done, about verifying design requirements in the domain, and about working with generative models and/or the model in question.
Navigating these trade-offs is engineering. If you’re navigating those trade-offs to produce software, you’re doing software engineering. If you’re not considering these trade-offs, you’re just going on vibes and what you produce will be something between accidentally useful and extremely harmful.
These trade-offs aren’t unique to generative models, but one thing is: they’ve made it incredibly cheap to produce an immense amount of output that is plausibly described by a natural language description. But plausible doesn’t mean useful, and there’s nothing in generative models that could ever guarantee useful output. As the models get more sophisticated, so do the outputs and the prompts. That’s not necessarily more useful. As that complexity goes up, so do the costs: of compute, of verification, and of relying on output over process.
I understand the temptation of these tools. Sometimes useful work is incredibly complex and frustrating to do. Writing software, running scripts, and organizing all my notes can be very tedious. Sometimes that is accidental complexity, but much of the time it is essential. It is very easy to use a generative model to produce output. I don’t think it’s very easy to use one to produce useful output.