Prompt Engineering is Not a Real Job (But You Still Need to Learn It)
The 'prompt whisperer' industry is mostly a grift, but there are three techniques that genuinely help when working with language models.

Last October I spent four hours trying to get a language model to extract structured data from scanned restaurant menus. The task was simple enough: take a menu image that had been OCR'd into messy text, pull out dish names, prices, and dietary tags, and return JSON. I had the OCR pipeline working. The JSON schema was defined. The model kept returning garbage.
Not random garbage. Plausible garbage. It would invent dishes that weren't on the menu. It would hallucinate prices that looked reasonable but were wrong. It would mark things as vegan that clearly contained shrimp. And the maddening part was that sometimes, maybe one in five tries, the output was perfect. Exactly right. So I knew the model could do it. It just... mostly didn't.
I tried rewriting the prompt a dozen times. I tried being polite. I tried being aggressive. I tried wrapping the whole thing in XML tags because someone on Twitter said that helped. I tried telling it to "take a deep breath." I tried offering it a hypothetical tip. None of this made a consistent difference, and if you've spent real time working with these models you already know why: most of the prompt engineering advice floating around is cargo cult nonsense dressed up in blog posts and $200 courses.
But here's the thing. Eventually I did fix it. And looking back, the fix wasn't some magic incantation. It was boring. It was just... communicating more carefully. Which is, I think, the only honest thing you can say about this whole field.
The part where most advice falls apart
There's a strange disconnect between how prompt engineering is discussed online and how it works in practice. The online version is full of "secret techniques" and numbered lists and confident claims that adding the phrase "you are an expert" to the beginning of a prompt will improve output quality by 40%. People screenshot their ChatGPT conversations where they got a nice result and present it as proof that their method works. Nobody talks about the four attempts before that one where the method produced nothing useful.
I've tested a lot of these so-called techniques over the past year or so. Persona prompting, where you tell the model to act as a specific role. Temperature tweaking. The "tree of thought" approach. Prompt chaining across multiple calls. Some of them help sometimes. Most of them don't help reliably enough to build anything real on top of.
The persona trick is a good example of something that sort of works but not for the reasons people think. When you write "You are a senior database administrator with 20 years of experience" at the top of your prompt, you're not giving the model expertise. The model doesn't have experience. What you're doing, as far as I can tell, is nudging the token distribution toward patterns associated with technical database content rather than, say, casual forum posts. It's a filter, not a teacher. And sometimes that filter helps. When you're asking about PostgreSQL indexing strategies, framing the response space around database-expert-sounding text probably does bias the output toward more accurate technical content. But when you're asking it to analyze a specific schema you've pasted in, whether the model is "pretending" to be a DBA or not makes almost no measurable difference to correctness.
The whole "emotional manipulation" thing is even weaker. "This is very important to my career." "My boss will fire me if this is wrong." People report that these phrases improve output quality. Maybe they do, maybe they trigger some statistical association with high-stakes human writing that happened to be more careful. But in my testing across hundreds of calls through an API, appending emotional stakes to a prompt had no statistically significant effect on the accuracy of structured extraction tasks. Zero. It might matter for creative writing where "quality" is subjective. For anything you can actually measure? I haven't seen it.
What actually moved the needle
So back to the menu extraction problem. What fixed it wasn't cleverness. It was three things, and they're all sort of the same thing wearing different hats.
The first was being obnoxiously specific about the output format. Not "return JSON" but spelling out the exact schema, field by field, with types and constraints. I went from asking for structured data to providing something like "Return a JSON array where each element has the fields name (string, exactly as written on the menu, do not paraphrase), price (number, in dollars, null if not listed), and dietary_tags (array of strings, only from the set: vegan, vegetarian, gluten-free, contains-nuts, contains-shellfish, or empty array if none indicated)." The prompt went from two sentences to a full paragraph of specification. It felt like overkill. It worked.
And I think this is the thing that actually matters about system prompts and instruction design. People talk about system prompts as if the key insight is that you can set a "role" for the model. That's the least interesting part. The interesting part is that you can constrain the output space before the model starts generating. Every constraint you add (this field must come from this set, this value must be a number, don't include conversational filler) narrows what the model considers valid next tokens. You're not making it smarter. You're making it harder for it to go wrong.
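To make that concrete, here's roughly what the constrained version looks like as code. A minimal sketch: I'm using the OpenAI Python client here purely for illustration, the model name is a placeholder, and the prompt wording is an approximation rather than a copy of my production prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Every constraint in this prompt narrows the space of outputs the model
# will treat as valid: field names, types, the allowed tag values, and
# what to do when information is missing.
EXTRACTION_PROMPT = """Return a JSON array. Each element must have exactly these fields:
- "name": string, the dish name exactly as written on the menu, not paraphrased
- "price": number, in dollars, or null if no price is listed
- "dietary_tags": array of strings drawn only from this set:
  vegan, vegetarian, gluten-free, contains-nuts, contains-shellfish.
  Use an empty array if the menu indicates nothing.

Do not invent dishes. Do not add fields. Return only the JSON array, no commentary.

Menu OCR text:
{ocr_text}
"""

def extract_menu(ocr_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you actually have
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(ocr_text=ocr_text)}],
        temperature=0,  # low variance for a structured extraction task
    )
    return response.choices[0].message.content
```

Almost all of the work is in the string. The client call is boring, which is sort of the point.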
The second thing was examples. I hate saying "few-shot prompting" because it sounds like jargon meant to make a simple idea seem more impressive than it is. But showing the model two or three completed input-output pairs before asking it to handle a new input made an enormous difference. Bigger than any other single change. For the menu task, I took three real menus I had already manually extracted, pasted the OCR text and the correct JSON output for each, and then pasted the new menu I wanted processed. Accuracy went from maybe 60% to something like 90%.
Why does this work so well? My best guess (and it's a guess, because nobody really knows how these things work inside) is that examples do something that instructions can't. Instructions tell the model what you want in natural language, but natural language is ambiguous. When I write "extract dish names exactly as written," what does "exactly" mean? Does it include the description after the dish name? Does it include the price that's on the same line? The model has to interpret my instructions and that interpretation varies run to run. But when I show it three examples of what "exactly as written" looks like for me, for this task, with this kind of data, it pattern-matches from the examples instead of interpreting my words. The examples are the spec. The instructions are just commentary.
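In practice the examples just get stacked in front of the new input. One common way to do that with a chat API is to replay them as alternating user and assistant messages, which is what the sketch below does. The toy menus stand in for the three real ones I had hand-extracted, and the model name is again a placeholder.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hand-verified (OCR text, expected output) pairs. The real ones were full
# menus; these toy examples just show the shape of the data.
EXAMPLES = [
    (
        "GARDEN SALAD mixed greens, cherry tomato 8.50 (veg)\nSHRIMP TACOS corn tortilla 12",
        [
            {"name": "GARDEN SALAD", "price": 8.50, "dietary_tags": ["vegetarian"]},
            {"name": "SHRIMP TACOS", "price": 12, "dietary_tags": ["contains-shellfish"]},
        ],
    ),
    # ...two more pairs in the real pipeline
]

def build_messages(new_menu_text: str) -> list[dict]:
    messages = [{"role": "system", "content": "Extract menu items as JSON, following the examples."}]
    for ocr_text, expected in EXAMPLES:
        messages.append({"role": "user", "content": ocr_text})
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": new_menu_text})
    return messages

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=build_messages("MARGHERITA PIZZA tomato, basil, fresh mozzarella 14 (veg)"),
    temperature=0,
)
print(response.choices[0].message.content)
```

Pasting everything into one long user message, the way the paragraph above describes, is the same idea; the chat-message layout just makes the pairs explicit.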
The third thing was breaking the task apart. Not in the "chain of thought" sense of asking the model to show its reasoning, though that's related. I mean literally splitting the prompt into stages. First call: "Here is OCR text from a restaurant menu. List every distinct dish mentioned, one per line, with nothing else." Second call: "Here are dish names from a menu and the original OCR text. For each dish, find the price. If no price is listed, write NULL." Third call: assembly into the final schema with dietary tags. Each call was simple enough that the model almost couldn't get it wrong. The final output was better than any single-call approach I tried, even with perfect instructions and examples.
This is the chain-of-thought idea taken to its logical conclusion, I think. People talk about chain of thought as asking the model to "think step by step" inside a single prompt. And that does help for reasoning tasks: math problems, logic puzzles, anything where the answer depends on intermediate calculations. When the model writes out "First, I need to find X, which means calculating Y" before jumping to the answer, those intermediate tokens become part of the context and give the prediction engine more signal for the final answer. It's not thinking. It's context enrichment. But for complex structured extraction, I found it even more effective to actually separate the steps into distinct API calls. Each call has a narrow, well-defined job. The model can't get confused about what it's supposed to be doing because it's only doing one thing.
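Here's the shape of the split-into-stages version, with the same caveats as before: the model name is a stand-in and the prompt text is illustrative rather than verbatim.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def ask(prompt: str) -> str:
    """One call, one narrow job, deterministic settings."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def extract_menu(ocr_text: str) -> str:
    # Stage 1: just enumerate the dishes, nothing else.
    dishes = ask(
        "Here is OCR text from a restaurant menu. List every distinct dish "
        f"mentioned, one per line, with nothing else.\n\n{ocr_text}"
    )
    # Stage 2: attach a price to each dish, using the original text.
    prices = ask(
        "Here are dish names from a menu and the original OCR text. For each "
        "dish, find the price. If no price is listed, write NULL.\n\n"
        f"Dishes:\n{dishes}\n\nOCR text:\n{ocr_text}"
    )
    # Stage 3: assemble everything into the final JSON schema, adding dietary tags.
    return ask(
        "Combine the dishes and prices below into a JSON array with the fields "
        "name, price, and dietary_tags (vegan, vegetarian, gluten-free, "
        "contains-nuts, contains-shellfish), using the OCR text to decide the tags.\n\n"
        f"Dishes and prices:\n{prices}\n\nOCR text:\n{ocr_text}"
    )
```

Three calls instead of one means more latency and cost; whether that trade-off is worth it depends on how much the accuracy matters for your task.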
The stuff I'm less sure about
There's a whole category of prompt engineering advice that I can't confidently say is good or bad. It just... depends.
Formatting your prompt in XML-like tags, for instance. Some people swear by wrapping sections in <context>, <instructions>, <examples> blocks. I've done this and it seems to help with longer prompts where there's a lot of different content mashed together. But does it help because the model was trained on XML and recognizes the structure? Or would markdown headers work just as well? I've tried both and honestly can't tell the difference consistently. My instinct says the tags help more when the prompt is over a thousand tokens or so, where the model might lose track of which part is instructions and which part is data. For shorter prompts it's probably noise.
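For what it's worth, the tag approach is nothing more than string assembly, which is part of why it's hard to see how it could hurt. A quick sketch; the section names are arbitrary:

```python
def build_tagged_prompt(instructions: str, examples: str, ocr_text: str) -> str:
    # The tags don't do anything magical; they just make the boundaries
    # between instructions, examples, and data hard to miss in a long prompt.
    return (
        f"<instructions>\n{instructions}\n</instructions>\n\n"
        f"<examples>\n{examples}\n</examples>\n\n"
        f"<menu_text>\n{ocr_text}\n</menu_text>"
    )
```

Swapping the tags for markdown headers is a one-line change, which makes it easy to test both if you care.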
Temperature is another one. The standard advice is to use low temperature for factual tasks and higher temperature for creative ones. Sure, fine, that makes sense directionally. But the specific value? I've seen people argue with great conviction that 0.3 is the sweet spot for code generation while 0.7 is right for marketing copy. I don't buy it. In my experience the difference between 0.2 and 0.5 on a factual task is negligible if the prompt is well-constructed. Temperature matters most when the prompt is ambiguous: it amplifies the variance that's already there. Fix the prompt and temperature becomes less important. I usually just leave it at whatever the default is unless I have a specific reason to change it.
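If you want to check how much temperature is actually doing for a given prompt, the cheapest test I know of is to run the same prompt a handful of times at two settings and count the distinct outputs. A rough sketch, again with a placeholder model name and prompt:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPT = "Extract the dish names from this menu text as a JSON array: ..."  # placeholder

def sample_outputs(prompt: str, temperature: float, n: int = 5) -> Counter:
    """Run the same prompt n times at one temperature and count distinct outputs."""
    outputs = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        outputs.append(response.choices[0].message.content.strip())
    return Counter(outputs)

# A well-constrained prompt tends to produce near-identical counts at both
# settings; an ambiguous one fans out much more at the higher temperature.
print(sample_outputs(PROMPT, temperature=0.2))
print(sample_outputs(PROMPT, temperature=0.7))
```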
And then there's the meta-prompting stuff. Asking the model to critique its own output, or having one call generate a response and a second call evaluate it. This is interesting and I've seen it work for catching certain kinds of errors. But it also doubles your API costs and latency, and the evaluator model has the same blind spots as the generator model. A model that confidently hallucinates a fake dish name will often confidently approve that same hallucinated dish name when asked to review the output. You get diminishing returns fast. For anything that really matters, a human still needs to review the output. That hasn't changed.
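For completeness, the generate-then-review pattern looks something like the sketch below. Same assumptions as the earlier snippets, and note that the reviewer is the same model with a different framing, which is exactly why it shares the generator's blind spots.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def generate_and_review(ocr_text: str) -> tuple[str, str]:
    # First call: produce the extraction.
    draft = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Extract the menu items below as JSON:\n\n{ocr_text}"}],
        temperature=0,
    ).choices[0].message.content

    # Second call: ask the same model to audit the first call's output.
    # This catches format slips and obvious omissions, but it will happily
    # approve a hallucinated dish it might have generated itself.
    review = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Check the JSON below against the menu text. List any dishes that "
                "appear in the JSON but not in the menu, any prices that don't match, "
                f"and any malformed fields.\n\nMenu:\n{ocr_text}\n\nJSON:\n{draft}"
            ),
        }],
        temperature=0,
    ).choices[0].message.content

    return draft, review
```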
The uncomfortable truth about all of this
Here's what I keep coming back to. The stuff that works in prompt engineering is, at its core, just clear communication. Be specific about what you want. Show examples of what correct output looks like. Break complex tasks into simple steps. These aren't techniques. They're what a competent manager does when delegating work to a new employee. The entire field of prompt engineering, stripped of its mystique, is project management for a very fast, very unreliable, very well-read intern.
Which raises the question of whether prompt engineering is going to remain a thing. The models keep getting better. Tasks that required elaborate prompt acrobatics a year ago now work with a simple, direct question. The menu extraction problem I described? I recently tried it again with a newer model and a bare-bones prompt. It worked almost as well as my carefully engineered multi-stage pipeline. Not quite as well, but close enough that I'm not sure the engineering effort was worth it anymore.
I've noticed this pattern repeatedly. You spend weeks crafting the perfect prompt pipeline for some task, and then six months later a new model comes out and a naive prompt gets 95% of the way there. The shelf life of prompt engineering work is short. It's like optimizing assembly code in the 1990s: valuable at the time, but the compiler was about to get a lot better.
So should you learn prompt engineering? I think... sort of? The underlying skill (being precise about requirements, providing good examples, decomposing problems) is timeless. Those are just engineering skills applied to a new interface. The specific tricks and templates and frameworks? Those are ephemeral. They're tuned to the quirks of current models, and current models are a snapshot of a thing that's changing fast.
I still catch myself spending too long tweaking a prompt when I should be questioning whether the model is the right tool for the task at all. There's a bias that sets in once you've invested time in prompt engineering: every problem starts looking like a prompting problem. Sometimes you just need a regex. Sometimes you need a database query. Sometimes you need a human. The best prompt engineers I know are the ones who are quickest to say "this isn't a language model problem" and reach for something else.
Where is this all headed? I genuinely don't know. The optimistic view is that prompting becomes trivially easy as models get better at understanding intent, and the whole discipline dissolves into regular software engineering. The pessimistic view is that models plateau and prompt engineering becomes an increasingly arcane specialization, like SEO but for AI: a field of diminishing returns and escalating complexity that exists only because the underlying system is imperfect. I'd bet on something in between, but honestly, I'd bet with low confidence. The pace of change makes prediction feel pointless.
For now, what I can say is this: if you're working with language models and getting inconsistent results, the answer is probably not a cleverer prompt template you found on Reddit. It's probably that you need to be more specific about what you want, show the model what good output looks like, and break the task into smaller pieces. That's unglamorous. It doesn't make for a good Twitter thread. But after a year of doing this daily, in production, with actual users depending on the output, it's what I've got.
Maybe that'll be different next year. Probably will be, actually. I keep writing things down about how these models work and then having to cross them out six months later. So take all of this with appropriate uncertainty. Including the uncertainty itself, which is maybe the most honest thing I can offer on the topic at this point.
Written by
Anurag Sinha
Developer who writes about the stuff I actually use day-to-day. If I got something wrong, let me know.