What Doesn't Work (and Myths)

Why this matters

Prompt engineering accumulated a thick layer of folklore before anyone measured it carefully. Tricks spread on social media because they sounded plausible and someone posted a screenshot where they appeared to work. The problem: a single screenshot is an anecdote, not evidence. Large language models are stochastic, and on any given input a "magic phrase" can coincide with a better answer by chance. When you scale that anecdote to a production system handling thousands of requests, the magic evaporates and you're left paying for tokens and complexity that buy nothing.

The honest framing — the one Schulhoff and collaborators adopt in The Prompt Report — is empirical. A technique is only worth keeping if it beats a baseline across a real test set, measured against your own task. That discipline immediately demotes several beloved tactics from "best practice" to "unproven." This lesson covers the most important ones so you stop cargo-culting and start measuring.

Three myths the evidence doesn't support

1. Role prompting improves accuracy

"You are a world-class mathematician" or "Act as a senior security engineer" is the single most repeated prompt trick. The intuition is seductive: prime the model into an expert persona and it answers like an expert. For accuracy on objective tasks — math, classification, factual QA — the empirical prompt-engineering literature finds no statistically significant benefit. Studies that swept many personas across benchmark questions found that the best persona was essentially unpredictable and the average effect was noise. You cannot know in advance which role helps, which is the tell-tale signature of a non-effect.

Where role prompting does earn its keep is style and framing, not correctness. "Explain this to a five-year-old" genuinely changes register. "Write as a terse senior engineer reviewing a PR" genuinely changes tone and length. So keep role prompting for shaping how the model speaks; drop it as a lever for whether the answer is right.

2. Threats and bribes make the model try harder

"I'll tip you $200." "My job depends on this." "If you get this wrong, something terrible will happen." These circulated as accuracy boosters. The rigorous evidence is thin to nonexistent. Early viral claims were based on small, uncontrolled tests, and later attempts to reproduce them across models and tasks did not hold up. Schulhoff has been blunt about this in interviews: the tip/threat genre is folklore, not a finding. Omitting it costs you nothing; keeping it costs credibility and can subtly distort outputs (a model told a life depends on the answer may become evasive or over-hedged). Don't build it into a system prompt you'll maintain for a year.

3. Chain-of-thought is always worth it

"Let's think step by step" is real — it was one of the genuine breakthroughs. But "always add CoT" is the myth. The benefit is model- and task-dependent:

On reasoning-heavy tasks (multi-step math, logic), explicit CoT historically gave large gains on models that weren't trained to reason.
On modern reasoning models that already think internally, bolting on "think step by step" is redundant and can even hurt by fighting the model's native process.
On simple lookups or classification, CoT adds latency and tokens for no accuracy gain, and can introduce errors by rationalizing a wrong answer.

So CoT is a tool with a domain of validity, not a universal upgrade. Treat "should I use CoT here?" as an empirical question per model and per task, not a default.

How to separate folklore from findings

The method is the same regardless of which trick you're evaluating:

1. Build a small graded test set: 50 to a few hundred representative inputs with known-good answers or a clear rubric.
2. Establish a baseline prompt with the trick removed.
3. Add the trick as the only change and re-run. Change one variable at a time.
4. Measure the delta on your metric, and run multiple times, because stochasticity means a single run can mislead. If the gain is within run-to-run noise, the trick does nothing for you.
5. Keep it only if it clears the bar on your task. A technique that helps GPT-class models on GSM8K may do nothing for your Haiku-powered support classifier.
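
The steps above can be sketched as a small comparison harness. This is a minimal illustration under stated assumptions, not a reference implementation: `make_fake_model` is a hypothetical stand-in for a real model call (it ignores the instruction text and adds random errors, so the true effect of any trick is zero), and the dataset, prompts, and function names are invented for the example.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (ticket text, expected label)


def make_fake_model(seed: int, error_rate: float = 0.1) -> Callable[[str], str]:
    """Hypothetical stand-in for a real LLM call: keyword rules plus random errors."""
    rng = random.Random(seed)
    labels = ["billing", "bug", "feature-request"]

    def model(full_prompt: str) -> str:
        # Only the ticket text is inspected, so the instruction wording has no effect.
        ticket = full_prompt.rsplit("Ticket:", 1)[-1].lower()
        if "charged" in ticket:
            guess = "billing"
        elif "crash" in ticket:
            guess = "bug"
        else:
            guess = "feature-request"
        if rng.random() < error_rate:  # simulate sampling noise
            guess = rng.choice([label for label in labels if label != guess])
        return guess

    return model


def accuracy(prompt: str, model: Callable[[str], str], dataset: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label equals the expected label."""
    hits = sum(model(f"{prompt}\nTicket: {text}") == label for text, label in dataset)
    return hits / len(dataset)


def compare(baseline: str, variant: str, model: Callable[[str], str],
            dataset: Sequence[Example], runs: int = 5) -> Dict[str, float]:
    """Score both prompts `runs` times; report the mean delta and run-to-run spread."""
    base = [accuracy(baseline, model, dataset) for _ in range(runs)]
    var = [accuracy(variant, model, dataset) for _ in range(runs)]
    return {
        "baseline_mean": statistics.mean(base),
        "variant_mean": statistics.mean(var),
        "delta": statistics.mean(var) - statistics.mean(base),
        "noise": max(statistics.stdev(base), statistics.stdev(var)),
    }


dataset: List[Example] = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I log in", "bug"),
    ("Please add a dark mode", "feature-request"),
] * 20  # 60 toy examples; a real set should be varied and hand-labeled

baseline = "Classify the ticket as billing, bug, or feature-request."
variant = "You are a world-class support triage expert. " + baseline  # the only change
result = compare(baseline, variant, make_fake_model(seed=0), dataset)
print(result)
```

In real use, `model` would wrap whatever client you actually ship with. Compare `delta` against `noise`: a delta that is no larger than the run-to-run spread is the "does nothing for you" case from step 4.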

A worked example

Suppose you're classifying support tickets into billing, bug, or feature-request. A teammate insists on prepending "You are a world-class support triage expert with 20 years of experience. A customer's livelihood depends on correct routing." You run both prompts over 200 labeled tickets, five times each. Baseline accuracy lands around 0.89 with a spread of ±0.02 across runs. The "expert persona + stakes" version lands at 0.88–0.90 — fully inside the noise band. Conclusion: the persona and the stakes buy nothing here. You delete 40 tokens from every request, cut cost and latency, and remove a line nobody can explain in six months. That is the payoff of measuring instead of believing.
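
The "inside the noise band" judgment can be written down as a short check. This is a sketch under assumptions: the per-run accuracies below are invented to match the ranges quoted in the example, and the threshold of two baseline standard deviations is a rule of thumb, not a formal significance test.

```python
import statistics
from typing import Sequence


def within_noise(baseline_runs: Sequence[float], variant_runs: Sequence[float],
                 k: float = 2.0) -> bool:
    """True when the variant's mean is within k baseline standard deviations of the baseline mean."""
    base_mean = statistics.mean(baseline_runs)
    base_sd = statistics.stdev(baseline_runs)
    return abs(statistics.mean(variant_runs) - base_mean) <= k * base_sd


# Illustrative numbers only: five runs per prompt over the 200 labeled tickets.
baseline_runs = [0.87, 0.89, 0.91, 0.89, 0.89]  # around 0.89, spread of about 0.02
variant_runs = [0.88, 0.90, 0.89, 0.88, 0.90]   # persona + stakes, 0.88 to 0.90

if within_noise(baseline_runs, variant_runs):
    print("No measurable effect: drop the persona and the stakes line.")
else:
    print("Difference exceeds run-to-run noise: worth a closer look.")
```

With only five runs per prompt the estimate of the spread is itself rough, so a borderline result is a reason to run more trials rather than to declare a winner.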

Pitfalls

Confirmation bias on a single output. You added a trick, the next answer was good, you concluded it works. One sample of a random process proves nothing.
Generalizing across models. Findings are tied to the model family and version tested. A 2023 result on an instruct model may not transfer to a 2026 reasoning model. Re-test on the model you actually ship.
Confusing style wins with accuracy wins. Role prompting changing the tone is real and useful; don't let that fool you into thinking it fixed correctness.
Stacking unmeasured tricks. Five folklore phrases bolted together produce a brittle, bloated prompt where you can't attribute any effect — and you'll be afraid to touch it.
Treating absence of evidence as proof of harm. "No significant accuracy effect" means don't rely on it for accuracy, not "it's forbidden." Use these tactics where they legitimately help (style, format), just not where the evidence is empty.

The throughline: be empirical, be honest about what's mixed, and let your test set — not a viral screenshot — decide what stays in your prompt.

Role prompting for accuracy

Why it's better: The expert persona has no measurable effect on arithmetic accuracy — you can't predict which persona helps, the signature of a non-effect. What actually moves correctness here is requesting explicit reasoning plus a structured final answer, which is testable and reproducible.
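
A structured final answer is what makes the result gradable by a script. As a minimal sketch, assume the prompt asks the model to end its reasoning with a line of the form `ANSWER: <number>`; that convention and the helper names below are hypothetical choices for illustration, not something the lesson prescribes.

```python
import re
from typing import Optional

# Matches a whole line such as "ANSWER: 55" or "ANSWER: -3.5".
ANSWER_RE = re.compile(r"^ANSWER:\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)


def extract_answer(response: str) -> Optional[float]:
    """Return the number on the last ANSWER line, or None if there is none."""
    matches = ANSWER_RE.findall(response)
    return float(matches[-1]) if matches else None


def is_correct(response: str, expected: float, tol: float = 1e-9) -> bool:
    """Grade a response; a missing or unparseable final answer counts as wrong."""
    got = extract_answer(response)
    return got is not None and abs(got - expected) <= tol


sample = "17 * 3 = 51\n51 + 4 = 55\nANSWER: 55"
print(is_correct(sample, 55))
```

Counting an unparseable response as wrong keeps the metric honest: a prompt that produces free-form answers cannot score well by accident.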

Stakes / bribery framing

Why it's better: The threat-and-bribe language has no rigorous evidence behind it and can make the model over-hedge or refuse. The reliable gains come from a precise schema, an explicit null policy, and a parseable output format — all of which you can verify against a labeled set.
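
A schema with an explicit null policy can be enforced mechanically. The sketch below assumes the model is asked to return a JSON object; the field names in `SCHEMA` are hypothetical, and the rule that every key must be present with `null` for unknown values is one possible null policy, chosen here for illustration.

```python
import json
from typing import Optional

# Hypothetical schema for illustration: field name -> allowed Python types.
SCHEMA = {"customer_name": str, "order_id": str, "amount": (int, float)}


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the parsed record if it satisfies SCHEMA, else None.

    Null policy: every field must be present, and a value the model could
    not find must be JSON null rather than a guess or an omitted key.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        return None
    for field, expected_type in SCHEMA.items():
        value = data[field]
        if value is None:
            continue  # explicit null is allowed for any field
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return data


print(parse_extraction('{"customer_name": "Ada", "order_id": null, "amount": 12.5}'))
```

The share of responses that pass this check, plus field-level accuracy on the ones that do, gives two numbers to compare between prompt variants on a labeled set.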

Reflexive chain-of-thought

Why it's better: For a trivial classification, forced CoT adds latency, tokens, and a chance to rationalize a wrong answer — and on a modern reasoning model it duplicates internal thinking. Reserve explicit CoT for genuinely multi-step problems where a test set shows it helps on your target model.
