AI summary
The authors studied whether asking large language models (LLMs) to reason like scientists, specifically through a prompt that mimics Popperian falsification, genuinely helps code generation. They compared the Popperian prompt with simpler controls, including a labels-only scaffold and a length-matched placebo prompt. On a small model, both the Popperian prompt and the labels-only scaffold improved code correctness, but the Popperian procedural details added no separable benefit. On a frontier model, performance was near the benchmark ceiling and the conditions did not separate. Overall, the authors conclude that, in the two settings tested, the structure of the prompt scaffold rather than its scientific-reasoning content accounts for the gains in code correctness.
large language models, prompt engineering, Popperian falsification, code generation, scaffold, ablation study, self-judge, execution correctness, HumanEval, best-of-eight
Abstract
Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: when such a skill appears to help, does the gain come from the skill's Popperian content or from the structure that any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163), all conditions sit near the benchmark ceiling and do not separate, so the pre-registered hypothesis of a +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164), the structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over the labels-only scaffold (aggregate F@8 = L@8 vs V@8 = 34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on a single candidate index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.
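To make the small-model metrics concrete, the following is a minimal sketch (not the authors' evaluation code) of how best-of-k correctness, expected random-selection accuracy, judge-selected accuracy, and pick concentration could be computed from a per-task matrix of unit-test outcomes. The function names, the toy `passed` matrix, and the `picks` list are all hypothetical, and k=4 is used for brevity where the abstract reports best-of-eight.

```python
from collections import Counter
from typing import Sequence


def best_of_k(passed: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one of the k candidates passes the tests."""
    return sum(any(row) for row in passed) / len(passed)


def random_pick_accuracy(passed: Sequence[Sequence[bool]]) -> float:
    """Expected accuracy when one candidate per task is chosen uniformly at random."""
    return sum(sum(row) / len(row) for row in passed) / len(passed)


def judge_pick_accuracy(passed: Sequence[Sequence[bool]], picks: Sequence[int]) -> float:
    """Accuracy of the candidate that the judge selected for each task."""
    return sum(row[i] for row, i in zip(passed, picks)) / len(passed)


def top_index_share(picks: Sequence[int]) -> float:
    """Share of judge picks that land on the single most-chosen candidate index."""
    return max(Counter(picks).values()) / len(picks)


# Toy data (assumed, for illustration only): 4 tasks, 4 candidates each.
# passed[t][c] is True when candidate c for task t passes every unit test.
passed = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, False],
    [False, False, False, True],
]
# Hypothetical judge selections: one candidate index per task.
picks = [0, 0, 0, 3]

print("best-of-k:", best_of_k(passed))
print("random selection:", random_pick_accuracy(passed))
print("judge selection:", judge_pick_accuracy(passed, picks))
print("top-index share:", top_index_share(picks))
```

Under these assumptions, a judge is informative only to the extent that its selected-candidate accuracy exceeds the random-selection expectation, and a high top-index share signals that its picks may reflect position rather than content.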