A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings
2026-02-17 • Software Engineering
AI summary
The authors studied how well large language models (LLMs) keep code behaving the same when they rewrite it, a process called refactoring. Instead of relying on predefined test cases, they used a method called differential fuzzing, which automatically generates many random inputs to check whether the rewritten code behaves like the original. They evaluated six different LLMs and found that 19-35% of the time, the rewritten code did not behave the same as the original. They also found that about 21% of these non-equivalent rewrites were missed by the existing test suites, meaning current tests may not catch all errors in LLM-refactored code.
large language models, code refactoring, functional equivalence, differential fuzzing, automated testing, test cases, software correctness, semantic divergence, code evaluation, program semantics
Authors
Simantika Bhattacharjee Dristi, Matthew B. Dwyer
Abstract
With the rapid adoption of large language models (LLMs) in automated code refactoring, assessing and ensuring functional equivalence between LLM-generated refactorings and the original implementation becomes critical. While prior work typically relies on predefined test cases to evaluate correctness, in this work we leverage differential fuzzing to check functional equivalence in LLM-generated code refactorings. Unlike test-based evaluation, a differential fuzzing-based equivalence checker needs no predefined test cases and can explore a much larger input space by executing and comparing thousands of automatically generated test inputs. In a large-scale evaluation of six LLMs (CodeLlama, Codestral, StarChat2, Qwen-2.5, Olmo-3, and GPT-4o) across three datasets and two refactoring types, we find that LLMs show a non-trivial tendency to alter program semantics, producing functionally non-equivalent refactorings in 19-35% of cases. Our experiments further demonstrate that about 21% of these non-equivalent refactorings remain undetected by the existing test suites of the three evaluated datasets. Collectively, the findings of this study imply that reliance on existing tests may overestimate functional equivalence in LLM-generated code refactorings, which remain prone to semantic divergence.
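To illustrate the general idea of a differential fuzzing-based equivalence check described in the abstract, the sketch below runs an original function and its refactored version on many randomly generated inputs and flags the first observed divergence. This is a minimal illustration, not the authors' tool: the function names, input generator, and trial count are illustrative assumptions.

```python
import random

def differential_fuzz(original, refactored, gen_input, trials=10_000, seed=0):
    """Hypothetical equivalence check: run both versions on randomly
    generated inputs and report the first behavioral divergence found."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = gen_input(rng)
        try:
            out_orig = ("ok", original(*args))
        except Exception as e:           # treat exception type as observable behavior
            out_orig = ("exception", type(e).__name__)
        try:
            out_ref = ("ok", refactored(*args))
        except Exception as e:
            out_ref = ("exception", type(e).__name__)
        if out_orig != out_ref:
            return {"equivalent": False, "input": args,
                    "original": out_orig, "refactored": out_ref}
    return {"equivalent": True, "trials": trials}

# Illustrative usage: a refactoring that silently changes semantics for
# negative inputs is caught by random integer inputs, even though a small
# hand-written test suite covering only non-negative values would miss it.
original = lambda x: abs(x) + 1
refactored = lambda x: x + 1   # not equivalent for x < 0
print(differential_fuzz(original, refactored,
                        lambda rng: (rng.randint(-1000, 1000),)))
```

Because the comparison is over observed behavior (return values and exception types) rather than a fixed set of expected outputs, such a checker needs no predefined test cases, which is the contrast with test-based evaluation that the abstract draws.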