Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
2026-04-09 • Software Engineering
Software Engineering • Cryptography and Security
AI summary
The authors studied how well four advanced language models can find security weaknesses in code, especially those involving multiple connected functions. They tested these models on different amounts of context—from just one function to including its callers and callees—using real vulnerabilities in C, C++, and Python. Their results show that giving the models more context helps detect tricky vulnerabilities, and some models like Gemini 3 Flash and Claude Haiku 4.5 perform particularly well. This work helps improve AI tools that check code security across different programming languages.
Large Language Models • Vulnerability Detection • Interprocedural Dependencies • Code Analysis • F1 Score • Callers and Callees • ReposVul Dataset • C Programming • Python • Automated Security Tools
Authors
Kevin Lira, Baldoino Fonseca, Davy Baía, Márcio Ribeiro, Wesley K. G. Assunção
Abstract
Large Language Models (LLMs) have emerged as a promising approach to automated vulnerability detection. However, most prior studies have used LLMs to detect vulnerabilities only within single functions, and so overlook vulnerabilities that arise from data and control flows spanning multiple functions. Leveraging the context provided by callers and callees may help identify such vulnerabilities. This study empirically investigates the detection effectiveness, inference cost, and explanation quality of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To this end, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function only, target function + callers, and target function + callees) and evaluating the four LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, and that Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that generalize across codebases in multiple programming languages.
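To make the three context configurations and the F1 metric concrete, the following is a minimal Python sketch, not the authors' pipeline. The `FunctionSample` record, the `build_context` helper, the level names, and the section headers in the assembled text are all hypothetical and chosen only for illustration; the abstract does not describe the actual prompt format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FunctionSample:
    """Hypothetical record: one target function plus its call-graph neighbours."""
    target: str
    callers: list[str] = field(default_factory=list)
    callees: list[str] = field(default_factory=list)


# Assumed names for the three context levels mentioned in the abstract.
CONTEXT_LEVELS = ("code_only", "callers", "callees")


def build_context(sample: FunctionSample, level: str) -> str:
    """Assemble the code shown to the model for a given context level."""
    if level not in CONTEXT_LEVELS:
        raise ValueError(f"unknown context level: {level!r}")
    parts = ["### Target function\n" + sample.target]
    if level == "callers":
        parts += ["### Caller\n" + c for c in sample.callers]
    elif level == "callees":
        parts += ["### Callee\n" + c for c in sample.callees]
    return "\n\n".join(parts)


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1: harmonic mean of precision and recall (1 = vulnerable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, each sample would be rendered once per context level, sent to each model, and the binary predictions scored with `f1_score` per language and configuration.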