Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems
2026-04-08 • Software Engineering • Artificial Intelligence
AI summary
The authors explain that large language models (LLMs) let students produce working code without genuinely understanding it, which undermines conventional grading. They review three main families of systems that use conversation to assess programming skills: rule-based or template-driven systems, LLM-based systems, and hybrids combining both. They propose a Hybrid Socratic Framework that pairs deterministic code analysis with a dual-agent conversational layer to ask guiding questions and verify understanding. They also suggest safeguards against hallucinated or copied answers and for protecting student privacy. The framework is meant to complement, not replace, traditional testing by confirming that students truly grasp the code they submit.
Large Language Models • Automated Programming Assessment • Conversational Agents • Rule-based Systems • Hybrid Systems • Code Understanding • Socratic Method • Scaffolded Questioning • Runtime Verification • Privacy Safeguards
Authors
Eduard Frankford, Erik Cikalleshi, Ruth Breu
Abstract
Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
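The "randomized trace questions tied to concrete execution states" idea can be illustrated with a minimal sketch: a question is generated from an actual execution of the student's submitted code, so the expected answer is a deterministic runtime fact rather than something an LLM must guess. All names here (`student_gcd`, `make_trace_question`, `verify_answer`) are illustrative assumptions, not the paper's implementation.

```python
import random

def student_gcd(a, b):
    # Stand-in for a student's submitted solution.
    while b:
        a, b = b, a % b
    return a

def make_trace_question(fn, rng):
    """Pick random inputs, execute the submission, and return a
    question together with a deterministically verified answer."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    expected = fn(a, b)  # runtime fact, computed by actually running the code
    question = f"What value does your function return for inputs ({a}, {b})?"
    return question, expected

def verify_answer(student_answer, expected):
    # Grading is a plain equality check against the recorded runtime fact,
    # so a hallucinated or memorized explanation cannot pass by itself.
    return student_answer == expected

rng = random.Random(42)  # per-student seed makes questions hard to share
question, expected = make_trace_question(student_gcd, rng)
```

Because inputs are drawn from a per-student random seed, two students with identical code still face different trace questions, which is one way the abstract's integrity concerns could be addressed.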