From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
2026-04-17 • Computation and Language • Artificial Intelligence
AI summary
The authors studied how well four advanced language models simplify difficult Vietnamese legal texts to make them easier to understand. They evaluated the models on accuracy, readability, and consistency, then performed a detailed error analysis to explain those results. They found a trade-off: some models write clearly but make more legal mistakes, while others are accurate yet still contain hidden reasoning errors. The most common problems were giving wrong examples and misinterpreting legal rules, showing that careful legal reasoning is the hardest part for these models. Their combined method gives a clear and useful picture of how current language models perform on legal tasks.
Large Language Models • legal text simplification • Vietnamese law • accuracy • readability • consistency • error analysis • legal reasoning • benchmark • model evaluation
Authors
Van-Truong Le
Abstract
The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints "Incorrect Example" and "Misinterpretation" as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.
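The abstract does not specify how the benchmark scores or error counts are aggregated, so the sketch below is only a rough illustration of the dual-aspect idea: per-article scores on the three dimensions averaged per model, plus a tally of error-typology labels. The record layout, the 1-5 scale, and every name here are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: one entry per (model, article) pair, scored on the
# three benchmark dimensions (assumed 1-5 scale) and labeled with any errors
# found during the qualitative analysis.
records = [
    {"model": "GPT-4o", "article_id": 1, "accuracy": 4, "readability": 5,
     "consistency": 4, "errors": ["Misinterpretation"]},
    {"model": "Grok-1", "article_id": 1, "accuracy": 3, "readability": 5,
     "consistency": 5, "errors": ["Incorrect Example"]},
    # ... one record per (model, article) pair over the 60 articles
]

def benchmark(records):
    """Average each model's scores per dimension and tally error types."""
    per_model, error_counts = {}, Counter()
    for r in records:
        per_model.setdefault(r["model"], []).append(r)
        error_counts.update(r["errors"])
    summary = {
        model: {dim: round(mean(r[dim] for r in rs), 2)
                for dim in ("accuracy", "readability", "consistency")}
        for model, rs in per_model.items()
    }
    return summary, error_counts

scores, errors = benchmark(records)
print(scores)                 # per-model averages on the three dimensions
print(errors.most_common())   # most frequent error types across all outputs
```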