Can Commercial LLMs Be Parliamentary Political Companions? Comparing LLM Reasoning Against Romanian Legislative Expuneri de Motive

2026-03-31
Computers and Society
AI summary

The authors tested six large language models (LLMs) to see whether they can explain Romanian Senate law proposals as reliably as the official explanatory documents do. Top models from OpenAI and Anthropic came very close to the official reasoning, while the other models fell notably short. However, all models sometimes produced plausible but incorrect explanations, especially for unusual political proposals. The authors argue that errors compound because the advisory models, the legislators using them, and the evaluators all operate with limited knowledge and data, and they caution that the main risk is not overt bias but gaps in what the models have learned.

large language models, legislative reasoning, Romanian Senate, semantic similarity, principal-agent theory, bounded rationality, confabulation, EU directive transposition, training data coverage
Authors
Iulian Lucău, Adelin-George Voicu
Abstract
This paper evaluates whether commercial large language models (LLMs) can function as reliable political advisory tools by comparing their outputs against official legislative reasoning. Using a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive), we test six LLMs spanning three provider families and multiple capability tiers: GPT-5-mini, GPT-5-chat (OpenAI), Claude Haiku 4.5 (Anthropic), and Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B (Meta). Each model generates predicted rationales evaluated through a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics. We frame the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry. Results reveal a sharp two-tier structure: frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0, while open-weight models cluster a full tier below (Cohen's d larger than 1.4). However, all models exhibit task-dependent confabulation, performing well on standardized legislative templates (e.g., EU directive transpositions) but generating plausible yet unfounded reasoning for politically idiosyncratic proposals. We introduce the concept of cascading bounded rationality to describe how failures compound across bounded principals, agents, and evaluators, and argue that the operative risk for legislators is not stable ideological bias but contextual ignorance shaped by training data coverage.
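The dual evaluation framework pairs LLM-as-Judge semantic scores with programmatic text similarity, and the two-tier gap between frontier and open-weight models is quantified with Cohen's d. A minimal sketch of both ingredients follows; the function names, the bag-of-words cosine metric, and all example numbers are illustrative assumptions, not the paper's actual metrics or data:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity: one simple programmatic
    text-similarity metric (the paper's exact metrics may differ)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def cohens_d(x: list[float], y: list[float]) -> float:
    """Cohen's d with pooled standard deviation, the effect size used
    to compare two groups of model scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    pooled = math.sqrt(((len(x) - 1) * vx + (len(y) - 1) * vy)
                       / (len(x) + len(y) - 2))
    return (mx - my) / pooled

# Illustrative judge scores (not the paper's data): frontier models
# clustered above 4.6/5.0 versus an open-weight tier below.
frontier = [4.7, 4.6, 4.8, 4.65]
open_weight = [3.8, 3.6, 3.9, 3.7]
print(f"Cohen's d = {cohens_d(frontier, open_weight):.2f}")
```

With tightly clustered groups a full tier apart, even this toy data yields d well above the 1.4 threshold the abstract reports, illustrating how large that effect size is.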