Diagnosing CFG Interpretation in LLMs

2026-04-22

Artificial Intelligence
AI summary

The authors tested whether large language models (LLMs) can interpret novel, made-up grammar rules and produce outputs that are both syntactically correct and faithful to the intended meaning. They built RoboGrid to probe how well LLMs handle increasing complexity, such as deep nesting and dense expression structure. The results show that while LLMs often get the surface form right, they fail to preserve the deeper semantic structure, especially on highly complex or deeply recursive inputs. Chain-of-thought reasoning helps only marginally, and LLMs appear to lean on familiar keywords rather than genuinely inducing the abstract rules. This points to a weakness in the hierarchical state-tracking that flexible, grammar-agnostic agents require.

Large Language Models · Context-Free Grammar · Syntax · Semantics · Recursion · Chain-of-Thought Reasoning · Symbolic Induction · Hierarchical State-Tracking · Agentic Systems
Authors
Hanqi Li, Lu Chen, Kai Yu
Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.
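To make the evaluation setup concrete, the sketch below shows what checking a model's output against a novel context-free grammar might look like. The grammar here is a hypothetical toy robot-command language invented for illustration (the paper's actual RoboGrid grammar is not reproduced here); the checker reports both syntactic validity and nesting depth, mirroring the recursion-depth stress tests the abstract describes.

```python
# Hypothetical toy grammar in the spirit of the paper's setup (NOT the
# real RoboGrid grammar):
#
#   prog ::= stmt (";" stmt)*
#   stmt ::= "move" | "turn" | "repeat" INT "{" prog "}"
#
# A recursive-descent checker: is a candidate model output syntactically
# valid, and how deeply nested is it?

def tokenize(src):
    return src.replace("{", " { ").replace("}", " } ").replace(";", " ; ").split()

def parse_prog(tokens, i):
    # prog ::= stmt (";" stmt)* ; returns (next index, max nesting depth)
    i, depth = parse_stmt(tokens, i)
    while i < len(tokens) and tokens[i] == ";":
        i, d = parse_stmt(tokens, i + 1)
        depth = max(depth, d)
    return i, depth

def parse_stmt(tokens, i):
    if i < len(tokens) and tokens[i] in ("move", "turn"):
        return i + 1, 0
    if i < len(tokens) and tokens[i] == "repeat":
        if i + 2 >= len(tokens) or not tokens[i + 1].isdigit() or tokens[i + 2] != "{":
            raise SyntaxError(f"malformed repeat at token {i}")
        i, depth = parse_prog(tokens, i + 3)
        if i >= len(tokens) or tokens[i] != "}":
            raise SyntaxError("missing '}'")
        return i + 1, depth + 1  # one level of nesting consumed
    raise SyntaxError(f"unexpected token at {i}")

def check(src):
    """Return (is_valid, nesting_depth) for a candidate model output."""
    tokens = tokenize(src)
    try:
        i, depth = parse_prog(tokens, 0)
        return (i == len(tokens), depth)
    except SyntaxError:
        return (False, 0)
```

For example, `check("repeat 2 { repeat 2 { move } }")` accepts the string at nesting depth 2, while `check("repeat 2 { move")` rejects the unclosed block. Note that a checker like this only covers the syntactic axis; the paper's point is that behavioral and semantic fidelity must be measured separately, since surface-valid outputs can still encode the wrong structure.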