Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation

2026-04-09 · Computation and Language

Computation and Language, Artificial Intelligence
AI summary

The authors study how the lessons drawn from stories differ across cultures and languages. They introduce a task that asks language models to generate story morals, and they compare the outputs with human-written morals collected across 14 language-culture pairs. Frontier models such as GPT-4o and Gemini produce morals that are semantically similar to human responses and that human evaluators prefer, but their outputs vary far less across languages and concentrate on a narrower set of widely shared values. The models approximate the central tendency of human interpretation while missing much of its cultural diversity.

multilingual story moral generation, cultural context, language models, semantic similarity, human preference survey, value categorization, GPT-4o, Gemini, cross-linguistic variation, cultural alignment
Authors
Sophie Wu, Andrew Piper
Abstract
Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the diversity that characterizes human narrative understanding. By framing narrative interpretation as an evaluative task, this work introduces a new approach to studying cultural alignment in language models beyond static benchmarks or knowledge-based tests.
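
As a rough illustration of what "less cross-linguistic variation" could mean in practice, the sketch below computes the mean pairwise cosine distance among embedding vectors of morals written in different languages. It is not the authors' pipeline: the abstract does not specify the embedding model or the variation metric, so the function names, the language labels, and the toy vectors are all hypothetical stand-ins chosen only to make the idea concrete.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_distance(vectors):
    """Average cosine distance (1 - similarity) over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


# Made-up 3-dimensional "embeddings" of one story's moral per language.
# Real embeddings would come from a sentence encoder (an assumption here).
human = {
    "lang_a": [1.0, 0.0, 0.0],
    "lang_b": [0.0, 1.0, 0.0],
    "lang_c": [0.0, 0.0, 1.0],
}
model = {
    "lang_a": [1.0, 0.1, 0.0],
    "lang_b": [1.0, 0.0, 0.1],
    "lang_c": [0.9, 0.1, 0.1],
}

human_spread = mean_pairwise_distance(list(human.values()))
model_spread = mean_pairwise_distance(list(model.values()))

print(f"human spread across languages: {human_spread:.3f}")
print(f"model spread across languages: {model_spread:.3f}")
```

In this toy setup the human vectors point in different directions while the model vectors cluster together, so the model spread comes out smaller. That mirrors, in miniature, the pattern the abstract reports; the numbers themselves carry no empirical meaning.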