AI translation of literary texts is "fine", but readers still prefer human translations

2026-06-24

Computation and Language
AI summary

The authors studied how readers experience AI-generated translations of novels compared with published human translations. They asked 15 avid readers to evaluate excerpts from recent novels originally written in French, Polish, and Japanese, each translated into English by both a human translator and an AI pipeline. Readers generally found the AI translations acceptable but preferred the human ones for their ease, clarity, and immersiveness. Readers could not reliably tell which translation was AI-made, and they tended to prefer the version they believed to be human. The authors also found that automatic evaluation metrics, including LLM-as-a-judge approaches, do not match readers’ actual preferences, and they have released their dataset for further research.

machine translation, human translation, large language model, literary translation, reader evaluation, immersive reading, automatic metrics, translation quality, dataset, AI in literature
Authors
Yves Ferstler, Adam Podoxin, Ty Brassington, Roman Grundkiewicz, Maite Taboada, Marzena Karpinska
Abstract
AI translation of literary works is increasingly common. While the content may be rendered adequately, we do not know enough about how readers experience it in terms of immersiveness and literary effect, aspects poorly captured by automatic machine translation metrics or human evaluation targeting fluency and adequacy. We ask 15 avid readers to compare recently published human translations (HT) to machine translations (MT) generated with an agentic large language model (LLM)-based pipeline, for 15 recent novels written in French, Polish, and Japanese and translated into English. Readers evaluate approximately 8K-word excerpts in two conditions: immersive reading of the whole excerpt (30 comparisons) and close reading of 386 aligned HT-MT chunk pairs (772 comparisons), with two readers per book and alternating order of presentation. Overall, readers find MT "fine", but prefer HT (slightly at excerpt-level 19/30, more clearly at chunk-level 522/772) for its ease, clarity, and immersive nature. Readers' highlights show that MT's quality varies more within one book than HT's does. Crucially, readers cannot reliably tell the two apart (17/30 guess correctly) and tend to prefer the version they believe to be human. Automatic metrics, including LLM-as-a-judge approaches, fail to recover reader preferences and favor MT. We release LAIT (Literary AI Translation), a reader-centered evaluation dataset with 1K reader comments, 2K judgments and preference ratings, and 7.2K span-level annotations, along with our evaluation protocol and supporting interface.
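
To make the reported counts easier to read, the following is a minimal sketch that converts them to proportions and compares each against a 50/50 chance baseline with an exact two-sided binomial test. This is not the authors' analysis: the function name `binom_two_sided_p` and the labels are illustrative, and the test assumes every comparison is independent, which the study design (two readers per book, many chunks per excerpt) does not guarantee. Treat the p-values as a rough orientation only.

```python
from math import comb


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value against chance (p = 0.5).

    Hypothetical helper for illustration; assumes independent trials.
    """
    # With p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the tail at least as extreme as the observation.
    tail_start = max(k, n - k)
    upper_tail = sum(comb(n, i) for i in range(tail_start, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Counts taken from the abstract: (successes, total comparisons).
counts = {
    "excerpt-level, HT preferred": (19, 30),
    "chunk-level, HT preferred": (522, 772),
    "HT/MT identity guessed correctly": (17, 30),
}

for label, (k, n) in counts.items():
    p_value = binom_two_sided_p(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, two-sided p = {p_value:.3g}")
```

Under this simplifying independence assumption, the excerpt-level preference (19/30) and the identification rate (17/30) are not distinguishable from chance at conventional thresholds, while the chunk-level preference (522/772) is, which matches the abstract's wording of "slightly" versus "more clearly".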