Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions
2026-06-02 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors studied how well large language models (LLMs) can help with fixing consumer devices like phones and computers. They built a benchmark of 991 real repair questions from Reddit, each paired with a technician-written reference solution and translated into Bangla. They found that while LLMs can give some helpful advice, they often make mistakes that could be risky, especially for phone repairs involving board-level hardware diagnosis. All six models tested performed worse when answering in Bangla than in English. Among them, GPT-5.4 gave the best overall results.
Large Language Models, Consumer Device Repair, Diagnostic Reasoning, Cross-lingual Evaluation, Safety in AI, Phone Repair, Data Recovery, Technician Solutions, GPT-5.4
Authors
Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana
Abstract
Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.