Using Learning Progressions to Guide AI Feedback for Science Learning

2026-03-03

Computation and Language
AI summary

The authors compared two ways of generating AI feedback on middle school students' written chemistry explanations: one guided by expert-authored rubrics and the other guided by rubrics automatically derived from learning progressions (LPs). Feedback quality was rated on clarity, accuracy, relevance, engagement and motivation, and reflectiveness. Human reviewers found no significant difference in feedback quality between the two methods, suggesting that LP-derived rubrics could be a useful, less labor-intensive alternative to expert-authored rubrics for AI feedback.

Generative AI, Formative feedback, Rubric, Learning progressions, AI-generated feedback, Middle school chemistry, Feedback quality, Inter-rater reliability
Authors
Xin Xia, Nejla Yuruk, Yun Wang, Xiaoming Zhai
Abstract
Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LPs) provide a theoretically grounded representation of students' developing understanding and may offer an alternative. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback on written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by an expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values of .66 to .88 for estimable dimensions. Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as a scalable, less labor-intensive alternative to expert-authored task rubrics for guiding AI-generated feedback.
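
As a rough illustration of the statistics reported in the abstract, the sketch below shows how percent agreement, Cohen's kappa, and a paired t-test are typically computed in Python with NumPy, scikit-learn, and SciPy. This is not the authors' analysis code; all scores are randomly generated placeholders standing in for the coders' ratings and the two pipelines' feedback-quality scores.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical ratings (0/1/2) from two human coders for 207 feedback samples.
coder_a = rng.integers(0, 3, size=207)
# In this toy setup, coder B agrees with coder A about 90% of the time.
coder_b = np.where(rng.random(207) < 0.9, coder_a, rng.integers(0, 3, size=207))

percent_agreement = np.mean(coder_a == coder_b)   # raw agreement
kappa = cohen_kappa_score(coder_a, coder_b)       # chance-corrected agreement
print(f"agreement = {percent_agreement:.1%}, kappa = {kappa:.2f}")

# Hypothetical feedback-quality scores for the same students under each
# pipeline; the samples are paired because both pipelines respond to the
# same 207 student explanations.
expert_scores = rng.normal(loc=2.0, scale=0.5, size=207)
lp_scores = expert_scores + rng.normal(loc=0.0, scale=0.3, size=207)

t_stat, p_value = ttest_rel(expert_scores, lp_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

A paired test (rather than an independent-samples test) fits this design because the two pipelines are evaluated on the same set of student responses, making the two score samples dependent.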