Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

2026-06-02 · Artificial Intelligence

AI summary

The authors studied how to make computer models grade beginner C++ programming assignments more like human instructors by using grading rubrics and multiple related tasks during training. They trained transformer models to predict both exact numeric scores and letter-grade categories, adding ways to better match the overall distribution of grades given by teachers. Their experiments showed that combining rubric information and using flexible label formats helped the model's grades align more closely with instructor grading patterns than simpler methods. They found that certain model setups, like fully fine-tuned T5, further improved how well the predicted grades matched the real grade distribution.

transformer models, automated grading, multitask learning, grading rubrics, BART, LoRA adaptation, mean absolute error, grade distribution, T5 model, pairwise pretraining
Authors
Kelsey Rainey, Jesse Roberts
Abstract
This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.
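
The abstract names three ingredients without giving formulas: boundary-based soft labels, a joint numeric-plus-bucket objective, and a distribution-matching term. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. The bucket cutoffs, the linear mass-sharing rule near a cutoff, the use of KL divergence for distribution matching, and the weights `beta` and `gamma` are all illustrative assumptions, as are the function names.

```python
import math

# Hypothetical letter-grade buckets as (label, lower bound) on a 0-100 scale.
# The paper's actual cutoffs are not given in the abstract.
BUCKETS = [("F", 0.0), ("D", 60.0), ("C", 70.0), ("B", 80.0), ("A", 90.0)]


def hard_label(score):
    """One-hot vector for the bucket containing `score` (assumes 0 <= score <= 100)."""
    idx = max(i for i, (_, lo) in enumerate(BUCKETS) if score >= lo)
    return [1.0 if i == idx else 0.0 for i in range(len(BUCKETS))]


def boundary_soft_label(score, width=3.0):
    """Soft label that shares probability mass with a neighbor near a cutoff.

    Assumption: within `width` points of a boundary, mass moves linearly
    to the adjacent bucket, reaching a 50/50 split exactly at the cutoff.
    """
    label = hard_label(score)
    idx = label.index(1.0)
    if idx > 0:  # close to the lower cutoff of this bucket
        dist = score - BUCKETS[idx][1]
        if dist < width:
            share = 0.5 * (1.0 - dist / width)
            label[idx] -= share
            label[idx - 1] += share
    if idx < len(BUCKETS) - 1:  # close to the upper cutoff of this bucket
        dist = BUCKETS[idx + 1][1] - score
        if dist < width:
            share = 0.5 * (1.0 - dist / width)
            label[idx] -= share
            label[idx + 1] += share
    return label


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a (soft) target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over bucket frequencies; used here as the matching term."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def multitask_loss(scores, pred_scores, pred_probs, beta=0.5, gamma=0.1):
    """Combine numeric MAE, bucket cross-entropy, and distribution matching.

    `beta` and `gamma` are illustrative weights, not values from the paper.
    """
    n = len(scores)
    k = len(BUCKETS)
    mae = sum(abs(s - p) for s, p in zip(scores, pred_scores)) / n
    targets = [boundary_soft_label(s) for s in scores]
    ce = sum(cross_entropy(t, p) for t, p in zip(targets, pred_probs)) / n
    # Batch-level bucket frequencies: instructor (empirical) vs. model.
    empirical = [sum(t[j] for t in targets) / n for j in range(k)]
    predicted = [sum(p[j] for p in pred_probs) / n for j in range(k)]
    match = kl_divergence(empirical, predicted)
    return mae + beta * ce + gamma * match
```

Under these assumptions, `boundary_soft_label(90.0)` returns `[0.0, 0.0, 0.0, 0.5, 0.5]`, so a submission sitting exactly on the B/A cutoff is treated as equally plausible in either bucket, while `boundary_soft_label(85.0)` stays one-hot. The matching term compares bucket frequencies across a whole batch rather than per submission, which is one way a model could be nudged toward the instructor's overall grade distribution independently of per-item error.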