How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
2026-03-03 • Robotics
Robotics • Artificial Intelligence • Computer Vision and Pattern Recognition • Machine Learning
AI summary
The authors address the challenge of teaching robots delicate tasks like peeling fruits and vegetables, where success is hard to measure and define precisely. They propose a two-step learning method: first, a robot learns a basic peeling skill from force data and imitation; second, it refines this skill using human preference feedback. The approach achieves high success rates with relatively little training data and transfers even to unseen types of produce, showing that the method helps robots match human notions of quality on tasks that resist precise quantification.
robotic manipulation • imitation learning • preference learning • reward model • force feedback • zero-shot generalization • data collection • task quality • fruit peeling • policy refinement
Authors
Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel, Jitendra Malik
Abstract
Many essential manipulation tasks, such as food preparation, surgery, and craftsmanship, remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g., how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories, while maintaining over 90% success rates.
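The abstract does not specify the form of the learned reward model, so the following is only a minimal sketch of one common instantiation: a small network trained with a Bradley-Terry pairwise preference loss, with a quantitative task metric (such as measured peel coverage) concatenated to pooled trajectory features. All names, dimensions, and the synthetic data below are illustrative assumptions written in PyTorch, not the authors' implementation.

```python
# Sketch: preference-based reward learning for peel quality.
# Assumptions (not from the paper): pooled per-trajectory features,
# a single scalar quantitative metric, and a Bradley-Terry loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PeelRewardModel(nn.Module):
    """Scores a trajectory summary; higher = better peel quality."""

    def __init__(self, obs_dim: int, metric_dim: int = 1, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + metric_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_feat: torch.Tensor, metric: torch.Tensor) -> torch.Tensor:
        # obs_feat: (B, obs_dim) pooled trajectory features (e.g., force/pose stats)
        # metric:   (B, metric_dim) quantitative scores such as peel coverage
        return self.net(torch.cat([obs_feat, metric], dim=-1)).squeeze(-1)


def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the human-preferred
    trajectory should receive the higher reward."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()


# Synthetic stand-in for a loader of human-labeled preference pairs
# (preferred trajectory A vs. rejected trajectory B), batch size 8.
pairs_loader = [
    (torch.randn(8, 64), torch.rand(8, 1), torch.randn(8, 64), torch.rand(8, 1))
    for _ in range(10)
]

model = PeelRewardModel(obs_dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for obs_a, met_a, obs_b, met_b in pairs_loader:
    loss = preference_loss(model(obs_a, met_a), model(obs_b, met_b))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In a pipeline like the one described, the trained model's scalar reward would then drive preference-based finetuning of the imitation-learned peeling policy, for example by reweighting or selecting among sampled trajectories.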