D$^3$-MoE: Dual Disentangled Diffusion Mixture-of-Experts for Style-Controllable End-to-End Autonomous Driving

2026-06-03 · Robotics

AI summary

The authors address a common problem in self-driving systems: when a model learns from many human drivers, the different driving styles blend together, which makes the car's behavior bland and sometimes unsafe. They propose a new method, D³-MoE, that separates the choice of driving style from the generation of the driving path, so that several styles can be generated for the same scene and one is then selected by user preference or by an evaluation score. They also separate longitudinal control (speed along the direction of travel) from lateral control (steering), and train the modules that route between these experts without manual labels. In tests on the NAVSIM benchmark, their system scores higher on planning quality than previous methods and produces more varied driving paths. The approach can therefore better mimic different human driving styles while keeping the car's movements physically realistic.

autonomous driving, trajectory modeling, diffusion process, style conditioning, mixture of experts, kinematic safety, self-supervised learning, transformers, multi-modal planning, benchmark evaluation
Authors
Renju Feng, Rukang Wang, Ning Xi, Jianguo Yu, Liping Lu, Pan Zhou, Duanfeng Chu
Abstract
Traditional end-to-end autonomous driving frameworks frequently suffer from the "style-averaging" dilemma when trained on high-variance human demonstrations, yielding homogenized, style-uncontrollable, and even kinematically unsafe policies. To overcome this limitation, we present D$^3$-MoE (Dual Disentangled Diffusion Mixture-of-Experts), which disentangles trajectory modeling along two complementary axes. On the behavioral axis, generation is decoupled from selection: a style-conditioned diffusion process synthesizes multi-style candidate trajectories in parallel within a single scene, allowing a downstream module to select the optimal trajectory based on user preference or an evaluation score. On the physical axis, decoupled longitudinal and lateral routers, trained without manual labels on self-supervised targets derived from orthogonal ground-truth kinematics, activate their respective experts at inference time. These activated experts, architected as Diffusion Transformers (DiT) and equipped with style-conditioned AdaLN and asymmetric lateral-fusion cross-attention, independently predict their corresponding physical state before the predictions are reassembled into a unified, kinematically coherent trajectory. Extensive evaluations on the challenging NAVSIM benchmark demonstrate that D$^3$-MoE achieves state-of-the-art planning performance, reaching 88.2 PDMS and 84.3 EPDMS by default. Moreover, our Best-of-Three ensemble strategy effectively broadens the multi-modal solution space, raising performance to 91.3 PDMS and 87.5 EPDMS. Quantitative and qualitative analyses confirm the framework's advantages in planning quality and style controllability.
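
The two decoupling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the diffusion experts are replaced by hand-written speed and yaw-rate profiles, the three style names are hypothetical labels, `toy_score` is a made-up stand-in for a benchmark score such as PDMS, and the unicycle integration in `reassemble` is an assumption, since the abstract does not say how the longitudinal and lateral states are recombined.

```python
import math
from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading) in the ego frame
Trajectory = List[Pose]


def reassemble(speeds: List[float], yaw_rates: List[float], dt: float = 0.5) -> Trajectory:
    """Fuse a longitudinal profile (speeds, m/s) and a lateral profile
    (yaw rates, rad/s) into one pose sequence.

    ASSUMPTION: plain unicycle integration, chosen only for illustration.
    """
    x = y = heading = 0.0
    poses: Trajectory = []
    for v, w in zip(speeds, yaw_rates):
        heading += w * dt
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        poses.append((x, y, heading))
    return poses


def select_best(
    candidates: Dict[str, Trajectory],
    score: Callable[[Trajectory], float],
) -> Tuple[str, Trajectory]:
    """Selection step, decoupled from generation: keep the highest-scoring candidate."""
    best = max(candidates, key=lambda style: score(candidates[style]))
    return best, candidates[best]


def toy_score(traj: Trajectory) -> float:
    """HYPOTHETICAL scorer: reward forward progress, penalize lateral drift."""
    progress = traj[-1][0]
    drift = max(abs(y) for _, y, _ in traj)
    return progress - 10.0 * drift


# One candidate per (hypothetical) style, all generated for the same scene.
candidates = {
    "conservative": reassemble([2.0] * 4, [0.0] * 4),
    "normal": reassemble([4.0] * 4, [0.0] * 4),
    "aggressive": reassemble([6.0] * 4, [0.3] * 4),
}
style, trajectory = select_best(candidates, toy_score)
print(style, trajectory[-1])
```

With this toy scorer the "normal" candidate is selected, because the "aggressive" one drifts laterally. Replacing `score` with a function that looks up a user's preferred style would give the preference-based selection path instead, without touching the generation code.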