Differentiable Zero-One Loss via Hypersimplex Projections

2026-02-26 • Machine Learning

AI summary

The authors developed a new way to approximate the zero-one loss, a key but tricky measure for classification that can't be used directly with gradient-based learning because it's not smooth. They created a smooth method called Soft-Binary-Argmax that works by projecting onto a special geometric shape, making the function easier to optimize. This method allows for better integration into machine learning models, especially improving how well they learn with large batches of data by enforcing consistency in the output. Their work helps reduce common performance issues seen when training models with large amounts of data at once.

Keywords: zero-one loss, differentiable optimization, hypersimplex, Soft-Binary-Argmax, Jacobian, binary classification, multiclass classification, large-batch training, inductive bias, gradient-based optimization

Authors

Camilo Gomez, Pengyang Wang, Liansheng Tang

Abstract

Recent advances in machine learning have emphasized the integration of structured optimization components into end-to-end differentiable models, enabling richer inductive biases and tighter alignment with task-specific objectives. In this work, we introduce a novel differentiable approximation to the zero-one loss, long considered the gold standard for classification performance yet incompatible with gradient-based optimization due to its non-differentiability. Our method constructs a smooth, order-preserving projection onto the (n,k)-dimensional hypersimplex through a constrained optimization framework, leading to a new operator we term Soft-Binary-Argmax. After deriving its mathematical properties, we show how its Jacobian can be efficiently computed and integrated into binary and multiclass learning systems. Empirically, our approach achieves significant improvements in generalization under large-batch training by imposing geometric consistency constraints on the output logits, thereby narrowing the performance gap traditionally observed in that regime.
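To make the geometry concrete, the sketch below computes a plain Euclidean projection onto the hypersimplex, taken here to be the set of vectors x in [0,1]^n whose entries sum to k. This is an illustration under stated assumptions and not the authors' Soft-Binary-Argmax: the abstract does not specify the constrained problem or how smoothness is obtained, and the unregularized Euclidean projection shown here is only piecewise linear. The function name `project_onto_hypersimplex` and the bisection scheme are hypothetical choices made for this example.

```python
def project_onto_hypersimplex(z, k, iters=100):
    """Euclidean projection of z onto {x in [0,1]^n : sum(x) = k}.

    Assumption: this is the standard unregularized projection, used only
    to illustrate the geometry; it is not the operator from the paper.
    """
    n = len(z)
    if not 0 <= k <= n:
        raise ValueError("k must lie in [0, n]")

    # The projection has the form x_i = clip(z_i - tau, 0, 1) for a scalar
    # shift tau. The sum of the clipped values is non-increasing in tau,
    # so tau can be found by bisection.
    lo = min(z) - 1.0  # every coordinate clips to 1, so the sum is n >= k
    hi = max(z)        # every coordinate clips to 0, so the sum is 0 <= k
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        total = sum(min(1.0, max(0.0, v - tau)) for v in z)
        if total > k:
            lo = tau   # sum too large: shift further down
        else:
            hi = tau   # sum small enough: try a smaller shift
    tau = 0.5 * (lo + hi)
    return [min(1.0, max(0.0, v - tau)) for v in z]


# Example: four logits, with the projected entries constrained to sum to k = 2.
logits = [2.0, 0.5, -1.0, 0.3]
projected = project_onto_hypersimplex(logits, 2)
print(projected)  # approximately [1.0, 0.6, 0.4, 0.0]
```

The output keeps the ordering of the input logits, because clipping a uniformly shifted vector is a monotone map. That is one sense in which such a projection is order-preserving, and the sum constraint is what ties the outputs of a batch together.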
