Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

2026-04-28
Artificial Intelligence

AI summary

The authors study how to manage city-wide fleets of electric ride-hailing vehicles, focusing on when and where to dispatch and recharge vehicles while respecting limited charger capacities and power-grid constraints. They model this as a decision process with mixed discrete and continuous actions, in which a learned policy proposes high-level intentions that are then projected onto actions strictly obeying charging and power limits. To handle uncertain demand and travel times, the authors use a robust reinforcement learning method that accounts for spatial correlations in the data. Testing on real NYC taxi data shows their approach earns more profit than several existing algorithms without overloading the power grid.

electric-vehicle (EV) fleets · semi-Markov decision process (semi-MDP) · mixed-integer linear program (MILP) · reinforcement learning (RL) · Soft Actor-Critic (SAC) · Graph Convolutional Network (GCN) · Wasserstein ambiguity set · spatial correlations · feeder limits · ride-hailing dispatch
Authors
An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui
Abstract
We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete choices among serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shift, we optimize a Soft Actor--Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich--Rubinstein dual, a projected subgradient inner loop, and a primal--dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that our method, PD--RSAC, achieves the highest net profit, reaching \$1.22M, compared with \$0.58M--\$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines (Greedy, SAC, MAPPO, and MADDPG), while maintaining zero feeder-limit violations.