This Week In Computer Science Papers

Week beginning 23rd March 2026

Tap a tile to open details. Use the left sidebar to filter by category.

No filters applied
Showing 1–36 of 959
OccAny: Generalized Unconstrained Urban 3D Occupancy
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .
Open 2603.23502v1
MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical…
2026-03-24Computer Vision and Pattern RecognitionArtificial IntelligenceComputation and Languagearxiv
Abstract
Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
Open 2603.23501v1
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Genera…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
Open 2603.23500v1
DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.
Open 2603.23499v1
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Action…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.
Open 2603.23497v1
Estimating Flow Velocity and Vehicle Angle-of-Attack from Non-invasive…
2026-03-24Machine Learningarxiv
Abstract
Accurate estimation of aerodynamic state variables such as freestream velocity and angle of attack (AoA) is important for aerodynamic load prediction, flight control, and model validation. This work presents a non-intrusive method for estimating vehicle velocity and AoA from structural vibration measurements rather than direct flow instrumentation such as pitot tubes. A dense array of piezoelectric sensors mounted on the interior skin of an aeroshell capture vibrations induced by turbulent boundary layer pressure fluctuations, and a convolutional neural network (CNN) is trained to invert these structural responses to recover velocity and AoA. Proof-of-concept is demonstrated through controlled experiments in Sandia's hypersonic wind tunnel spanning zero and nonzero AoA configurations, Mach~5 and Mach~8 conditions, and both constant and continuously varying tunnel operations. The CNN is trained and evaluated using data from 16 wind tunnel runs, with a temporally centered held-out interval within each run used to form training, validation, and test datasets and assess intra-run temporal generalization. Raw CNN predictions exhibit increased variance during continuously varying conditions; a short-window moving-median post-processing step suppresses this variance and improves robustness. After post-processing, the method achieves a mean velocity error relative to the low-pass filtered reference velocity below 2.27~m/s (0.21\%) and a mean AoA error of $0.44^{\circ} (8.25\%)$ on held-out test data from the same experimental campaign, demonstrating feasibility of vibration-based velocity and AoA estimation in a controlled laboratory environment.
Open 2603.23496v1
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically se…
2026-03-24Computer Vision and Pattern RecognitionArtificial IntelligenceMachine Learningarxiv
Abstract
Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.
Open 2603.23495v1
Foveated Diffusion: Efficient Spatially Adaptive Image and Video Genera…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
Open 2603.23491v1
Dynamic Light Spanners in Doubling Metrics
2026-03-24Computational GeometryData Structures and Algorithmsarxiv
Abstract
A $t$-spanner of a point set $X$ in a metric space $(\mathcal{X}, δ)$ is a graph $G$ with vertex set $P$ such that, for any pair of points $u,v \in X$, the distance between $u$ and $v$ in $G$ is at most $t$ times $δ(u,v)$. We study the problem of maintaining a spanner for a dynamic point set $X$ -- that is, when $X$ undergoes a sequence of insertions and deletions -- in a metric space of constant doubling dimension. For any constant $\varepsilon>0$, we maintain a $(1+\varepsilon)$-spanner of $P$ whose total weight remains within a constant factor of the weight of the minimum spanning tree of $X$. Each update (insertion or deletion) can be performed in $\operatorname{poly}(\log Φ)$ time, where $Φ$ denotes the aspect ratio of $X$. Prior to our work, no efficient dynamic algorithm for maintaining a light-weight spanner was known even for point sets in low-dimensional Euclidean space.
Open 2603.23490v1
AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video O…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: a MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and a MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
Open 2603.23489v1
One View Is Enough! Monocular Training for In-the-Wild Novel View Gener…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
Open 2603.23488v1
TETO: Tracking Events with Teacher Observation for Motion Estimation an…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
Open 2603.23487v1
Failure of contextual invariance in gender inference with large languag…
2026-03-24Computation and LanguageArtificial IntelligenceComputers and Societyarxiv
Abstract
Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19--52\% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.
Open 2603.23485v1
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Percepti…
2026-03-24Computer Vision and Pattern RecognitionComputation and Languagearxiv
Abstract
Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
Open 2603.23483v1
ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Acros…
2026-03-24Software EngineeringArtificial Intelligencearxiv
Abstract
Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.
Open 2603.23482v1
VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyo…
2026-03-24RoboticsArtificial IntelligenceComputer Vision and Pattern Recognitionarxiv
Abstract
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
Open 2603.23481v1
Stablecoins as Dry Powder: A Copula-Based Risk Analysis of Cryptocurren…
2026-03-24Computational Engineering, Finance, and Sciencearxiv
Abstract
Stablecoins serve as the fundamental infrastructure for Decentralised Finance (DeFi), acting as the primary bridge between fiat currencies and the digital asset ecosystem. While peg stability is well-documented, the structural role stablecoins play in transmitting systemic risk to the broader market remains under-explored. This study uses copula-based approaches to quantify the transmission of volatility and activity from stablecoin to cryptocurrency markets. We demonstrate in-sample causality across daily, weekly, and monthly horizons. Furthermore, we show that incorporating stablecoin factors significantly reduces Mean Squared Error in cryptocurrency forecasting. Specifically, we link stablecoin volume and upside volatility to broader market volatility, indicating its role as dry powder. Finally, we establish economic value by demonstrating reduced risk in a cryptocurrency volatility targeting model when stablecoin factors are employed.
Open 2603.23480v1
UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionali…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9\% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
Open 2603.23478v1
Index-Based Scheduling for a Resource-Constrained Quantum Switch
2026-03-24Information TheoryNetworking and Internet Architecturearxiv
Abstract
We consider a quantum switch with a finite number of quantum memory registers that aims to serve multipartite entanglement requests among $N$ users. We propose scheduling policies that aim to optimize the average number of requests served per unit time by efficiently utilizing the switch's available memory. To measure the performance of the scheduling policies, we employ the newly introduced metric of age of entanglement establishment (AoEE). We formulate the scheduling problem in a restless multi-armed bandit (RMAB) framework. We show that the scheduling of entanglement requests is indexable. Subsequently, we find a closed-form expression of the Whittle index for all possible request-age pairs. By modeling the Whittle index of each request as its reward and its cardinality as its cost, we formulate the memory-constrained scheduling problem as a $0$-$1$ knapsack problem and solve it via dynamic programming. Furthermore, we consider two low-complexity sequential greedy policies that leverage two different modified Whittle indices.
Open 2603.23476v1
Evidence of political bias in search engines and language models before…
2026-03-24Computers and Societyarxiv
Abstract
Search engines (SEs) and large language models (LLMs) are central to political information access, yet their algorithmic decisions and potential underlying biases remain underexplored. We developed a standardized, privacy-preserving, bot-and-proxy methodology to audit four SEs and two LLMs before the 2024 European Parliament and US presidential elections. We collected answers to approximately 4,360 queries related to elections in five EU countries and 15 US counties, identified political entities and topics in those answers, and mapped them to ideological positions (EU) or issue associations (US). In Europe, SE results disproportionately mentioned far-right entities beyond levels expected from polls, past elections, or media salience. In the US, Google strongly favored topics more important to Republican voters, while other search engines favored issues more relevant to Democrats. LLMs responses were more balanced, although there is evidence of overrepresentation of far-right (and Green) entities. These results show evidence of bias and open important discussions on how even small skews in widely used platforms may influence democratic processes, calling for systematic audits of their outputs.
Open 2603.23474v1
Byzantine-Robust and Differentially Private Federated Optimization unde…
2026-03-24Machine LearningCryptography and Securityarxiv
Abstract
Federated Learning (FL) enables heterogeneous clients to collaboratively train a shared model without centralizing their raw data, offering an inherent level of privacy. However, gradients and model updates can still leak sensitive information, while malicious servers may mount adversarial attacks such as Byzantine manipulation. These vulnerabilities highlight the need to address differential privacy (DP) and Byzantine robustness within a unified framework. Existing approaches, however, often rely on unrealistic assumptions such as bounded gradients, require auxiliary server-side datasets, or fail to provide convergence guarantees. We address these limitations by proposing Byz-Clip21-SGD2M, a new algorithm that integrates robust aggregation with double momentum and carefully designed clipping. We prove high-probability convergence guarantees under standard $L$-smoothness and $σ$-sub-Gaussian gradient noise assumptions, thereby relaxing conditions that dominate prior work. Our analysis recovers state-of-the-art convergence rates in the absence of adversaries and improves utility guarantees under Byzantine and DP settings. Empirical evaluations on CNN and MLP models trained on MNIST further validate the effectiveness of our approach.
Open 2603.23472v1
Regulating AI Agents
2026-03-24Computers and Societyarxiv
Abstract
AI agents -- systems that can independently take actions to pursue complex goals with only limited human oversight -- have entered the mainstream. These systems are now being widely used to produce software, conduct business activities, and automate everyday personal tasks. While AI agents implicate many areas of law, ranging from agency law and contracts to tort liability and labor law, they present particularly pressing questions for the most globally consequential AI regulation: the European Union's AI Act. Promulgated prior to the development and widespread use of AI agents, the EU AI Act faces significant obstacles in confronting the governance challenges arising from this transformative technology, such as performance failures in autonomous task execution, the risk of misuse of agents by malicious actors, and unequal access to the economic opportunities afforded by AI agents. We systematically analyze the EU AI Act's response to these challenges, focusing on both the substantive provisions of the regulation and, crucially, the institutional frameworks that aim to support its implementation. Our analysis of the Act's allocation of monitoring and enforcement responsibilities, reliance on industry self-regulation, and level of government resourcing illustrates how a regulatory framework designed for conventional AI systems can be ill-suited to AI agents. Taken together, our findings suggest that policymakers in the EU and beyond will need to change course, and soon, if they are to effectively govern the next generation of AI technology.
Open 2603.23471v1
ConceptCoder: Improve Code Reasoning via Concept Learning
2026-03-24Software Engineeringarxiv
Abstract
Large language models (LLMs) have shown promising results for software engineering applications, but still struggle with code reasoning tasks such as vulnerability detection (VD). We introduce ConceptCoder, a fine-tuning method that simulates human code inspection: models are trained to first recognize code concepts and then perform reasoning on top of these concepts. In prior work, concepts are extracted by multimodal models or LLMs to explain vision and natural language models. Our work is the first to formulate concepts for code. We define code concepts as human-understandable semantic properties of code and train models to learn such concepts. Our evaluation shows that this approach significantly improves VD accuracy, from 66.32 to 72.15 F1 on average over 9 open-source LLMs. ConceptCoder achieves the best VD performance compared to state-of-the-art (SOTA) baselines, including fine-tuned SOTA open-source LLMs and prompted proprietary models such as GPT-5.2 and Claude-Opus-4.5. Our approach also scales: concepts defined from four types of vulnerabilities benefit general vulnerability datasets with 134 CWEs. We further demonstrate that concept-based fine-tuning generalizes beyond VD and improves branch prediction. We release our code and datasets at https://figshare.com/s/1decab8232c653b44f71.
Open 2603.23470v1
InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
2026-03-24Computer Vision and Pattern RecognitionArtificial Intelligencearxiv
Abstract
Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.
Open 2603.23463v1
RealMaster: Lifting Rendered Scenes into Photorealistic Video
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
Open 2603.23462v1
End-to-End Efficient RL for Linear Bellman Complete MDPs with Determini…
2026-03-24Machine Learningarxiv
Abstract
We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying \emph{linear Bellman completeness} -- a fundamental setting where the Bellman backup of any linear value function remains linear. While statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with \emph{deterministic transitions}, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
Open 2603.23461v1
CSTS: A Canonical Security Telemetry Substrate for AI-Native Cyber Dete…
2026-03-24Cryptography and SecurityMachine Learningarxiv
Abstract
AI-driven cybersecurity systems often fail under cross-environment deployment due to fragmented, event-centric telemetry representations. We introduce the Canonical Security Telemetry Substrate (CSTS), an entity-relational abstraction that enforces identity persistence, typed relationships, and temporal state invariants. Across heterogeneous environments, CSTS improves cross-topology transfer for identity-centric detection and prevents collapse under schema perturbation. For zero-day detection, CSTS isolates semantic orientation instability as a modeling, not schema, phenomenon, clarifying layered portability requirements.
Open 2603.23459v1
SNARE: A TRAP for Rational Players to Solve Byzantine Consensus in the…
2026-03-24Computer Science and Game TheoryDistributed, Parallel, and Cluster Computingarxiv
Abstract
The TRAP protocol solves rational agreement by combining accountable consensus with a one-shot BFTCR finalization phase. We present SNARE (Scalable Nash Agreement via Reward and Exclusion), the adaptation of TRAP to $n=5f{+}1$, and prove $ε$-$(k,t)$-robustness for rational agreement tolerating coalitions up to ${\approx}73\%$ with deposits under $0.5\%$ of the gain. A central finding is that appending a single all-to-all broadcast round with the $4f{+}1$ threshold after predecisions yields $ε$-$(k,t)$-robustness for coalitions up to $3f$ (${\approx}60\%$) without any deposit: we need not model or know the utility function of deviating players, only that they participate in the protocol. These players can be \emph{deceitful} (arbitrary unknown utility), not just rational, and the finalization structure prevents disagreement regardless of their motivation. This observation is protocol-agnostic, applies to any $5f{+}1$ protocol at the cost of one message delay that runs concurrently with the next view, and does not require commit-reveal mechanisms. Above $60\%$, the full baiting mechanism with deposits under $0.5\%$ extends tolerance to ${\approx}73\%$. A second finding is that valid-candidacy, the property preventing reward front-running, holds unconditionally regardless of the quorum threshold, removing both the $n>2(k{+}t)$ and $n>\frac{3}{2}k{+}3t$ constraints from the original TRAP. This retroactively extends the $3f{+}1$ bound from $C<n/2$ to $C<5n/9$. The binding constraint in both models is the winner consensus operating on $2f$ residual players after excluding $3f{+}1$ detected equivocators. We explore avenues for relaxing this limit.
Open 2603.23458v1
DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object De…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
Open 2603.23455v1
Code Review Agent Benchmark
2026-03-24Software EngineeringArtificial Intelligencearxiv
Abstract
Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and generate huge volumes of code automatically -- the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases -- the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically given a pull-request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can asses the reviewing capability of the code review agents. Our evaluation framework is used to evaluate the state of the art today -- the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews -- given a human review of a pull request instance we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews -- indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not the least, the agent generated tests from our data-set act as a held out test-suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents -- remains to be investigated.
Open 2603.23448v1
3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City…
2026-03-24Computer Vision and Pattern RecognitionArtificial Intelligencearxiv
Abstract
While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.
Open 2603.23447v1
MRATTS: An MR-Based Acupoint Therapy Training System with Real-Time Acu…
2026-03-24Human-Computer InteractionMultimediaarxiv
Abstract
Acupoint therapy is a core therapeutic method of Traditional Chinese Medicine (TCM), and it requires a high level of expertise and skills to detect acupoints and perform acupuncture and moxibustion. Existing mixed reality (MR)-based training methods often fall short in accurate real-time detection and visualization of acupoints on the hand, limb, or torso of a real person and do not support various techniques of acupuncture and moxibustion. Moreover, evaluation standards and visual guidance with fine details for each step during MR-based training are typically missing. To this end, we propose the MR-based TCM Acupoint Therapy Teaching System (MRATTS)--an MR-based acupoint therapy teaching and training framework. MRATTS is based on a real-time hand, limb, and torso acupoint detection method to accurately track and visualize acupoints on real patients through MR. On top of that, in collaboration with an experienced acupoint therapist, we design a practice method with interactive visual guidance for various acupoint therapy techniques that simulate acupressure, acupuncture (insertion, lifting-thrusting, and twisting), and moxibustion (mild, sparrow-pecking, and whirling). A set of TCM theory-based evaluation standards is formulated within MRATTS to enable the scoring and visualization of the accuracy and proficiency of acupoint therapy. The effectiveness and usefulness of MRATTS are evaluated through a controlled user study and expert feedback. Results of the study indicate that the MRATTS group shows clear improvements in understanding 3D locations of acupoints and proficiency in acupoint therapy compared to control groups.
Open 2603.23445v1
Evaluating LLM-Based Test Generation Under Software Evolution
2026-03-24Software EngineeringArtificial Intelligencearxiv
Abstract
Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.
Open 2603.23443v1
MuSe: a Mutation Testing Plugin for the Remix IDE
2026-03-24Software Engineeringarxiv
Abstract
Mutation testing is a technique to assess the effectiveness of test suites by introducing artificial faults into programs. Although mutation testing plugins are available for many platforms and languages, none is currently available for Remix-IDE, the most widely used Integrated Development Environment for the entire contract development journey, used by users of all knowledge levels, and serves as a learning lab for teaching and experimenting with Ethereum. The quality and security of smart contracts are crucial in blockchain systems, as even minor issues can result in substantial financial losses. This paper proposes MuSe, a mutation testing plugin for the Remix-IDE. MuSe includes traditional, Solidity-specific, and security-oriented mutation operators. Its integration into the Remix-IDE eliminates the need for additional setup and lowers the entry barrier. As a result, developers and researchers can immediately leverage mutation testing to assess the effectiveness of their test suites and identify potential issues in smart contracts. We provide a demo video showing MuSe: https://www.youtube.com/watch?v=MIFk9exTDu0 and its repository: https://github.com/GerardoIuliano/MuSe-Remix-Plugin.
Open 2603.23441v1
SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seism…
2026-03-24Computer Vision and Pattern Recognitionarxiv
Abstract
Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce \textbf{SIGMA}, a new physics-based dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney mask for detection and (ii) paired degraded and ground-truth image for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.
Open 2603.23439v1
Targeted Adversarial Traffic Generation : Black-box Approach to Evade I…
2026-03-24Cryptography and SecurityArtificial Intelligencearxiv
Abstract
The integration of machine learning (ML) algorithms into Internet of Things (IoT) applications has introduced significant advantages alongside vulnerabilities to adversarial attacks, especially within IoT-based intrusion detection systems (IDS). While theoretical adversarial attacks have been extensively studied, practical implementation constraints have often been overlooked. This research addresses this gap by evaluating the feasibility of evasion attacks on IoT network-based IDSs, employing a novel black-box adversarial attack. Our study aims to bridge theoretical vulnerabilities with real-world applicability, enhancing understanding and defense against sophisticated threats in modern IoT ecosystems. Additionally, we propose a defense scheme tailored to mitigate the impact of evasion attacks, thereby reinforcing the resilience of ML-based IDSs. Our findings demonstrate successful evasion attacks against IDSs, underscoring their susceptibility to advanced techniques. In contrast, we proposed a defense mechanism that exhibits robust performance by effectively detecting the majority of adversarial traffic, showcasing promising outcomes compared to current state-of-the-art defenses. By addressing these critical cybersecurity challenges, our research contributes to advancing IoT security and provides insights for developing more resilient IDS.
Open 2603.23438v1