This Week In Computer Science Papers
Week beginning 20th April 2026
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Awar…
2026-04-22 · Computation and Language · Artificial Intelligence · Sound · arXiv
Abstract
Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.
Open → 2604.20842v1
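The pairwise pipeline described above reduces to a simple win-rate aggregation against a fixed baseline. The sketch below shows the shape of such a loop; the placeholder judge (and its length heuristic) is invented for illustration and stands in for the paper's LALM-based judge.

```python
# Sketch of a pairwise-comparison evaluation loop: each candidate
# response is judged against a fixed baseline, and quality is reported
# as a win rate rather than an absolute score. toy_judge is a stand-in;
# in the paper an LALM-based judge plays this role.

def toy_judge(candidate: str, baseline: str) -> str:
    """Hypothetical judge: prefers the longer response (placeholder logic)."""
    if len(candidate) > len(baseline):
        return "candidate"
    if len(candidate) < len(baseline):
        return "baseline"
    return "tie"

def win_rate(candidates, baselines, judge):
    """Fraction of comparisons the candidate wins; ties count as half."""
    score = 0.0
    for cand, base in zip(candidates, baselines):
        verdict = judge(cand, base)
        if verdict == "candidate":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
    return score / len(candidates)

rate = win_rate(["aaaa", "bb", "ccc"], ["aa", "bbb", "ccc"], toy_judge)
print(rate)  # one win, one loss, one tie -> 0.5
```

Framing the metric as relative preference means any judge that is consistent on pairs yields a stable ranking, which is the subjectivity mitigation the abstract describes.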
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Vi…
2026-04-22 · Computer Vision and Pattern Recognition · arXiv
Abstract
Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.
Open → 2604.20841v1
Dynamic Construction of the Lovász Local Lemma
2026-04-22 · Data Structures and Algorithms · arXiv
Abstract
This paper proves that a wide class of local search algorithms extend as-is to the fully dynamic setting with an adaptive adversary, achieving an amortized $\tilde{O}(1)$ number of local-search steps per update. A breakthrough by Moser (2009) introduced the witness-tree and entropy compression techniques for analyzing local resampling processes for the Lovász Local Lemma. These methods have since been generalized and expanded to analyze a wide variety of local search algorithms that can efficiently find solutions to many important local constraint satisfaction problems. These algorithms either extend a partial valid assignment and backtrack by unassigning variables when constraints become violated, or they iteratively fix violated constraints by resampling their variables. These local resampling or backtracking procedures are incredibly flexible, practical, and simple to specify and implement. Yet they can be shown to be extremely efficient on static instances, typically performing only a (sub)linear number of fixing steps. The main technical challenge lies in proving conditions that guarantee such rapid convergence. This paper extends these convergence results to fully dynamic settings, where an adaptive adversary may add or remove constraints. We prove that applying the same simple local search procedures to fix old or newly introduced violations leads to a total number of resampling steps near-linear in the number of adversarial updates. Our result is very general and yields several immediate corollaries. For example, letting $\Delta$ denote the maximum degree, for a constant $\epsilon$ and $\Delta = \text{poly}(\log n)$, we can maintain a $(1+\epsilon)\Delta$-edge coloring in $\text{poly}(\log n)$ amortized update time against an adaptive adversary. The prior work for this regime has running time exponential in $\sqrt{\log n}$ [Christiansen, SODA '26].
Open → 2604.20836v1
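The resampling procedure this analysis builds on can be sketched for the static k-CNF case. This is the classic Moser-style loop (find a violated clause, resample its variables uniformly), not the paper's dynamic algorithm; the clause encoding and the instance below are invented for illustration.

```python
import random

# Minimal sketch of the static resampling procedure underlying the
# Moser-Tardos analysis: while some clause is violated, resample its
# variables uniformly at random. Clauses are lists of signed literals:
# +i / -i means variable i must be True / False to satisfy the literal.

def violated(clause, assignment):
    return not any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def resample_solve(num_vars, clauses, rng):
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    steps = 0
    while True:
        bad = [c for c in clauses if violated(c, assignment)]
        if not bad:
            return assignment, steps
        for lit in bad[0]:                     # fix one violated clause
            assignment[abs(lit)] = rng.random() < 0.5
        steps += 1

rng = random.Random(0)
clauses = [[1, 2], [-1, 3], [-2, -3], [2, 3]]  # a satisfiable toy instance
sol, steps = resample_solve(3, clauses, rng)
print(all(not violated(c, sol) for c in clauses))  # True
```

The entropy-compression argument bounds `steps`; the paper's contribution is showing the same loop stays efficient when an adversary keeps adding and removing clauses.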
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer f…
2026-04-22 · Computation and Language · arXiv
Abstract
Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize to contribute to the improved transferability.
Open → 2604.20835v1
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comp…
2026-04-22 · Robotics · arXiv
Abstract
Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
Open → 2604.20834v1
AVISE: Framework for Evaluating the Security of AI Systems
2026-04-22 · Cryptography and Security · Artificial Intelligence · Computation and Language · arXiv
Abstract
As artificial intelligence (AI) systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in and evaluating the security of AI systems and models. As a demonstration of the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases and an Evaluation Language Model (ELM) that determines whether each test case was able to jailbreak the target model, achieving 92% accuracy, an F1-score of 0.91, and a Matthews correlation coefficient of 0.83. We evaluate nine recently released language models of diverse sizes with the SET and find that all are vulnerable to the augmented Red Queen attack to varying degrees. AVISE provides researchers and industry practitioners with an extensible foundation for developing and deploying automated SETs, offering a concrete step toward more rigorous and reproducible AI security evaluation.
Open → 2604.20833v1
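The reported judge-quality figures (accuracy, F1, Matthews correlation coefficient) all derive from a binary confusion matrix. As a worked example, with hypothetical counts rather than the paper's data:

```python
import math

# How accuracy, F1, and MCC relate to a binary confusion matrix.
# The counts below are made up, chosen only to exercise the formulas.

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, f1, mcc

acc, f1, mcc = metrics(tp=45, fp=5, fn=5, tn=45)
print(acc, round(f1, 2), round(mcc, 2))  # 0.9 0.9 0.8
```

MCC is the most informative of the three under class imbalance, which is presumably why it is reported alongside accuracy and F1.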
An Analysis of Attack Vectors Against FIDO2 Authentication
2026-04-22 · Cryptography and Security · arXiv
Abstract
Phishing attacks remain one of the most prevalent threats to online security, with the Anti-Phishing Working Group reporting over 890,000 attacks in Q3 2025 alone. Traditional password-based authentication is particularly vulnerable to such attacks, prompting the development of more secure alternatives. This paper examines passkeys, also known as FIDO2, which claim to provide phishing-resistant authentication through asymmetric cryptography. In this approach, a private key is stored on a user's device, the authenticator, while the server stores the corresponding public key. During authentication, the server generates a challenge that the user signs with the private key; the server then verifies the signature and establishes a session. We present passkey workflows and review state-of-the-art attack vectors from related work alongside newly identified approaches. Two attacks are implemented and evaluated: the Infected Authenticator attack, which generates attacker-known keys on a corrupted authenticator, and the Authenticator Deception attack, which spoofs a target website by modifying the browser's certificate authority store, installing a valid certificate, and intercepting user traffic. An attacker relays a legitimate challenge from the real server to a user, who signs it, allowing the attacker to authenticate as the victim. Our results demonstrate that successful attacks on passkeys require substantial effort and resources. The claim that passkeys are phishing-resistant largely holds true, significantly raising the bar compared to traditional password-based authentication.
Open → 2604.20826v1
FedSIR: Spectral Client Identification and Relabeling for Federated Lea…
2026-04-22 · Machine Learning · Artificial Intelligence · Computer Vision and Pattern Recognition · arXiv
Abstract
Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.
Open → 2604.20825v1
Closing the Domain Gap in Biomedical Imaging by In-Context Control Samp…
2026-04-22 · Machine Learning · arXiv
Abstract
The central problem in biomedical imaging is batch effects: systematic technical variations unrelated to the biological signal of interest. These batch effects critically undermine experimental reproducibility and are the primary cause of failure of deep learning systems on new experimental batches, preventing their practical use in the real world. Despite years of research, no method has succeeded in closing this performance gap for deep learning models. We propose Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), a meta-learning adaptation method that exploits negative control samples. Such unperturbed reference images are present in every experimental batch by design and serve as stable context for adaptation. We validate our novel method on Mechanism-of-Action (MoA) classification, a crucial task for drug discovery, on the large-scale JUMP-CP dataset. The accuracy of standard ResNets drops from 0.939 $\pm$ 0.005 on the training domain to 0.862 $\pm$ 0.060 on data from new experimental batches. Foundation models, even after Typical Variation Normalization, fail to close this gap. We are the first to show that meta-learning approaches close the domain gap, achieving 0.935 $\pm$ 0.018. If the new experimental batches exhibit strong domain shifts, such as being generated in a different lab, meta-learning approaches can be stabilized with control samples, which are always available in biomedical experiments. Our work shows that batch effects in bioimaging data can be effectively neutralized through principled in-context adaptation, making deep learning models practically usable and efficient.
Open → 2604.20824v1
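One way to read the method name is normalization statistics re-estimated from the controls of each new batch. The sketch below illustrates only that basic mechanism, with made-up feature values; the actual CS-ARM-BN procedure is a meta-learned batch-norm adaptation described in the paper.

```python
import math

# Sketch of batch-effect correction via statistics taken from negative
# controls: compute mean/std over the control samples of a new
# experimental batch, then standardize every feature in that batch with
# those statistics. A hand-rolled stand-in for the batch-norm adaptation
# idea; all numbers below are invented.

def control_stats(controls):
    mean = sum(controls) / len(controls)
    var = sum((x - mean) ** 2 for x in controls) / len(controls)
    return mean, math.sqrt(var + 1e-8)

def normalize(batch, mean, std):
    return [(x - mean) / std for x in batch]

controls = [10.0, 12.0, 11.0, 13.0]   # unperturbed reference wells
samples = [15.0, 9.0, 20.0]           # perturbed wells, same batch
mean, std = control_stats(controls)
print([round(x, 2) for x in normalize(samples, mean, std)])
```

Because the controls carry no biological signal of interest, any shift in their statistics is by definition a batch effect, which is what makes them a stable context for adaptation.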
From Meme to Method: Rethinking Animal Adoption Platforms through the C…
2026-04-22 · Human-Computer Interaction · arXiv
Abstract
The internet folklore of the Cat Distribution System (CDS) humorously suggests that cats are "assigned" to people rather than intentionally sought. Beyond its playful origins, CDS reflects a culturally resonant way people perceive and engage in adoption, and this user context can guide the redesign and improvement of adoption systems. In the Philippines, where an estimated 13.11 million stray cats and dogs place the country sixth worldwide in overpopulation, this framing offers a novel way to rethink adoption platforms. We developed a prototype application inspired by CDS principles, focusing on features such as algorithmic matchmaking, community reporting, and proximity-based discovery. An initial evaluation with potential users (n=35) indicated that the system was positively received for its ease of use and its alignment with users' intuitive expectations, though participants highlighted areas for improvement in transparency of matchmaking and owner-adopter communication. The findings suggest that culturally embedded metaphors like CDS can shape mental models, making adoption processes feel more serendipitous and less transactional.
Open → 2604.20823v1
Global Offshore Wind Infrastructure: Deployment and Operational Dynamic…
2026-04-22 · Computer Vision and Pattern Recognition · Machine Learning · arXiv
Abstract
The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation-based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, totaling 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis-ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.
Open → 2604.20822v1
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexib…
2026-04-22 · Machine Learning · Distributed, Parallel, and Cluster Computing · arXiv
Abstract
The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
Open → 2604.20819v1
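The exact-recomposition property the framework relies on can be illustrated with a streaming softmax over key/value blocks, which reproduces full attention up to floating-point error. This sketch shows only that general principle, not the cyclic-quorum-set decomposition or its scheduling; all tensors below are toy values.

```python
import math

# Exact attention computed over key/value blocks: process keys in
# chunks while carrying a running max and normalizer, and the
# recombined result equals full-sequence attention. This is the kind of
# exact decomposition (no approximation error) the abstract claims for
# its CQS-based scheme.

def attend_full(q, keys, values):
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(dim)]

def attend_blocked(q, keys, values, block):
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for i in range(0, len(keys), block):
        kb, vb = keys[i:i + block], values[i:i + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)        # rescale the running state
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
full = attend_full(q, keys, values)
blocked = attend_blocked(q, keys, values, block=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(full, blocked)))  # True
```

Because each block's contribution is self-contained apart from three scalars of running state, the blocks become independently schedulable tasks, which is the property Stream-CQSA exploits for memory-adaptive execution.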
Convergent Evolution: How Different Language Models Learn Similar Numbe…
2026-04-22 · Computation and Language · Artificial Intelligence · Machine Learning · arXiv
Abstract
Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
Open → 2604.20817v1
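A period-$T$ spike in the Fourier domain is easy to see on a toy feature. Below, a synthetic period-10 signal over the number line (a stand-in for a learned number feature, not the models' actual embeddings) is run through a naive DFT and its dominant bin recovered.

```python
import math

# A feature that is periodic in the number line with period 10 shows a
# spike in the Fourier domain at frequency 1/10. Here the toy feature
# f(n) = cos(2*pi*n/10) over n = 0..99 is transformed with a naive DFT;
# the dominant non-DC bin is k = 100/10 = 10. Real embeddings mix many
# such components, but the diagnostic is the same.

N = 100
signal = [math.cos(2 * math.pi * n / 10) for n in range(N)]

def dft_magnitude(x, k):
    n_total = len(x)
    re = sum(v * math.cos(-2 * math.pi * k * n / n_total)
             for n, v in enumerate(x))
    im = sum(v * math.sin(-2 * math.pi * k * n / n_total)
             for n, v in enumerate(x))
    return math.hypot(re, im)

mags = [dft_magnitude(signal, k) for k in range(1, N // 2 + 1)]
peak_bin = 1 + mags.index(max(mags))
print(peak_bin)  # 10, i.e. frequency 1/10 -> period 10
```

The paper's point is that a clean spike like this is necessary but not sufficient: a feature can be Fourier-sparse without the mod-10 classes being linearly separable in the embedding space.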
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Cont…
2026-04-22 · Machine Learning · Computer Vision and Pattern Recognition · arXiv
Abstract
Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
Open → 2604.20816v1
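The Pareto front the model is trained to approximate is the set of non-dominated reward trade-offs. A minimal sketch with invented (r1, r2) pairs, standing in for e.g. (prompt adherence, source fidelity):

```python
# The Pareto front over candidate outcomes: a point is kept iff no
# other point is at least as good on both rewards and strictly better
# on one. The (r1, r2) pairs below are toy values.

def dominated(p, q):
    """True if q dominates p (q is at least as good everywhere, better somewhere)."""
    return q[0] >= p[0] and q[1] >= p[1] and q != p

def pareto_front(points):
    return [p for p in points if not any(dominated(p, q) for q in points)]

points = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.8), (0.5, 0.5), (0.2, 0.3)]
print(sorted(pareto_front(points)))  # [(0.4, 0.8), (0.7, 0.6), (0.9, 0.2)]
```

Early scalarization with a fixed weight selects one point on this front at training time; conditioning on the weight instead lets inference slide along the whole front.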
Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss W…
2026-04-22 · Computer Vision and Pattern Recognition · arXiv
Abstract
Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.
Open → 2604.20813v1
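Character Error Rate, the headline metric above, is Levenshtein edit distance normalized by reference length. A minimal sketch with placeholder strings (not Ge'ez text):

```python
# Character Error Rate (CER) as reported in the abstract: Levenshtein
# edit distance between prediction and reference, divided by the
# reference length. Computed row by row with a rolling DP array.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return edit_distance(prediction, reference) / len(reference)

print(round(cer("kitten", "sitting"), 4))  # 3 edits / 7 chars -> 0.4286
```

A CER of 0.22% on the test set thus corresponds to roughly one wrong character per 450 reference characters.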
Diagnosing CFG Interpretation in LLMs
2026-04-22 · Artificial Intelligence · arXiv
Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.
Open → 2604.20811v1
DNA storage approaching the information-theoretic ceiling
2026-04-22 · Information Theory · arXiv
Abstract
Synthetic DNA approaches 227.5 exabytes per gram of storage density with stability over millennial timescales. Realising this capacity requires error-correction codes that recover data from substantial synthesis and sequencing errors. Existing codecs convert noisy sequencer output into discrete base calls before error correction, discarding probabilistic information about which positions are reliable. Here we present a coding scheme that retains the sequencer's per-position posterior distributions through an integrated decoder of profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding. On the DT4DDS channel simulator, the codec recovers 155.8 and 25.9 exabytes per gram of dsDNA under high- and low-fidelity conditions, exceeding the highest prior-art density on each channel by 11 and 52 percent. Under a single-encode-then-degrade protocol mapped to depurination kinetics at 25 °C in the dry state, the codec projects 282 years of decodable storage at 17.1 exabytes per gram. These results place DNA storage density within reach of the Shannon bound of the underlying channel.
Open → 2604.20810v1
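Log-product fusion across reads can be sketched as summing per-position log-posteriors over the four bases and renormalizing, instead of majority-voting over hard base calls. The read posteriors below are invented; the paper's decoder additionally couples this step with pHMM alignment and ordered-statistics decoding.

```python
import math

# Log-product fusion: per-position posteriors over the four bases from
# independent reads are combined by summing log-probabilities and
# renormalizing (with a max-subtraction for numerical stability). This
# retains the sequencer's soft information that hard base-calling
# discards. Posterior values are made up for illustration.

BASES = "ACGT"

def fuse(posteriors):
    """posteriors: list of dicts base -> probability, one per read."""
    log_sum = {b: sum(math.log(p[b]) for p in posteriors) for b in BASES}
    m = max(log_sum.values())
    unnorm = {b: math.exp(v - m) for b, v in log_sum.items()}
    z = sum(unnorm.values())
    return {b: v / z for b, v in unnorm.items()}

read1 = {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10}
read2 = {"A": 0.40, "C": 0.40, "G": 0.10, "T": 0.10}
read3 = {"A": 0.60, "C": 0.20, "G": 0.10, "T": 0.10}
fused = fuse([read1, read2, read3])
print(max(fused, key=fused.get))  # 'A'
```

Note how read2 alone is ambiguous between A and C, but fusing the soft posteriors resolves the position confidently, which is where the density gains over call-then-correct pipelines come from.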
Formal Primal-Dual Algorithm Analysis
2026-04-22 · Logic in Computer Science · Discrete Mathematics · Data Structures and Algorithms · arXiv
Abstract
We present an ongoing effort to build a framework and a library in Isabelle/HOL for formalising primal-dual arguments for the analysis of algorithms. We discuss a number of example formalisations from the theory of matching algorithms, covering classical algorithms like the Hungarian Method, widely considered the first primal-dual algorithm, and modern algorithms like the Adwords algorithm, which models the assignment of search queries to advertisers in the context of search engines.
Open → 2604.20807v1
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vi…
2026-04-22 · Computer Vision and Pattern Recognition · Artificial Intelligence · Computation and Language · arXiv
Abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
Open → 2604.20806v1
Relative Principals, Pluralistic Alignment, and the Structural Value Al…
2026-04-22 · Computers and Society · Artificial Intelligence · Multiagent Systems · arXiv
Abstract
The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.
Open → 2604.20805v1
Autonomous LLM-generated Feedback for Student Exercises in Introductory…
2026-04-22 · Software Engineering · arXiv
Abstract
Introductory Software Engineering (SE) courses face rapidly increasing student enrollment numbers, participants with diverse backgrounds, and the influence of Generative AI (GenAI) solutions. High student-to-teacher ratios make providing timely, high-quality, and personalized feedback a significant challenge for educators. To address these challenges, we introduce NAILA, a tool that provides 24/7 autonomous feedback for student exercises. Utilizing GenAI in the form of modern LLMs, NAILA processes student solutions provided in open document formats, evaluating them against teacher-defined model solutions through specialized prompt templates. We conducted an empirical study involving 900+ active students at the University of Duisburg-Essen to assess four main research questions investigating (1) the underlying motivations that drive students to either adopt or reject NAILA, (2) user acceptance by measuring perceived usefulness and ease of use alongside subjective learning progress, (3) how often and how consistently students engage with NAILA, and (4) how using NAILA to receive AI feedback impacts academic performance compared to human feedback.
Open → 2604.20803v1
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
2026-04-22 · Cryptography and Security · arXiv
Abstract
LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets where the analyst can build and instrument the code. In practice the work is split among several agents, wired together by a harness: the program that fixes which roles exist, how they pass information, which tools each may call, and how retries are coordinated. When the language model is held fixed, changing only the harness can still change success rates by several-fold on public agent benchmarks, yet most harnesses are written by hand; recent harness optimizers each search only a narrow slice of the design space and rely on coarse pass/fail feedback that gives no diagnostic signal about why a trial failed. AgentFlow addresses both limitations with a typed graph DSL whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol, paired with a feedback-driven outer loop that reads runtime signals from the target program itself to diagnose which part of the harness caused the failure and rewrite it accordingly. We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).
Open → 2604.20801v1
LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image
2026-04-22Computer Vision and Pattern RecognitionMachine Learningarxiv
Abstract
Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code & models will be public at https://anticdimi.github.io/lexis.
Open → 2604.20800v1
A Hough transform approach to safety-aware scalar field mapping using G…
2026-04-22Roboticsarxiv
Abstract
This paper presents a framework for mapping unknown scalar fields using a sensor-equipped autonomous robot operating in unsafe environments. The unsafe regions are defined as regions of high intensity, where the field value exceeds a predefined safety threshold. For safe and efficient mapping of the scalar field, the sensor-equipped robot must avoid high-intensity regions during the measurement process. In this paper, the scalar field is modeled as a sample from a Gaussian process (GP), which enables Bayesian inference and provides closed-form expressions for both the predictive mean and the uncertainty. Concurrently, the spatial structure of the high-intensity regions is estimated in real-time using the Hough transform (HT), leveraging the evolving GP posterior. A safe sampling strategy is then employed to guide the robot towards safe measurement locations, using probabilistic safety guarantees on the evolving GP posterior. The estimated high-intensity regions also facilitate the design of safe motion plans for the robot. The effectiveness of the approach is verified through two numerical simulation studies and an indoor experiment for mapping a light-intensity field using a wheeled mobile robot.
Open → 2604.20799v1
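The safety mechanism described in this abstract can be illustrated with a small sketch: a GP posterior gives a mean and standard deviation at each candidate location, and a point is treated as safe when an upper confidence bound stays below the threshold. The kernel, `beta`, and all function names below are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    # Squared-exponential kernel between point sets A (n,d) and B (m,d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, Xq, noise=1e-4):
    # Closed-form GP posterior mean and standard deviation at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = rbf(Xq, Xq).diagonal() - (v**2).sum(0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def safe_mask(mu, sigma, threshold, beta=2.0):
    # A candidate is "safe" when the upper confidence bound stays below threshold.
    return mu + beta * sigma < threshold
```

Larger `beta` makes the robot more conservative: it refuses locations whose field value is merely uncertain, not just those predicted to be hot.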
Gauge-Equivariant Graph Neural Networks for Lattice Gauge Theories
2026-04-22Machine Learningarxiv
Abstract
Local gauge symmetry underlies fundamental interactions and strongly correlated quantum matter, yet existing machine-learning approaches lack a general, principled framework for learning under site-dependent symmetries, particularly for intrinsically nonlocal observables. Here we introduce a gauge-equivariant graph neural network that embeds non-Abelian symmetry directly into message passing via matrix-valued, gauge-covariant features and symmetry-compatible updates, extending equivariant learning from global to fully local symmetries. In this formulation, message passing implements gauge-covariant transport across the lattice, allowing nonlocal correlations and loop-like structures to emerge naturally from local operations. We validate the approach across pure gauge, gauge-matter, and dynamical regimes, establishing gauge-equivariant message passing as a general paradigm for learning in systems governed by local symmetry.
Open → 2604.20797v1
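The gauge-covariant transport in the abstract above can be sketched concretely: if node features are matrices and each edge carries a unitary link variable, parallel-transporting a neighbour's feature through the link before aggregation makes the aggregate transform with only the local gauge factor. This is a minimal numerical sketch of that property, not the paper's network.

```python
import numpy as np

def covariant_messages(X, U, edges):
    # X: dict node -> (N,N) matrix feature; U: dict (i,j) -> (N,N) unitary link.
    # The message into node i from neighbour j is the parallel-transported
    # feature U_ij X_j U_ij^dagger. Under a local gauge change
    # X_i -> G_i X_i G_i^dag and U_ij -> G_i U_ij G_j^dag, the aggregate at i
    # picks up only the local factor G_i ... G_i^dag (covariance).
    msgs = {i: np.zeros_like(next(iter(X.values()))) for i in X}
    for (i, j) in edges:
        Uij = U[(i, j)]
        msgs[i] = msgs[i] + Uij @ X[j] @ Uij.conj().T
    return msgs
```

Because every message is transported before being summed, loop-like quantities (products of links around a cycle) arise automatically when messages are composed over several hops.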
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Dif…
2026-04-22Computer Vision and Pattern Recognitionarxiv
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
Open → 2604.20796v1
Automatic Ontology Construction Using LLMs as an External Layer of Memo…
2026-04-22Artificial Intelligencearxiv
Abstract
This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
Open → 2604.20795v1
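The generation-verification step in the pipeline above can be illustrated with a toy constraint check: every subject of a given type must carry a set of required predicates, and violations are reported rather than silently accepted. The vocabulary (`ex:Person`, `ex:name`) and the function are hypothetical; a real system would use rdflib graphs with SHACL shapes.

```python
def validate_triples(triples, required_predicates):
    # Toy stand-in for SHACL-style validation over (subject, predicate, object)
    # triples: every subject typed as ex:Person must carry each required
    # predicate. Returns {subject: [missing predicates]} for violations.
    subjects = {s for s, p, o in triples
                if p == "rdf:type" and o == "ex:Person"}
    report = {}
    for s in subjects:
        preds = {p for s2, p, _ in triples if s2 == s}
        missing = [p for p in required_predicates if p not in preds]
        if missing:
            report[s] = missing
    return report
```

An empty report means the generated graph passes the constraint, mirroring the abstract's generation-verification-correction loop: a non-empty report would be fed back to the LLM for correction.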
Fresh Masking Makes NTT Pipelines Composable: Machine-Checked Proofs fo…
2026-04-22Cryptography and Securityarxiv
Abstract
Post-quantum cryptographic (PQC) accelerators for ML-KEM (FIPS 203) and ML-DSA (FIPS 204) rely on pipelined Number Theoretic Transform (NTT) stages over $\mathbb{Z}_q$. Our prior work established structural dependency analysis at scale [1] and quantified the security margin of partial NTT masking [2]. Whether per-stage arithmetic masking guarantees pipeline-level security had no prior machine-checked answer for the r-bearing case: composition frameworks (ISW, t-SNI, PINI, DOM) were formalized exclusively for Boolean masking over $\mathrm{GF}(2)$; no proof assistant artifact addresses the NTT butterfly over $\mathbb{Z}_q$. We present three machine-checked results in Lean 4 with Mathlib, all zero sorry. First, we close a stated limitation of prior work: value-independence implies constant marginal distribution under fresh randomness (via an algebraic MutualInfoZero proxy). Second, butterfly per-context uniformity: for any Cooley-Tukey butterfly with fresh output mask over $\mathbb{Z}/q\mathbb{Z}$ ($q > 0$), each output wire has exactly one mask value producing each output, a uniform marginal independent of secrets, universal over all moduli, twiddle factors, and inputs. Third, a k-stage NTT pipeline with fresh per-stage masking satisfies per-context uniformity at every stage under the ISW first-order probing model. We document a named warning: pointwise value-independence is false for butterfly outputs. The Adams Bridge accelerator (CHIPS Alliance Caliptra) fails the fresh masking hypothesis, masking active only in INTT round 0, architecturally explaining its structural insecurity. Artifact: nine theorems, 1,738 build jobs, zero sorry. Composition for nonlinear gadgets (Barrett) is addressed in forthcoming manuscripts proving Barrett's PF-PINI(2) satisfaction ('one-bit barrier') [3] and k-stage composition for PF-PINI gadgets under fresh-mask renewal [4].
Open → 2604.20793v1
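The per-context uniformity result for the fresh-masked butterfly admits a one-line counting argument; the following sketch restates it in our own notation (it is not the paper's Lean formalization):

```latex
Fix $a, b, \omega \in \mathbb{Z}_q$ and consider the fresh-masked butterfly
output $y = (a + \omega b + r) \bmod q$, where the mask $r$ is drawn uniformly
from $\mathbb{Z}_q$. For every target value $v \in \mathbb{Z}_q$ there is
exactly one mask value $r = (v - a - \omega b) \bmod q$ with $y = v$, because
$r \mapsto (a + \omega b + r) \bmod q$ is a bijection on $\mathbb{Z}_q$. Hence
\[
  \Pr[y = v] = \frac{1}{q} \quad \text{for all } v \in \mathbb{Z}_q,
\]
independent of $a$, $b$, and $\omega$ --- a uniform marginal for any modulus
$q > 0$, matching the "universal over all moduli, twiddle factors, and inputs"
claim.
```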
Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in…
2026-04-22Computation and LanguageArtificial Intelligencearxiv
Abstract
Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
Open → 2604.20791v1
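The FKGL scores quoted in the abstract come from the standard Flesch-Kincaid Grade Level formula, which can be sketched in a few lines. The syllable counter below is a naive vowel-run heuristic of our own (production tools use pronunciation dictionaries), so absolute scores will differ slightly from published values.

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of vowels; real tools use dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

Grade levels around 11-12 correspond to high-school reading, so the 16.91-17.60 scores reported for GPT-5 and Claude indicate graduate-level prose, well above the physician-authored baseline.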
Working Memory Constraints Scaffold Learning in Transformers under Data…
2026-04-22Computation and LanguageArtificial IntelligenceMachine Learningarxiv
Abstract
We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width window-based and temporal-decay-based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and alignment with human reading time data. Our results indicate that these cognitively inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy, especially when training data is scarce. These constrained models also tend to show a stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic representations, especially in data-limited settings.
Open → 2604.20789v1
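The two attention variants named in the abstract have simple mask-level sketches: a fixed-width window restricts each token to its last `window` predecessors, while temporal decay penalizes distant tokens with an additive bias before the softmax. These are illustrative reconstructions under our own parameter names, not the paper's code.

```python
import numpy as np

def windowed_causal_mask(seq_len, window):
    # Token i may attend to token j only if j <= i (causal) and i - j < window,
    # a simple stand-in for a fixed-width working-memory constraint.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

def decay_bias(seq_len, rate=0.1):
    # Temporal-decay variant: additive attention bias -rate * (i - j),
    # applied before the softmax; future positions are masked to -inf.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = -rate * (i - j).astype(float)
    bias[j > i] = -np.inf  # keep causality
    return bias
```

Either array would be added to (or used to mask) the raw attention logits inside a GPT-2 attention layer, leaving the rest of the architecture unchanged.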
Designing Approximate Binary Trees for Trees
2026-04-22Data Structures and Algorithmsarxiv
Abstract
We study the following problem that is motivated by demand-aware network design: Given a tree $G$, the task is to find a binary tree $H$ on the same vertex set. The objective is to minimize the sum of distances in $H$ between vertex pairs that are adjacent in $G$. We present a linear-time factor-4 approximation for this problem.
Open → 2604.20786v1
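The objective in the abstract above is easy to state in code: given the edges of the demand tree $G$ and of a candidate host tree $H$, the cost is the sum of $H$-distances over pairs adjacent in $G$. This sketch only evaluates the objective (it is not the paper's approximation algorithm); helper names are ours.

```python
from collections import defaultdict, deque

def tree_distance(adj, src, dst):
    # BFS distance between two vertices of a tree given as an adjacency map.
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        v, d = queue.popleft()
        if v == dst:
            return d
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    raise ValueError("vertices not connected")

def demand_cost(g_edges, h_edges):
    # Sum over edges (u, v) of G of the distance between u and v in H.
    adj = defaultdict(list)
    for u, v in h_edges:
        adj[u].append(v)
        adj[v].append(u)
    return sum(tree_distance(adj, u, v) for u, v in g_edges)
```

A factor-4 approximation guarantees a host tree whose `demand_cost` is at most four times the optimum over all binary trees on the same vertex set.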
GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Spa…
2026-04-22Computer Vision and Pattern Recognitionarxiv
Abstract
Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative priors to hallucinate missing content, yet naive integration frequently causes structural drift and temporal inconsistency due to the mismatch between stochastic 2D generation and deterministic 3D geometry. In this paper, we propose GeoRect4D, a novel unified framework for sparse-view dynamic reconstruction that couples explicit 3D consistency with generative refinement via a closed-loop optimization process. Specifically, GeoRect4D introduces a degradation-aware feedback mechanism that incorporates a robust anchor-based dynamic 3DGS substrate with a single-step diffusion rectifier to hallucinate high-fidelity details. This rectifier utilizes a structural locking mechanism and spatiotemporal coordinated attention, effectively preserving physical plausibility while restoring missing content. Furthermore, we present a progressive optimization strategy that employs stochastic geometric purification to eliminate floaters and generative distillation to infuse texture details into the explicit representation. Extensive experiments demonstrate that GeoRect4D achieves state-of-the-art performance in reconstruction fidelity, perceptual quality, and spatiotemporal consistency across multiple datasets.
Open → 2604.20784v1
Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incom…
2026-04-22Machine Learningarxiv
Abstract
Internal ice layers imaged by radar provide key evidence of snow accumulation and ice dynamics, but radar-derived layer boundary observations are often incomplete, with discontinuous traces and sometimes entirely missing layers, due to limited resolution, sensor noise, and signal loss. Existing graph-based models for ice stratigraphy generally assume sufficiently complete layer profiles and focus on predicting deeper-layer thickness from reliably traced shallow layers. In this work, we address the layer-completion problem itself by synthesizing complete ice-layer thickness annotations from incomplete radar-derived layer traces by conditioning on colocated physical features synchronized from physical climate models. The proposed network combines geometric learning to aggregate within-layer spatial context with a transformer-based temporal module that propagates information across layers to encourage coherent stratigraphy and consistent thickness evolution. To learn from incomplete supervision, we optimize a mask-aware robust regression objective that evaluates errors only at observed thickness values and normalizes by the number of valid entries, enabling stable training under varying sparsity without imputation and steering completions toward physically plausible values. The model preserves observed thickness where available and infers only missing regions, recovering fragmented segments and even fully absent layers while remaining consistent with measured traces. As an additional benefit, the synthesized thickness stacks provide effective pretraining supervision for a downstream deep-layer predictor, improving fine-tuned accuracy over training from scratch on the same fully traced data.
Open → 2604.20783v1
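The mask-aware objective described above can be sketched directly: a robust (here, Huber) error is evaluated only at observed thickness entries and normalized by the number of valid entries, so training is stable across batches with different sparsity. The Huber choice and all names are our illustrative assumptions; the paper specifies only "mask-aware robust regression".

```python
import numpy as np

def masked_huber_loss(pred, target, mask, delta=1.0):
    # Huber error evaluated only where mask == 1 (observed thickness values),
    # normalized by the number of valid entries so sparse and dense layers
    # contribute comparable gradients; unobserved entries are never penalized.
    err = np.abs(pred - target)
    huber = np.where(err <= delta, 0.5 * err**2, delta * (err - 0.5 * delta))
    n_valid = max(mask.sum(), 1)
    return float((huber * mask).sum() / n_valid)
```

Because masked entries contribute nothing, the model is free to hallucinate plausible values in gaps while being anchored to measurements wherever traces exist.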
Designing a Visualization Atlas: Lessons & Reflections from The UK Co-B…
2026-04-22Human-Computer Interactionarxiv
Abstract
This paper reports on the process of designing the UK Co-Benefits Atlas, which communicates and publicizes data for climate mitigation. Visualization atlases -- an emerging type of platform to make data about complex topics comprehensible through interactive visualizations and explanatory content -- pose challenges beyond traditional visualization projects. Atlases must address diverse and often uncertain audiences and use cases, support both explanatory and guided exploration, and accommodate complex, evolving data. Over 10 months, our team of visualization and domain experts conducted 8 design workshops, iterative prototyping, 15 stakeholder onboarding sessions, and continuous reflection. These intertwined processes informed the development of the Atlas, comprising over 400 pages of visualizations and explanations. They also enabled a deeper understanding of how stakeholders may critically engage with the atlas in practice, in terms of interests, potential frictions when navigating huge amounts of data, and envisioned usage scenarios. Reflecting on our design process, we identify five driving forces in atlas design -- data, people, stories, context, and the atlas itself -- whose shifting dynamics influence different stages of visualization atlas design in different ways. Grounded in our case study, we discuss using these forces as a conceptual starting point for structuring and reflecting on future atlas design processes.
Open → 2604.20781v1
SWE-chat: Coding Agent Interactions From Real Users in the Wild
2026-04-22Artificial IntelligenceComputers and SocietySoftware Engineeringarxiv
Abstract
AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
Open → 2604.20779v1
Efficient Multi-Cohort Inference for Long-Term Effects and Lifetime Val…
2026-04-22Machine Learningarxiv
Abstract
In streaming platforms, churn is extremely costly, yet A/B tests are typically evaluated using outcomes observed within a limited experimental horizon. Even when both short- and predicted long-term engagement metrics are considered, they may fail to capture how a treatment affects users' retention. Consequently, an intervention may appear beneficial in the short term and neutral in the long term while still generating lower total value than the control due to user churn. To address this limitation, we introduce a method that estimates long-term treatment effects (LTE) and residual lifetime value change ($ΔERLV$) in short multi-cohort A/B tests under user learning. To estimate time-varying treatment effects efficiently, we develop an inverse-variance weighted estimator that combines estimates from multiple cohorts, reducing variance relative to standard approaches in the literature. The estimated treatment trajectory is then modeled as a parametric decay to recover both the asymptotic treatment effect and the cumulative value generated over time. Our framework enables simultaneous evaluation of steady-state impact and residual user value within a single experiment. Empirical results show improved precision in estimating LTE and $ΔERLV$ and identify scenarios in which relying on either short-term or long-term metrics alone would lead to incorrect product decisions.
Open → 2604.20777v1
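The inverse-variance combination step described in this abstract is standard and can be sketched in a few lines: each cohort's estimate is weighted by the reciprocal of its variance, and the pooled variance is the reciprocal of the weight sum, which is never larger than any single cohort's variance. The function below is a generic sketch, not the paper's full estimator.

```python
import numpy as np

def inverse_variance_combine(estimates, variances):
    # Pool per-cohort treatment-effect estimates with weights w_i = 1/var_i:
    #   pooled = sum(w_i * est_i) / sum(w_i),  pooled_var = 1 / sum(w_i).
    # Low-variance cohorts dominate; the pooled variance shrinks as cohorts
    # are added, which is the variance-reduction property cited above.
    w = 1.0 / np.asarray(variances, dtype=float)
    est = np.asarray(estimates, dtype=float)
    pooled = float((w * est).sum() / w.sum())
    pooled_var = float(1.0 / w.sum())
    return pooled, pooled_var
```

In the paper's setting this pooling would be applied per elapsed time since exposure, and the resulting trajectory fit with a parametric decay to extrapolate the asymptotic effect.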