This Week In Computer Science Papers
Week beginning 15th December 2025
Tap a tile to open details. Use the left sidebar to filter by category.
No filters applied
Showing 1–36 of 345
DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decod…
2025-12-15Computer Vision and Pattern RecognitionArtificial IntelligenceGraphicsarxiv
Abstract
Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
Open → 2512.13690v1
LitePT: Lighter Yet Stronger Point Transformer
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
Open → 2512.13689v1
Towards Scalable Pre-training of Visual Tokenizers for Generation
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem`` and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8\% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
Open → 2512.13687v1
Lyra: A Hardware-Accelerated RISC-V Verification Framework with Generat…
2025-12-15Hardware Architecturearxiv
Abstract
As processor designs grow more complex, verification remains bottlenecked by slow software simulation and low-quality random test stimuli. Recent research has applied software fuzzers to hardware verification, but these rely on semantically blind random mutations that may generate shallow, low-quality stimuli unable to explore complex behaviors. These limitations result in slow coverage convergence and prohibitively high verification costs. In this paper, we present Lyra, a heterogeneous RISC-V verification framework that addresses both challenges by pairing hardware-accelerated verification with an ISA-aware generative model. Lyra executes the DUT and reference model concurrently on an FPGA SoC, enabling high-throughput differential checking and hardware-level coverage collection. Instead of creating verification stimuli randomly or through simple mutations, we train a domain-specialized generative model, LyraGen, with inherent semantic awareness to generate high-quality, semantically rich instruction sequences. Empirical results show Lyra achieves up to $1.27\times$ higher coverage and accelerates end-to-end verification by up to $107\times$ to $3343\times$ compared to state-of-the-art software fuzzers, while consistently demonstrating lower convergence difficulty.
Open → 2512.13686v1
Beyond surface form: A pipeline for semantic analysis in Alzheimer's Di…
2025-12-15Computation and Languagearxiv
Abstract
Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach where texts surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify the structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores, isolating the effect of semantic information, and finding models perform similarly to if they were using the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise reducing classification accuracy. Our methodology provides a novel way of looking at what features influence model predictions, and allows the removal of possible spurious correlations. We find that just using semantic information, language model based classifiers can still detect AD. This work shows that difficult to detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration, and opening new pathways for early detection systems.
Open → 2512.13685v1
Recurrent Video Masked Autoencoders
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
Open → 2512.13684v1
I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/
Open → 2512.13683v1
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Recons…
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
Recent feed-forward reconstruction models like VGGT and $π^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction %quality with offline models while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{https://neu-vi.github.io/LASER/}{\texttt{https://neu-vi.github.io/LASER/}}$
Open → 2512.13680v1
Feedforward 3D Editing via Text-Steerable Image-to-3D
2025-12-15Computer Vision and Pattern RecognitionArtificial Intelligencearxiv
Abstract
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/
Open → 2512.13678v1
JoVA: Unified Multimodal Learning for Joint Video-Audio Generation
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
Open → 2512.13677v1
Towards Effective Model Editing for LLM Personalization
2025-12-15Computation and Languagearxiv
Abstract
Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and implicit preference questions settings.
Open → 2512.13676v1
Towards Interactive Intelligence for Digital Humans
2025-12-15Computer Vision and Pattern RecognitionComputation and LanguageGraphicsarxiv
Abstract
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
Open → 2512.13674v1
Directional Textual Inversion for Personalized Text-to-Image Generation
2025-12-15Machine LearningComputer Vision and Pattern Recognitionarxiv
Abstract
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
Open → 2512.13672v1
AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
Open → 2512.13671v1
NL2SpaTiaL: Generating Geometric Spatio-Temporal Logic Specifications f…
2025-12-15Roboticsarxiv
Abstract
Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements-an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation-verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following. Project website: https://sites.google.com/view/nl2spatial
Open → 2512.13670v1
A Scientific Reasoning Model for Organic Synthesis Procedure Generation
2025-12-15Machine Learningarxiv
Abstract
Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.
Open → 2512.13668v1
A stylometric analysis of speaker attribution from speech transcripts
2025-12-15Computation and Languagearxiv
Abstract
Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
Open → 2512.13667v1
SEDULity: A Proof-of-Learning Framework for Distributed and Secure Bloc…
2025-12-15Cryptography and SecurityDistributed, Parallel, and Cluster ComputingInformation Theoryarxiv
Abstract
The security and decentralization of Proof-of-Work (PoW) have been well-tested in existing blockchain systems. However, its tremendous energy waste has raised concerns about sustainability. Proof-of-Useful-Work (PoUW) aims to redirect the meaningless computation to meaningful tasks such as solving machine learning (ML) problems, giving rise to the branch of Proof-of-Learning (PoL). While previous studies have proposed various PoLs, they all, to some degree, suffer from security, decentralization, or efficiency issues. In this paper, we propose a PoL framework that trains ML models efficiently while maintaining blockchain security in a fully distributed manner. We name the framework SEDULity, which stands for a Secure, Efficient, Distributed, and Useful Learning-based blockchain system. Specifically, we encode the template block into the training process and design a useful function that is difficult to solve but relatively easy to verify, as a substitute for the PoW puzzle. We show that our framework is distributed, secure, and efficiently trains ML models. We further demonstrate that the proposed PoL framework can be extended to other types of useful work and design an incentive mechanism to incentivize task verification. We show theoretically that a rational miner is incentivized to train fully honestly with well-designed system parameters. Finally, we present simulation results to demonstrate the performance of our framework and validate our analysis.
Open → 2512.13666v1
Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consi…
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
Open → 2512.13665v1
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language M…
2025-12-15RoboticsComputer Vision and Pattern Recognitionarxiv
Abstract
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
Open → 2512.13660v1
Embedding-Based Rankings of Educational Resources based on Learning Out…
2025-12-15Computers and SocietyArtificial Intelligencearxiv
Abstract
As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
Open → 2512.13658v1
Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture…
2025-12-15Computation and LanguageSoftware Engineeringarxiv
Abstract
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
Open → 2512.13655v1
Large-Language Memorization During the Classification of United States…
2025-12-15Computation and LanguageArtificial IntelligenceEmerging Technologiesarxiv
Abstract
Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.
Open → 2512.13654v1
World Models Can Leverage Human Videos for Dexterous Manipulation
2025-12-15RoboticsArtificial IntelligenceComputer Vision and Pattern Recognitionarxiv
Abstract
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
Open → 2512.13644v1
Follow Nudges without Budges: A Field Experiment on Misinformation Foll…
2025-12-15Social and Information Networksarxiv
Abstract
Can digital ads encourage users exposed to inaccurate information sources to follow accurate ones? We conduct a large-scale field experiment (N=28,582) on X, formerly Twitter, with users who follow accounts that spread health misinformation. Participants were exposed to four ad treatments varied on two dimensions: a neutral message versus a persuasive message appealing to values of independence, and a request to follow a health institution versus a request to follow a health influencer. We term this ad-based, social network intervention a follow nudge. The ad with a persuasive message to follow a well-known health institution generated significantly higher click-through rates than all other conditions (Bonferroni-corrected pairwise tests, all p<0.001). Given the overall low click-through rate across treatments and the high cost of digital advertising infrastructure on X, however, we conclude that our proposed intervention -- at least in its current ad-based format -- is not a cost-effective means to improve information environments online. We discuss challenges faced when conducting large-scale experiments on X following the platform's ownership change and subsequent restrictions on data access for research purposes.
Open → 2512.13643v1
From Code to Field: Evaluating the Robustness of Convolutional Neural N…
2025-12-15Machine LearningArtificial IntelligenceComputer Vision and Pattern Recognitionarxiv
Abstract
The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE) and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios such as Defocus Blur, Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
Open → 2512.13641v1
Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to B…
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
Open → 2512.13639v1
Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accel…
2025-12-15Distributed, Parallel, and Cluster ComputingHardware Architecturearxiv
Abstract
Tile-based many-Processing Element (PE) accelerators can achieve competitive performance on General Matrix Multiplication (GEMM), but they are extremely hard to program, as their optimal software mapping is deeply coupled with hardware design which is unwieldy to manual deployment. We propose "Design in Tiles (DiT)", an automated framework connecting a deployment toolchain with a configurable executable model for these accelerators. For evaluation, we apply our framework to GEMM targeting a large acceleration configuration (e.g., 32x32 tiles, 1979 TFLOPS@FP8, 4 TB/s Bandwidth) comparable to an NVIDIA GH200. We achieve higher PE utilization than GH200 with its expert-tuned GEMM libraries, achieving 1.2-2.0x speedup across diverse matrix shapes.
Open → 2512.13638v1
MindDrive: A Vision-Language-Action Model for Autonomous Driving via On…
2025-12-15Computer Vision and Pattern RecognitionRoboticsarxiv
Abstract
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
Open → 2512.13636v1
SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient…
2025-12-15Computer Vision and Pattern Recognitionarxiv
Abstract
Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST
Open → 2512.13635v1
Universality of high-dimensional scaling limits of stochastic gradient…
2025-12-15Machine Learningarxiv
Abstract
We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.
Open → 2512.13634v1
StutterFuse: Mitigating Modality Collapse in Stuttering Detection with…
2025-12-15Machine Learningarxiv
Abstract
Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.
Open → 2512.13632v1
Certified-Everlasting Quantum NIZK Proofs
2025-12-15Cryptography and Securityarxiv
Abstract
We study non-interactive zero-knowledge proofs (NIZKs) for NP satisfying: 1) statistical soundness, 2) computational zero-knowledge and 3) certified-everlasting zero-knowledge (CE-ZK). The CE-ZK property allows a verifier of a quantum proof to revoke the proof in a way that can be checked (certified) by the prover. Conditioned on successful certification, the verifier's state can be efficiently simulated with only the statement, in a statistically indistinguishable way. Our contributions regarding these certified-everlasting NIZKs (CE-NIZKs) are as follows: - We identify a barrier to obtaining CE-NIZKs in the CRS model via generalizations of known interactive proofs that satisfy CE-ZK. - We circumvent this by constructing CE-NIZK from black-box use of NIZK for NP satisfying certain properties, along with OWFs. As a result, we obtain CE-NIZKs for NP in the CRS model, based on polynomial hardness of the learning with errors (LWE) assumption. - In addition, we observe that the aforementioned barrier does not apply to the shared EPR model. Consequently, we present a CE-NIZK for NP in this model based on any statistical binding hidden-bits generator, which can be based on LWE. The only quantum computation in this protocol involves single-qubit measurements of the shared EPR pairs.
Open → 2512.13628v1
Preconditioning Techniques for Hybridizable Discontinuous Galerkin Disc…
2025-12-15Computational Engineering, Finance, and Sciencearxiv
Abstract
We present scalable iterative solvers and preconditioning strategies for Hybridizable Discontinuous Galerkin (HDG) discretizations of partial differential equations (PDEs) on graphics processing units (GPUs). The HDG method is implemented using GPU-tailored algorithms in which local element degrees of freedom are eliminated in parallel, and the globally condensed system is assembled directly on the device using dense-block operations. The global matrix is stored in a block format that reflects the natural HDG structure, enabling all iterative solver kernels to be executed with strided batched dense matrix-vector multiplications. This implementation avoids sparse data structures, increases arithmetic intensity, and sustains high memory throughput across a range of meshes and polynomial orders. The nonlinear solver combines Newton's method with preconditioned GMRES, integrating scalable preconditioners such as block-Jacobi, additive Schwarz domain decomposition, and polynomial smoothers. All preconditioners are implemented in batched form with architecture-aware optimizations--including dense linear algebra kernels, memory-coalesced vector operations, and shared-memory acceleration--to minimize memory traffic and maximize parallel occupancy. Comprehensive studies are conducted for a variety of PDEs (including Poisson equation, Burgers equation, linear and nonlinear elasticity, Euler equations, Navier-Stokes equations, and Reynolds-Averaged Navier-Stokes equations) using structured and unstructured meshes with different element types and polynomial orders on both NVIDIA and AMD GPU architectures.
Open → 2512.13619v1
Temporal Tokenization Strategies for Event Sequence Modeling with Large…
2025-12-15Computation and LanguageMachine Learningarxiv
Abstract
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
Open → 2512.13618v1
LightTopoGAT: Enhancing Graph Attention Networks with Topological Featu…
2025-12-15Machine Learningarxiv
Abstract
Graph Neural Networks have demonstrated significant success in graph classification tasks, yet they often require substantial computational resources and struggle to capture global graph properties effectively. We introduce LightTopoGAT, a lightweight graph attention network that enhances node features through topological augmentation by incorporating node degree and local clustering coefficient to improve graph representation learning. The proposed approach maintains parameter efficiency through streamlined attention mechanisms while integrating structural information that is typically overlooked by local message passing schemes. Through comprehensive experiments on three benchmark datasets, MUTAG, ENZYMES, and PROTEINS, we show that LightTopoGAT achieves superior performance compared to established baselines including GCN, GraphSAGE, and standard GAT, with a 6.6 percent improvement in accuracy on MUTAG and a 2.2 percent improvement on PROTEINS. Ablation studies further confirm that these performance gains arise directly from the inclusion of topological features, demonstrating a simple yet effective strategy for enhancing graph neural network performance without increasing architectural complexity.
Open → 2512.13617v1