Daily Research Digest

arXiv Papers

2026-02-18
138 Papers · 4 Categories · 138 Translated
Robotics (机器人学) · 30 papers
cs.RO / 1 / 2602.15060

CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation

CLOT:用于全身类人机器人遥操作的闭环全局运动跟踪
Zhu, Tengjie, Cai, Guanyu, Yang, Zhaohui, Ren, Guanzhu, Xie, Haohui, Wang, ZiRui, Wu, Junsong, Wang, Jingbo, Yang, Xiaokang, Mu, Yao, Yan, Yichao
Abstract
Long-horizon whole-body humanoid teleoperation remains challenging due to accumulated global pose drift, particularly on full-sized humanoids. Although recent learning-based tracking methods enable agile and coordinated motions, they typically operate in the robot's local frame and neglect global pose feedback, leading to drift and instability during extended execution. In this work, we present CLOT, a real-time whole-body humanoid teleoperation system that achieves closed-loop global motion tracking via high-frequency localization feedback. CLOT synchronizes operator and robot poses in a closed loop, enabling drift-free human-to-humanoid mimicry over long time horizons. However, directly imposing global tracking rewards in reinforcement learning often results in aggressive and brittle corrections. To address this, we propose a data-driven randomization strategy that decouples observation trajectories from reward evaluation, enabling smooth and stable global corrections. We further regularize the policy with an adversarial motion prior to suppress unnatural behaviors. To support CLOT, we collect 20 hours of carefully curated human motion data for training the humanoid teleoperation policy. We design a transformer-based policy and train it for over 1300 GPU hours. The policy is deployed on a full-sized humanoid with 31 DoF (excluding hands). Both simulation and real-world experiments verify high-dynamic motion, high-precision tracking, and strong robustness in sim-to-real humanoid teleoperation. Motion data, demos, and code can be found on our website.
Chinese Translation
长时间的全身类人机器人遥操作仍然面临挑战,尤其是在全尺寸类人机器人上,由于全局姿态漂移的累积。尽管最近的基于学习的跟踪方法能够实现灵活和协调的运动,但它们通常在机器人的局部坐标系中运行,忽视了全局姿态反馈,导致在长时间执行过程中出现漂移和不稳定。在本研究中,我们提出了CLOT,一个实时的全身类人机器人遥操作系统,通过高频率的定位反馈实现闭环全局运动跟踪。CLOT在闭环中同步操作员和机器人姿态,使得人类与类人机器人之间的模仿在长时间范围内无漂移。然而,直接在强化学习中施加全局跟踪奖励,往往会导致激进和脆弱的修正。为了解决这个问题,我们提出了一种数据驱动的随机化策略,将观察轨迹与奖励评估解耦,从而实现平滑和稳定的全局修正。我们进一步通过对抗性运动先验来正则化策略,以抑制不自然的行为。为了支持CLOT,我们收集了20小时精心策划的人类运动数据,用于训练类人机器人遥操作策略。我们设计了一种基于变换器的策略,并进行了超过1300小时的GPU训练。该策略在一个具有31个自由度(不包括手部)的全尺寸类人机器人上部署。仿真和现实世界实验验证了在类人机器人遥操作中的高动态运动、高精度跟踪和强鲁棒性。运动数据、演示和代码可以在我们的网站上找到。
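The closed-loop idea at the heart of this abstract (high-frequency global localization feedback keeps accumulated pose drift bounded, while pure local odometry drifts without bound) can be illustrated with a deliberately simple 1-D toy. All constants below are invented for illustration and are not taken from CLOT:

```python
import random

def track_pose(steps, feedback_every=None, seed=0):
    """Integrate biased 1-D odometry; with periodic global localization
    feedback the accumulated error is reset, mimicking closed-loop global
    tracking. Without it, drift grows without bound. Returns final error."""
    rng = random.Random(seed)
    true_pos, est_pos = 0.0, 0.0
    for t in range(1, steps + 1):
        step = rng.uniform(0.9, 1.1)   # true per-step motion
        true_pos += step
        est_pos += step + 0.05         # odometry with a constant +0.05 bias
        if feedback_every and t % feedback_every == 0:
            est_pos = true_pos         # high-frequency localization correction
    return abs(est_pos - true_pos)

open_loop = track_pose(1005)                       # drift accumulates
closed_loop = track_pose(1005, feedback_every=10)  # drift stays bounded
```

The open-loop error grows linearly with the horizon, while the closed-loop error stays bounded by the drift accumulated between localization updates.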
cs.RO / 2 / 2602.15061

Safe-SDL: Establishing Safety Boundaries and Control Mechanisms for AI-Driven Self-Driving Laboratories

安全自驾实验室(Safe-SDL):为人工智能驱动的自驾实验室建立安全边界和控制机制
Zhang, Zihan, Que, Haohui, Chang, Junhan, Zhang, Xin, Wei, Hao, Zhu, Tong
Abstract
The emergence of Self-Driving Laboratories (SDLs) transforms scientific discovery methodology by integrating AI with robotic automation to create closed-loop experimental systems capable of autonomous hypothesis generation, experimentation, and analysis. While promising to compress research timelines from years to weeks, their deployment introduces unprecedented safety challenges differing from traditional laboratories or purely digital AI. This paper presents Safe-SDL, a comprehensive framework for establishing robust safety boundaries and control mechanisms in AI-driven autonomous laboratories. We identify and analyze the critical "Syntax-to-Safety Gap" (the disconnect between AI-generated syntactically correct commands and their physical safety implications) as the central challenge in SDL deployment. Our framework addresses this gap through three synergistic components: (1) formally defined Operational Design Domains (ODDs) that constrain system behavior within mathematically verified boundaries, (2) Control Barrier Functions (CBFs) that provide real-time safety guarantees through continuous state-space monitoring, and (3) a novel Transactional Safety Protocol (CRUTD) that ensures atomic consistency between digital planning and physical execution. We ground our theoretical contributions through analysis of existing implementations including UniLabOS and the Osprey architecture, demonstrating how these systems instantiate key safety principles. Evaluation against the LabSafety Bench reveals that current foundation models exhibit significant safety failures, demonstrating that architectural safety mechanisms are essential rather than optional. Our framework provides both theoretical foundations and practical implementation guidance for safe deployment of autonomous scientific systems, establishing the groundwork for responsible acceleration of AI-driven discovery.
Chinese Translation
自驾实验室(Self-Driving Laboratories, SDLs)的出现通过将人工智能与机器人自动化相结合,改变了科学发现的方法论,创造了能够自主生成假设、进行实验和分析的闭环实验系统。尽管其承诺将研究时间从数年压缩至数周,但其部署引入了与传统实验室或纯数字人工智能不同的前所未有的安全挑战。本文提出了安全自驾实验室(Safe-SDL),这是一个为人工智能驱动的自主实验室建立稳健安全边界和控制机制的综合框架。我们识别并分析了关键的“语法到安全的差距”(Syntax-to-Safety Gap)——即人工智能生成的语法正确命令与其物理安全含义之间的脱节——作为SDL部署中的核心挑战。我们的框架通过三个协同组件来解决这一差距:(1) 正式定义的操作设计领域(Operational Design Domains, ODDs),在数学验证的边界内约束系统行为;(2) 控制障碍函数(Control Barrier Functions, CBFs),通过持续的状态空间监控提供实时安全保障;(3) 一种新颖的事务安全协议(Transactional Safety Protocol, CRUTD),确保数字规划与物理执行之间的原子一致性。我们通过对现有实现的分析,包括UniLabOS和Osprey架构,来支撑我们的理论贡献,展示这些系统如何体现关键的安全原则。针对LabSafety Bench的评估表明,当前的基础模型表现出显著的安全失败,证明了架构安全机制是必不可少的,而非可选的。我们的框架为安全部署自主科学系统提供了理论基础和实践实施指导,为负责任地加速人工智能驱动的发现奠定了基础。
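Of the three components in this abstract, the Control Barrier Function is the most self-contained. A minimal 1-D sketch of the CBF mechanism, with toy dynamics and constants of my own choosing rather than anything from the paper: keep a state below a limit x_max by enforcing the CBF condition h_dot >= -alpha*h on the safe set h(x) = x_max - x >= 0.

```python
def cbf_safety_filter(x, u_nominal, x_max, alpha=1.0):
    """Keep the safe set h(x) = x_max - x >= 0 forward-invariant for the toy
    dynamics x_dot = u: the CBF condition h_dot >= -alpha*h reduces to
    u <= alpha*(x_max - x), so we clamp the nominal command to that bound."""
    return min(u_nominal, alpha * (x_max - x))

# Euler rollout: far from the boundary the nominal command passes through
# unchanged; near it, the filtered command lets x approach x_max = 10
# asymptotically but never cross it.
x = 0.0
for _ in range(100):
    x += 0.1 * cbf_safety_filter(x, u_nominal=2.0, x_max=10.0)
```

Real CBF filters solve a small quadratic program over multiple constraints; the scalar case above admits this closed-form clamp.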
cs.RO / 3 / 2602.15063

How Do We Research Human-Robot Interaction in the Age of Large Language Models? A Systematic Review

在大型语言模型时代,我们如何研究人机交互?一项系统性综述
Wang, Yufeng, Xu, Yuan, Nikolova, Anastasia, Wang, Yuxuan, Wang, Jianyu, Wang, Chongyang, Tong, Xin
Abstract
Advances in large language models (LLMs) are profoundly reshaping the field of human-robot interaction (HRI). While prior work has highlighted the technical potential of LLMs, few studies have systematically examined their human-centered impact (e.g., human-oriented understanding, user modeling, and levels of autonomy), making it difficult to consolidate emerging challenges in LLM-driven HRI systems. Therefore, we conducted a systematic literature search following the PRISMA guideline, identifying 86 articles that met our inclusion criteria. Our findings reveal that: (1) LLMs are transforming the fundamentals of HRI by reshaping how robots sense context, generate socially grounded interactions, and maintain continuous alignment with human needs in embodied settings; and (2) current research is largely exploratory, with different studies focusing on different facets of LLM-driven HRI, resulting in wide-ranging choices of experimental setups, study methods, and evaluation metrics. Finally, we identify key design considerations and challenges, offering a coherent overview and guidelines for future research at the intersection of LLMs and HRI.
Chinese Translation
大型语言模型(LLMs)的进步正在深刻重塑人机交互(HRI)领域。尽管先前的研究强调了LLMs的技术潜力,但很少有研究系统性地考察其以人为中心的影响(例如,以人为本的理解、用户建模和自主性水平),这使得整合LLM驱动的HRI系统中出现的新挑战变得困难。因此,我们遵循PRISMA指南进行了系统的文献检索,识别出86篇符合我们纳入标准的文章。我们的研究结果显示:(1)LLMs正在通过重塑机器人感知上下文、生成社会化互动以及在具身环境中与人类需求保持持续一致性,转变HRI的基础;(2)当前的研究主要是探索性的,不同的研究集中在LLM驱动的HRI的不同方面,导致实验设置、研究方法和评估指标的选择差异很大。最后,我们识别出关键的设计考虑因素和挑战,为未来在LLMs与HRI交叉领域的研究提供了连贯的概述和指导。
cs.RO / 4 / 2602.15092

Augmenting Human Balance with Generic Supernumerary Robotic Limbs

通过通用超数机器人肢体增强人类平衡
Qiu, Xuanyun, Verdel, Dorian, Cervantes-Culebro, Hector, Devillard, Alexis, Burdet, Etienne
Abstract
Supernumerary robotic limbs (SLs) have the potential to transform a wide range of human activities, yet their usability remains limited by key technical challenges, particularly in ensuring safety and achieving versatile control. Here, we address the critical problem of maintaining balance in the human-SLs system, a prerequisite for safe and comfortable augmentation tasks. Unlike previous approaches that developed SLs specifically for stability support, we propose a general framework for preserving balance with SLs designed for generic use. Our hierarchical three-layer architecture consists of: (i) a prediction layer that estimates human trunk and center of mass (CoM) dynamics, (ii) a planning layer that generates optimal CoM trajectories to counteract trunk movements and computes the corresponding SL control inputs, and (iii) a control layer that executes these inputs on the SL hardware. We evaluated the framework with ten participants performing forward and lateral bending tasks. The results show a clear reduction in stance instability, demonstrating the framework's effectiveness in enhancing balance. This work paves the path towards safe and versatile human-SLs interactions. [This paper has been submitted for publication to IEEE.]
Chinese Translation
超数机器人肢体(SLs)有潜力改变广泛的人类活动,但其可用性仍受到关键技术挑战的限制,特别是在确保安全性和实现多功能控制方面。在此,我们解决了人类与SLs系统中维持平衡的关键问题,这是安全和舒适增强任务的前提。与以往专门为稳定性支持开发SLs的方法不同,我们提出了一个通用框架,用于利用设计为通用用途的SLs保持平衡。我们的分层三层架构包括:(i)预测层,用于估计人类躯干和质心(CoM)动态;(ii)规划层,生成最佳的CoM轨迹以抵消躯干运动,并计算相应的SL控制输入;(iii)控制层,在SL硬件上执行这些输入。我们通过十名参与者进行前屈和侧屈任务评估了该框架。结果显示站立不稳定性明显降低,证明了该框架在增强平衡方面的有效性。这项工作为安全和多功能的人类与SLs交互铺平了道路。[本文已提交至IEEE发表。]
cs.RO / 5 / 2602.15162

A ROS2 Benchmarking Framework for Hierarchical Control Strategies in Mobile Robots for Mediterranean Greenhouses

用于地中海温室移动机器人分层控制策略的ROS2基准测试框架
Cañadas-Aránega, Fernando, Mañas-Álvarez, Francisco J., Guzmán, José L., Moreno, José C., Blanco-Claraco, José L.
Abstract
Mobile robots operating in agroindustrial environments, such as Mediterranean greenhouses, are subject to challenging conditions, including uneven terrain, variable friction, payload changes, and terrain slopes, all of which significantly affect control performance and stability. Despite the increasing adoption of robotic platforms in agriculture, the lack of standardized, reproducible benchmarks impedes fair comparisons and systematic evaluations of control strategies under realistic operating conditions. This paper presents a comprehensive benchmarking framework for evaluating mobile robot controllers in greenhouse environments. The proposed framework integrates an accurate three-dimensional model of the environment, a physics-based simulator, and a hierarchical control architecture comprising low-, mid-, and high-level control layers. Three benchmark categories are defined to enable modular assessment, ranging from actuator-level control to full autonomous navigation. Additionally, three disturbance scenarios (payload variation, terrain type, and slope) are explicitly modeled to replicate real-world agricultural conditions. To ensure objective and reproducible evaluation, standardized performance metrics are introduced, including the Squared Absolute Error (SAE), the Squared Control Input (SCI), and composite performance indices. Statistical analysis based on repeated trials is employed to mitigate the influence of sensor noise and environmental variability. The framework is further enhanced by a plugin-based architecture that facilitates seamless integration of user-defined controllers and planners. The proposed benchmark provides a robust and extensible tool for the quantitative comparison of classical, predictive, and planning-based control strategies in realistic conditions, bridging the gap between simulation-based analysis and real-world agroindustrial applications.
Chinese Translation
在农业工业环境中运行的移动机器人,如地中海温室,面临着不平坦的地形、可变摩擦、载荷变化和地形坡度等挑战性条件,这些因素显著影响控制性能和稳定性。尽管农业中对机器人平台的采用日益增加,但缺乏标准化和可重复的基准测试妨碍了在现实操作条件下对控制策略的公平比较和系统评估。本文提出了一个全面的基准测试框架,用于评估温室环境中的移动机器人控制器。该框架集成了环境的精确三维模型、基于物理的模拟器以及包含低、中和高层控制层的分层控制架构。定义了三个基准类别,以实现模块化评估,从执行器级控制到完全自主导航。此外,明确建模了三种干扰场景:载荷变化、地形类型和坡度,以复制现实世界的农业条件。为了确保客观和可重复的评估,引入了标准化的性能指标,包括平方绝对误差(Squared Absolute Error, SAE)、平方控制输入(Squared Control Input, SCI)和复合性能指数。采用基于重复试验的统计分析来减轻传感器噪声和环境变异性的影响。该框架通过插件架构进一步增强,便于无缝集成用户定义的控制器和规划器。所提出的基准测试为在现实条件下对经典、预测和基于规划的控制策略进行定量比较提供了一个强大且可扩展的工具,弥合了基于模拟的分析与现实世界农业工业应用之间的差距。
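The SAE and SCI metrics named in the abstract are straightforward accumulated quadratic costs. A minimal sketch, assuming the obvious definitions (sums of squared tracking error and squared control input over a trajectory); the composite-index weights below are illustrative placeholders, not values from the paper:

```python
def squared_absolute_error(reference, actual):
    """SAE: accumulated squared tracking error along a trajectory."""
    return sum((r - a) ** 2 for r, a in zip(reference, actual))

def squared_control_input(inputs):
    """SCI: accumulated squared control input, a proxy for actuation effort."""
    return sum(u ** 2 for u in inputs)

def composite_index(reference, actual, inputs, w_err=1.0, w_eff=0.1):
    """Composite performance index as a weighted sum of SAE and SCI;
    the weights here are hypothetical, chosen only for illustration."""
    return (w_err * squared_absolute_error(reference, actual)
            + w_eff * squared_control_input(inputs))
```

Averaging such indices over repeated trials, as the framework does, damps the influence of sensor noise on any single run.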
cs.RO / 6 / 2602.15201

DexEvolve: Evolutionary Optimization for Robust and Diverse Dexterous Grasp Synthesis

DexEvolve:用于鲁棒且多样化灵巧抓取合成的进化优化
Zurbrügg, René, Cramariuc, Andrei, Hutter, Marco
Abstract
Dexterous grasping is fundamental to robotics, yet data-driven grasp prediction heavily relies on large, diverse datasets that are costly to generate and typically limited to a narrow set of gripper morphologies. Analytical grasp synthesis can be used to scale data collection, but necessary simplifying assumptions often yield physically infeasible grasps that need to be filtered in high-fidelity simulators, significantly reducing the total number of grasps and their diversity. We propose a scalable generate-and-refine pipeline for synthesizing large-scale, diverse, and physically feasible grasps. Instead of using high-fidelity simulators solely for verification and filtering, we leverage them as an optimization stage that continuously improves grasp quality without discarding precomputed candidates. More specifically, we initialize an evolutionary search with a seed set of analytically generated, potentially suboptimal grasps. We then refine these proposals directly in a high-fidelity simulator (Isaac Sim) using an asynchronous, gradient-free evolutionary algorithm, improving stability while maintaining diversity. In addition, this refinement stage can be guided toward human preferences and/or domain-specific quality metrics without requiring a differentiable objective. We further distill the refined grasp distribution into a diffusion model for robust real-world deployment, and highlight the role of diversity for both effective training and during deployment. Experiments on a newly introduced Handles dataset and a DexGraspNet subset demonstrate that our approach achieves over 120 distinct stable grasps per object (a 1.7-6x improvement over unrefined analytical methods) while outperforming diffusion-based alternatives by 46-60% in unique grasp coverage.
Chinese Translation
灵巧抓取是机器人技术的基础,但基于数据的抓取预测在很大程度上依赖于生成成本高昂且通常仅限于狭窄抓取器形态的数据集。分析性抓取合成可以用于扩展数据收集,但必要的简化假设往往会导致物理上不可行的抓取,这需要在高保真模拟器中进行过滤,从而显著减少抓取的总数及其多样性。我们提出了一种可扩展的生成与精炼管道,用于合成大规模、多样化和物理上可行的抓取。我们不仅将高保真模拟器用于验证和过滤,还将其作为一个优化阶段,持续改进抓取质量,而不丢弃预计算的候选抓取。更具体地说,我们以一组分析生成的、可能次优的抓取作为种子集初始化进化搜索。然后,我们使用异步、无梯度的进化算法在高保真模拟器(Isaac Sim)中直接精炼这些提案,提高稳定性,同时保持多样性。此外,这一精炼阶段可以在不需要可微分目标的情况下,朝向人类偏好和/或特定领域的质量指标进行引导。我们进一步将精炼后的抓取分布提炼为一个扩散模型,以便于在现实世界中的鲁棒部署,并强调多样性在有效训练和部署过程中的重要性。在新引入的Handles数据集和DexGraspNet子集上的实验表明,我们的方法每个物体实现了超过120个独特的稳定抓取(相比未精炼的分析方法提高了1.7-6倍),同时在独特抓取覆盖率上超越了基于扩散的替代方法46-60%。
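The refine-without-discarding idea can be sketched with an elitist, gradient-free evolutionary loop on a toy objective. The quadratic `quality` function below is an invented stand-in for a simulated grasp-stability score; the paper's asynchronous search inside Isaac Sim is far richer than this:

```python
import random

def evolve(objective, seeds, generations=200, sigma=0.3, seed=0):
    """(1+1)-style evolutionary refinement per candidate: mutate, keep the
    better of parent and child. Every seed is improved in place, never
    discarded, mirroring the generate-and-refine pipeline. No gradients of
    the objective are needed."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        for i, parent in enumerate(population):
            child = parent + rng.gauss(0.0, sigma)
            if objective(child) > objective(parent):  # elitist acceptance
                population[i] = child
    return population

# Toy stand-in for a simulated grasp-stability score, optimum at 3.0.
quality = lambda g: -(g - 3.0) ** 2
refined = evolve(quality, seeds=[0.0, 6.0, -2.0])
```

Because acceptance is elitist, each refined candidate is guaranteed to score at least as well as the analytical seed it started from, which is exactly why the pipeline loses no precomputed candidates.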
cs.RO / 7 / 2602.15258

SEG-JPEG: Simple Visual Semantic Communications for Remote Operation of Automated Vehicles over Unreliable Wireless Networks

SEG-JPEG:用于不可靠无线网络下自动驾驶车辆远程操作的简单视觉语义通信
Donnelly, Sebastian, Anderson, Ruth, Economides, George, Broughton, James, Ball, Peter, Rast, Alexander, Bradley, Andrew
Abstract
Remote Operation is touted as being key to the rapid deployment of automated vehicles. Streaming imagery to control connected vehicles remotely currently requires a reliable, high throughput network connection, which can be limited in real-world remote operation deployments relying on public network infrastructure. This paper investigates how the application of computer vision assisted semantic communication can be used to circumvent data loss and corruption associated with traditional image compression techniques. By encoding the segmentations of detected road users into colour coded highlights within low resolution greyscale imagery, the required data rate can be reduced by 50% compared with conventional techniques, while maintaining visual clarity. This enables a median glass-to-glass latency of below 200 ms even when the network data rate is below 500 kbit/s, while clearly outlining salient road users to enhance situational awareness of the remote operator. The approach is demonstrated in an area of variable 4G mobile connectivity using an automated last-mile delivery vehicle. With this technique, the results indicate that large-scale deployment of remotely operated automated vehicles could be possible even on the often constrained public 4G/5G mobile network, providing the potential to expedite the nationwide roll-out of automated vehicles.
Chinese Translation
远程操作被认为是自动驾驶车辆快速部署的关键。目前,远程控制连接车辆所需的图像流传输需要可靠的高吞吐量网络连接,而在依赖公共网络基础设施的实际远程操作部署中,这种连接可能受到限制。本文研究了如何应用计算机视觉辅助的语义通信来规避传统图像压缩技术所带来的数据丢失和损坏。通过将检测到的道路使用者的分割信息编码为低分辨率灰度图像中的彩色高亮,所需的数据传输速率可以比传统技术降低50%,同时保持视觉清晰度。这使得即使在网络数据速率低于500kbit/s的情况下,玻璃到玻璃的中位延迟也可以低于200毫秒,同时清晰地勾勒出显著的道路使用者,以增强远程操作员的情境意识。该方法在4G移动连接不稳定的区域中使用自动化最后一公里配送车辆进行了演示。通过这种技术,结果表明,即使在常常受限的公共4G/5G移动网络上,远程操作的自动驾驶车辆的大规模部署也是可能的,从而有潜力加速自动驾驶车辆的全国推广。
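The bandwidth argument behind this abstract is simple arithmetic: a low-resolution greyscale stream plus a few bits per pixel of segmentation mask costs far less than full-colour video. The resolutions and bit depths below are hypothetical figures chosen only to illustrate a saving of the same order as the paper's 50%; the actual camera settings are not given in the abstract:

```python
def stream_rate_bits(width, height, fps, bits_per_pixel):
    """Raw (pre-compression) video data rate in bits per second."""
    return width * height * fps * bits_per_pixel

# Hypothetical settings: 24-bit colour baseline vs. 8-bit greyscale plus a
# 2-bit per-pixel class mask for the colour-coded road-user highlights.
baseline = stream_rate_bits(1280, 720, 30, 24)
semantic = (stream_rate_bits(1280, 720, 30, 8)
            + stream_rate_bits(1280, 720, 30, 2))
saving = 1 - semantic / baseline   # fraction of the raw rate saved
```

Compression changes the absolute numbers but not the structural point: pixels that only need a class label are much cheaper than pixels that need full colour.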
cs.RO / 8 / 2602.15309

OSCAR: An Ovipositor-Inspired Self-Propelling Capsule Robot for Colonoscopy

OSCAR:一种受卵产器启发的自推进胶囊机器人用于结肠镜检查
Atalla, Mostafa A., Sekar, Anand S., van Starkenburg, Remi, Jager, David J., Sakes, Aimée, Wiertlewski, Michaël, Breedveld, Paul
Abstract
Self-propelling robotic capsules eliminate shaft looping of conventional colonoscopy, reducing patient discomfort. However, reliably moving within the slippery, viscoelastic environment of the colon remains a significant challenge. We present OSCAR, an ovipositor-inspired self-propelling capsule robot that translates the transport strategy of parasitic wasps into a propulsion mechanism for colonoscopy. OSCAR mechanically encodes the ovipositor-inspired motion pattern through a spring-loaded cam system that drives twelve circumferential sliders in a coordinated, phase-shifted sequence. By tuning the motion profile to maximize the retract phase relative to the advance phase, the capsule creates a controlled friction anisotropy at the interface that generates net forward thrust. We developed an analytical model incorporating a Kelvin-Voigt formulation to capture the viscoelastic stick-slip interactions between the sliders and the tissue, linking the asymmetry between advance and retract phase durations to mean thrust, and slider-reversal synchronization to thrust stability. Comprehensive force characterization experiments in ex-vivo porcine colon revealed a mean steady-state traction force of 0.85 N, closely matching the model. Furthermore, experiments confirmed that thrust generation is speed-independent and scales linearly with the phase asymmetry, in agreement with theoretical predictions, underscoring the capsule's predictable performance and scalability. In locomotion validation experiments, OSCAR demonstrated robust performance, achieving an average speed of 3.08 mm/s, a velocity sufficient to match the cecal intubation times of conventional colonoscopy. By coupling phase-encoded friction anisotropy with a predictive model, OSCAR delivers controllable thrust generation at low normal loads, enabling safer and more robust self-propelling locomotion for robotic capsule colonoscopy.
Chinese Translation
自推进机器人胶囊消除了传统结肠镜检查中的轴环绕现象,从而减少了患者的不适。然而,在结肠滑腻的粘弹性环境中可靠地移动仍然是一个重大挑战。我们提出了OSCAR,一种受卵产器启发的自推进胶囊机器人,将寄生蜂的运输策略转化为结肠镜检查的推进机制。OSCAR通过一个弹簧加载凸轮系统机械编码了受卵产器启发的运动模式,该系统以协调的相位偏移序列驱动十二个周向滑块。通过调整运动轮廓以最大化收回阶段相对于推进阶段的时间,胶囊在界面上产生可控的摩擦各向异性,从而产生净向前推力。我们开发了一个包含Kelvin-Voigt模型的分析模型,以捕捉滑块与组织之间的粘弹性粘滑相互作用,将推进阶段和收回阶段持续时间之间的不对称性与平均推力联系起来,并将滑块反转同步与推力稳定性关联。对离体猪结肠的全面力学特性实验显示,平均稳态牵引力为0.85 N,接近模型预测。此外,实验确认推力生成与速度无关,并与相位不对称性线性相关,这与理论预测一致,强调了胶囊的可预测性能和可扩展性。在运动验证实验中,OSCAR表现出强大的性能,平均速度达到3.08 mm/s,足以匹配传统结肠镜检查的盲肠插管时间。通过将相位编码的摩擦各向异性与预测模型相结合,OSCAR在低法向载荷下实现了可控的推力生成,从而为机器人胶囊结肠镜检查提供了更安全、更稳健的自推进运动。
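The Kelvin-Voigt formulation mentioned in the abstract is a standard viscoelastic model: a spring and a damper in parallel, so stress depends on both strain and strain rate. A one-line sketch with arbitrary illustrative parameters (not tissue values from the paper):

```python
def kelvin_voigt_stress(strain, strain_rate, E=1.0, eta=0.5):
    """Kelvin-Voigt model: spring (modulus E) in parallel with a damper
    (viscosity eta), sigma = E*strain + eta*strain_rate. E and eta are
    arbitrary illustrative values, not identified tissue parameters."""
    return E * strain + eta * strain_rate

# Same deformation at two different rates: the viscous term makes the
# tissue resist fast motion more than slow motion. This rate dependence
# is the source of the friction anisotropy the capsule exploits when the
# advance and retract phases are given unequal durations.
fast_phase = kelvin_voigt_stress(0.1, 1.0)
slow_phase = kelvin_voigt_stress(0.1, 0.1)
```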
cs.RO / 9 / 2602.15351

Feasibility-aware Imitation Learning from Observation with Multimodal Feedback

考虑可行性的基于观察的模仿学习与多模态反馈
Takahashi, Kei, Sasaki, Hikaru, Matsubara, Takamitsu
Abstract
Imitation learning frameworks that learn robot control policies from demonstrators' motions via hand-mounted demonstration interfaces have attracted increasing attention. However, due to differences in physical characteristics between demonstrators and robots, this approach faces two limitations: i) the demonstration data do not include robot actions, and ii) the demonstrated motions may be infeasible for robots. These limitations make policy learning difficult. To address them, we propose Feasibility-Aware Behavior Cloning from Observation (FABCO). FABCO integrates behavior cloning from observation, which complements robot actions using robot dynamics models, with feasibility estimation. In feasibility estimation, the demonstrated motions are evaluated using a robot-dynamics model, learned from the robot's execution data, to assess reproducibility under the robot's dynamics. The estimated feasibility is used for multimodal feedback and feasibility-aware policy learning to improve the demonstrator's motions and learn robust policies. Multimodal feedback provides feasibility through the demonstrator's visual and haptic senses to promote feasible demonstrated motions. Feasibility-aware policy learning reduces the influence of demonstrated motions that are infeasible for robots, enabling the learning of policies that robots can execute stably. We conducted experiments with 15 participants on two tasks and confirmed that FABCO improves imitation learning performance by more than 3.2 times compared to the case without feasibility feedback.
Chinese Translation
通过手持演示接口从示范者的动作中学习机器人控制策略的模仿学习框架引起了越来越多的关注。然而,由于示范者与机器人之间在物理特性上的差异,这种方法面临两个限制:i) 演示数据不包括机器人的动作,ii) 演示的动作对于机器人而言可能不可行。这些限制使得策略学习变得困难。为了解决这些问题,我们提出了考虑可行性的基于观察的行为克隆(Feasibility-Aware Behavior Cloning from Observation,FABCO)。FABCO将基于观察的行为克隆与可行性评估相结合,利用机器人动力学模型补充机器人的动作。在可行性评估中,使用从机器人执行数据中学习的机器人动力学模型对演示的动作进行评估,以判断在机器人动力学下的可重现性。估计的可行性用于多模态反馈和考虑可行性的策略学习,以改善示范者的动作并学习稳健的策略。多模态反馈通过示范者的视觉和触觉感知提供可行性,以促进可行的演示动作。考虑可行性的策略学习减少了对机器人而言不可行的演示动作的影响,从而使得学习能够稳定执行的策略成为可能。我们对15名参与者进行了两项任务的实验,确认FABCO在模仿学习性能上比没有可行性反馈的情况提高了超过3.2倍。
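The feasibility-aware policy-learning idea (down-weight demonstrations the robot cannot reproduce) can be sketched as a weighted imitation loss. This is my own minimal reading of the mechanism; FABCO's actual objective is not spelled out in the abstract:

```python
def feasibility_weighted_loss(predictions, targets, feasibility):
    """Behaviour-cloning loss where each demonstrated action is weighted by
    its estimated feasibility in [0, 1], so motions that are infeasible
    under the learned robot-dynamics model contribute less to training."""
    total = sum(w * (p - t) ** 2
                for p, t, w in zip(predictions, targets, feasibility))
    return total / max(sum(feasibility), 1e-8)
```

With all weights at 1 this reduces to ordinary mean squared error; driving a weight to 0 removes that demonstration from the gradient entirely.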
cs.RO / 10 / 2602.15354

A Comparison of Bayesian Prediction Techniques for Mobile Robot Trajectory Tracking

移动机器人轨迹跟踪的贝叶斯预测技术比较
Peralta-Cabezas, Jose Luis, Torres-Torriti, Miguel, Guarini-Hermann, Marcelo
Abstract
This paper presents a performance comparison of different estimation and prediction techniques applied to the problem of tracking multiple robots. The main performance criteria are the magnitude of the estimation or prediction error, the computational effort, and the robustness of each method to non-Gaussian noise. Among the techniques compared are the well-known Kalman filters and their variants (e.g. extended and unscented), and more recent techniques relying on Sequential Monte Carlo sampling, such as particle filters and the Gaussian Mixture Sigma Point Particle Filter.
Chinese Translation
本文对应用于多机器人跟踪问题的不同估计和预测技术进行了性能比较。主要的性能标准包括估计或预测误差的大小、计算工作量以及每种方法对非高斯噪声的鲁棒性。比较的不同技术包括众所周知的卡尔曼滤波器及其不同变体(如扩展卡尔曼滤波器和无迹卡尔曼滤波器),以及依赖于序列蒙特卡洛采样方法的较新技术,如粒子滤波器和高斯混合西格玛点粒子滤波器。
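The baseline in any such comparison is the linear Kalman filter. A scalar sketch of its predict/update cycle for a random-walk state, with noise variances picked only for illustration:

```python
def kalman_1d(measurements, q=0.01, r=1.0):
    """Scalar Kalman filter for a random-walk state: the predict step
    inflates the variance by process noise q, the update blends each
    measurement in with gain K = P / (P + r). Returns filtered estimates."""
    x, p = measurements[0], 1.0
    estimates = [x]
    for z in measurements[1:]:
        p += q                    # predict: variance grows
        k = p / (p + r)           # Kalman gain in (0, 1)
        x += k * (z - x)          # update: move toward the measurement
        p *= (1 - k)              # posterior variance shrinks
        estimates.append(x)
    return estimates

smoothed = kalman_1d([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
```

The extended and unscented variants the paper compares replace the linear predict/update with local linearization and sigma-point propagation respectively, while particle filters drop the Gaussian assumption altogether, which is what the robustness-to-non-Gaussian-noise criterion probes.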
cs.RO / 11 / 2602.15357

Fluoroscopy-Constrained Magnetic Robot Control via Zernike-Based Field Modeling and Nonlinear MPC

基于Zernike多项式场建模和非线性模型预测控制的荧光透视约束磁性机器人控制
Chen, Xinhao, Yao, Hongkun, Bhattacharjee, Anuruddha, Raval, Suraj, Mair, Lamar O., Diaz-Mercado, Yancy, Krieger, Axel
Abstract
Magnetic actuation enables surgical robots to navigate complex anatomical pathways while reducing tissue trauma and improving surgical precision. However, clinical deployment is limited by the challenges of controlling such systems under fluoroscopic imaging, which provides low frame rate and noisy pose feedback. This paper presents a control framework that remains accurate and stable under such conditions by combining a nonlinear model predictive control (NMPC) framework that directly outputs coil currents, an analytically differentiable magnetic field model based on Zernike polynomials, and a Kalman filter to estimate the robot state. Experimental validation is conducted with two magnetic robots in a 3D-printed fluid workspace and a spine phantom replicating drug delivery in the epidural space. Results show the proposed control method remains highly accurate when feedback is downsampled to 3 Hz with added Gaussian noise (σ = 2 mm), mimicking clinical fluoroscopy. In the spine phantom experiments, the proposed method successfully executed a drug delivery trajectory with a root mean square (RMS) position error of 1.18 mm while maintaining safe clearance from critical anatomical boundaries.
Chinese Translation
磁性驱动使外科机器人能够在复杂的解剖路径中导航,同时减少组织创伤并提高外科精度。然而,由于荧光透视成像提供的低帧率和噪声姿态反馈,临床应用受到限制。本文提出了一种控制框架,通过结合直接输出线圈电流的非线性模型预测控制(NMPC)框架、基于Zernike多项式的解析可微磁场模型以及用于估计机器人状态的卡尔曼滤波器,在这种条件下保持准确性和稳定性。实验验证在一个3D打印的流体工作空间和一个脊柱模型中进行,模拟药物在硬膜外空间的输送。结果表明,所提出的控制方法在反馈降采样至3 Hz并添加高斯噪声(σ = 2 mm)时仍保持高度准确,模拟临床荧光透视。在脊柱模型实验中,所提出的方法成功执行了药物输送轨迹,均方根(RMS)位置误差为1.18 mm,同时保持与关键解剖边界的安全间距。
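Zernike polynomials, the basis behind the paper's differentiable field model, have a standard closed form for their radial part. A direct implementation of that textbook formula (the paper's full 3-D field fit is of course more involved than evaluating one radial polynomial):

```python
from math import factorial

def zernike_radial(n, m, rho):
    """Radial Zernike polynomial R_n^m(rho) on the unit disk, for
    n >= m >= 0 with n - m even:
    R_n^m(rho) = sum_k (-1)^k (n-k)! /
                 (k! ((n+m)/2 - k)! ((n-m)/2 - k)!) * rho^(n-2k).
    Polynomial bases like this are analytically differentiable, which is
    what makes them attractive inside gradient-based NMPC."""
    assert n >= m >= 0 and (n - m) % 2 == 0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        coeff = ((-1) ** k * factorial(n - k)
                 / (factorial(k) * factorial((n + m) // 2 - k)
                    * factorial((n - m) // 2 - k)))
        total += coeff * rho ** (n - 2 * k)
    return total
```

For example, R_2^0(rho) = 2 rho^2 - 1 and R_4^0(rho) = 6 rho^4 - 6 rho^2 + 1, both recovered by the loop above.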
cs.RO / 12 / 2602.15397

ActionCodec: What Makes for Good Action Tokenizers

ActionCodec:优秀动作标记器的设计要素
Dong, Zibin, Liu, Yicheng, Zhang, Shiduo, Ye, Baijun, Yuan, Yifu, Ni, Fei, Gong, Jingjing, Qiu, Xipeng, Zhao, Hang, Li, Yinchuan, Hao, Jianye
Abstract
Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
Chinese Translation
利用视觉-语言模型(VLMs)原生自回归范式的视觉-语言-动作(VLA)模型在指令跟随和训练效率方面表现出色。该范式的核心是动作标记化,但其设计主要集中在重建保真度上,未能解决其对VLA优化的直接影响。因此,优秀动作标记器的设计要素这一基本问题仍未得到解答。本文通过从VLA优化的角度建立设计原则来填补这一空白。我们基于信息论的见解识别出一系列最佳实践,包括最大化时间标记重叠、最小化词汇冗余、增强多模态互信息和标记独立性。在这些原则的指导下,我们提出了ActionCodec,一种高性能的动作标记器,显著提高了训练效率和VLA在各种模拟和现实世界基准测试中的表现。值得注意的是,在LIBERO上,使用ActionCodec微调的SmolVLM2-2.2B模型在没有任何机器人预训练的情况下达到了95.5%的成功率。通过先进的架构增强,这一成功率提升至97.4%,创造了无机器人预训练VLA模型的新SOTA。我们相信,所建立的设计原则以及发布的模型将为社区提供清晰的路线图,以开发更有效的动作标记器。
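To make the term "action tokenization" concrete: the simplest possible tokenizer just bins a continuous action into a discrete vocabulary and decodes back to the bin centre. This toy illustrates only the tokenize/detokenize round trip and its reconstruction error; learned codecs such as ActionCodec are far richer, and nothing below is taken from the paper:

```python
def make_tokenizer(low, high, vocab_size):
    """Uniform-binning action tokenizer: maps a continuous action in
    [low, high] to one of vocab_size tokens and back to the bin centre.
    Round-trip error is at most half a bin width."""
    width = (high - low) / vocab_size

    def encode(action):
        idx = int((min(max(action, low), high) - low) / width)
        return min(idx, vocab_size - 1)       # clamp the action == high edge

    def decode(token):
        return low + (token + 0.5) * width    # bin centre

    return encode, decode

encode, decode = make_tokenizer(-1.0, 1.0, 256)
```

The paper's design axes (vocabulary redundancy, temporal token overlap, token independence) are exactly the properties such a naive scheme ignores, which is what separates a good codec from plain binning.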
cs.RO / 13 / 2602.15398

Hybrid F' and ROS2 Architecture for Vision-Based Autonomous Flight: Design and Experimental Validation

基于视觉的自主飞行的混合 F' 和 ROS2 架构:设计与实验验证
Metwally, Abdelrahman, James, Monijesu, Fedoseev, Aleksey, Cabrera, Miguel Altamirano, Tsetserukou, Dzmitry, Somov, Andrey
Abstract
Autonomous aerospace systems require architectures that balance deterministic real-time control with advanced perception capabilities. This paper presents an integrated system combining NASA's F' flight software framework with ROS2 middleware via Protocol Buffers bridging. We evaluate the architecture through a 32.25-minute indoor quadrotor flight test using vision-based navigation. The vision system achieved 87.19 Hz position estimation with 99.90% data continuity and 11.47 ms mean latency, validating real-time performance requirements. All 15 ground commands executed successfully with a 100% success rate, demonstrating robust F'-PX4 integration. System resource utilization remained low (15.19% CPU, 1,244 MB RAM) with zero stale telemetry messages, confirming efficient operation on embedded platforms. Results validate the feasibility of hybrid flight-software architectures combining certification-grade determinism with flexible autonomy for autonomous aerial vehicles.
Chinese Translation
自主航空系统需要在确定性实时控制与先进感知能力之间取得平衡。本文提出了一种集成系统,将 NASA 的 F' 飞行软件框架与 ROS2 中间件通过协议缓冲区(Protocol Buffers)进行桥接。我们通过一次 32.25 分钟的室内四旋翼飞行测试评估该架构,该测试采用基于视觉的导航。视觉系统实现了 87.19 Hz 的位置估计,数据连续性达到 99.90%,平均延迟为 11.47 毫秒,验证了实时性能要求。所有 15 个地面指令均成功执行,成功率为 100%,展示了 F' 与 PX4 的强健集成。系统资源利用率保持在低水平(CPU 15.19%,RAM 1,244 MB),且没有过时的遥测消息,确认了在嵌入式平台上的高效运行。结果验证了将认证级确定性与灵活自主相结合的混合飞行软件架构在自主飞行器中的可行性。
cs.RO / 14 / 2602.15400

One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation

一个代理引导他们所有人:通过明确的世界表示赋能多模态大型语言模型进行视觉与语言导航
Li, Zerui, Zheng, Hongpei, Zhao, Fangguo, Chan, Aidan, Zhou, Jian, Lin, Sihao, Li, Shijie, Wu, Qi
Abstract
A navigable agent needs to understand both high-level semantic instructions and precise spatial perceptions. Building navigation agents centered on Multimodal Large Language Models (MLLMs) demonstrates a promising solution due to their powerful generalization ability. However, the current tightly coupled design dramatically limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich and consistent information, allowing MLLMs to interact with and reason on it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit MLLMs' capacity, while the metric world representation ensures the physical validity of the produced actions. We conduct comprehensive experiments in both simulated and real-world environments. Our method establishes a new zero-shot state-of-the-art, achieving a 48.8% Success Rate (SR) on R2R-CE and 42.2% on the RxR-CE benchmark. Furthermore, to validate the versatility of our metric representation, we demonstrate zero-shot sim-to-real transfer across diverse embodiments, including a wheeled TurtleBot 4 and a custom-built aerial drone. These real-world deployments verify that our decoupled framework serves as a robust, domain-invariant interface for embodied Vision-and-Language navigation.
Chinese Translation
一个可导航的代理需要理解高层次的语义指令和精确的空间感知。基于多模态大型语言模型(MLLMs)构建导航代理展示了一个有前景的解决方案,因为它们具有强大的泛化能力。然而,当前紧密耦合的设计显著限制了系统性能。在本研究中,我们提出了一种解耦设计,将低层次的空间状态估计与高层次的语义规划分开。与依赖于预定义的、过于简化的文本地图的先前方法不同,我们引入了一种交互式度量世界表示,保持丰富且一致的信息,使得MLLMs能够与其互动并进行推理以支持决策。此外,引入反事实推理进一步激发了MLLMs的能力,而度量世界表示确保了所产生动作的物理有效性。我们在模拟和现实环境中进行了全面实验。我们的方法建立了新的零样本最先进水平,在R2R-CE基准中实现了48.8%的成功率(SR),在RxR-CE基准中实现了42.2%。此外,为了验证我们度量表示的多样性,我们展示了在多种体现形式之间的零样本模拟到现实转移,包括轮式TurtleBot 4和定制的空中无人机。这些现实世界的部署验证了我们的解耦框架作为一个强健的、领域不变的接口,适用于具身视觉与语言导航。
cs.RO / 15 / 2602.15424

Lyapunov-Based $\mathcal{L}_2$-Stable PI-Like Control of a Four-Wheel Independently Driven and Steered Robot

四轮独立驱动与转向机器人的基于李雅普诺夫的 $\mathcal{L}_2$ 稳定 PI 类控制
Ćaran, Branimir, Milić, Vladimir, Jerbić, Bojan
Abstract
In this letter, Lyapunov-based synthesis of a PI-like controller is proposed for $\mathcal{L}_2$-stable motion control of an independently driven and steered four-wheel mobile robot. An explicit, structurally verified model is used to enable systematic controller design with stability and performance guarantees suitable for real-time operation. A Lyapunov function is constructed to yield explicit bounds and $\mathcal{L}_2$ stability results, supporting feedback synthesis that reduces configuration dependent effects. The resulting control law maintains a PI-like form suitable for standard embedded implementation while preserving rigorous stability properties. Effectiveness and robustness are demonstrated experimentally on a real four-wheel mobile robot platform.
Chinese Translation
在本文中,提出了一种基于李雅普诺夫的 PI 类控制器合成方法,用于独立驱动和转向的四轮移动机器人进行 $\mathcal{L}_2$ 稳定运动控制。采用一种显式且结构上经过验证的模型,以实现系统化的控制器设计,并确保适合实时操作的稳定性和性能保证。构建了一个李雅普诺夫函数,以提供显式界限和 $\mathcal{L}_2$ 稳定性结果,支持反馈合成,减少配置依赖效应。所得控制律保持了适合标准嵌入式实现的 PI 类形式,同时保留了严格的稳定性特性。通过在真实的四轮移动机器人平台上的实验,验证了该方法的有效性和鲁棒性。
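The "PI-like form suitable for standard embedded implementation" that the abstract emphasizes is, at its core, the familiar discrete proportional-integral law. A generic sketch with invented gains, shown stabilizing a pure-integrator toy plant (this illustrates the PI form only, not the paper's Lyapunov-based gain synthesis):

```python
def make_pi_controller(kp, ki, dt, u_limit=None):
    """Discrete PI-like control law u = kp*e + ki*integral(e)*, with an
    optional saturation bound standing in for embedded actuator limits.
    Gains here are arbitrary; the paper derives them via a Lyapunov
    function to guarantee L2 stability."""
    state = {"integral": 0.0}

    def control(error):
        state["integral"] += error * dt
        u = kp * error + ki * state["integral"]
        if u_limit is not None:
            u = max(-u_limit, min(u_limit, u))
        return u

    return control

# Closed loop on a pure-integrator plant x' = u, tracking a setpoint of 1.0.
pi = make_pi_controller(kp=2.0, ki=0.5, dt=0.01)
x = 0.0
for _ in range(2000):                 # 20 s of simulated time
    x += pi(1.0 - x) * 0.01
```

With kp, ki > 0 the closed loop here is a stable second-order system (characteristic polynomial s^2 + kp*s + ki), so x settles at the setpoint.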
cs.RO / 16 / 2602.15513

Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling

通过人类启发的记忆建模提升具身探索和问答中的多模态大语言模型
Li, Ji, Xia, Jing, Li, Mingyi, Hu, Shiyan
Abstract
Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM MatchXSPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
Chinese Translation
将多模态大语言模型作为具身智能体的大脑仍然面临挑战,尤其是在长时间观察和有限上下文预算的情况下。现有的记忆辅助方法通常依赖于文本摘要,这会丢失丰富的视觉和空间细节,并且在非平稳环境中表现脆弱。在本研究中,我们提出了一种非参数记忆框架,明确区分具身探索和问答中的情节记忆与语义记忆。我们的检索优先、推理辅助的范式通过语义相似性回忆情节经验,并通过视觉推理进行验证,从而实现过去观察的稳健重用,而无需严格的几何对齐。同时,我们引入了一种程序式规则提取机制,将经验转化为结构化、可重用的语义记忆,促进跨环境的泛化。大量实验表明,在具身问答和探索基准测试中,我们的方法实现了最先进的性能,在 A-EQA 上 LLM-Match 提升了 7.3%,LLM MatchXSPL 提升了 11.4%,在 GOAT-Bench 上成功率提升了 7.7%,SPL 提升了 6.8%。分析表明,我们的情节记忆主要提高了探索效率,而语义记忆则增强了具身智能体的复杂推理能力。
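The "retrieval-first" step above amounts to recalling stored observations by embedding similarity before any expensive reasoning runs. A minimal sketch of such an episodic store under assumed cosine-similarity retrieval; the class, 3-D embeddings, and frame names are illustrative, and the visual-verification stage is omitted.

```python
import numpy as np

class EpisodicMemory:
    """Toy episodic memory: recall past observations by cosine similarity."""

    def __init__(self):
        self.embeddings = []   # one unit vector per stored observation
        self.payloads = []     # the observation itself (frame id, pose, ...)

    def store(self, embedding, payload):
        v = np.asarray(embedding, dtype=float)
        self.embeddings.append(v / np.linalg.norm(v))
        self.payloads.append(payload)

    def recall(self, query, k=2):
        """Return top-k (payload, similarity) pairs for the query embedding."""
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.array([e @ q for e in self.embeddings])
        order = np.argsort(-sims)[:k]
        return [(self.payloads[i], float(sims[i])) for i in order]

mem = EpisodicMemory()
mem.store([1, 0, 0], "frame_kitchen")
mem.store([0, 1, 0], "frame_hallway")
mem.store([0.9, 0.1, 0], "frame_stove")
hits = mem.recall([1, 0, 0], k=2)
print(hits[0][0])   # frame_kitchen
```

In the paper's pipeline, the recalled candidates would then be verified by visual reasoning rather than trusted directly.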
cs.RO / 17 / 2602.15533

Efficient Knowledge Transfer for Jump-Starting Control Policy Learning of Multirotors through Physics-Aware Neural Architectures

通过物理感知神经架构高效知识转移以加速多旋翼控制策略学习
Rehberg, Welf, Kulkarni, Mihir, Weiss, Philipp, Alexis, Kostas
Abstract
Efficiently training control policies for robots is a major challenge that can greatly benefit from utilizing knowledge gained from training similar systems through cross-embodiment knowledge transfer. In this work, we focus on accelerating policy training using a library-based initialization scheme that enables effective knowledge transfer across multirotor configurations. By leveraging a physics-aware neural control architecture that combines a reinforcement learning-based controller and a supervised control allocation network, we enable the reuse of previously trained policies. To this end, we utilize a policy evaluation-based similarity measure that identifies suitable policies for initialization from a library. We demonstrate that this measure correlates with the reduction in environment interactions needed to reach target performance and is therefore suited for initialization. Extensive simulation and real-world experiments confirm that our control architecture achieves state-of-the-art control performance, and that our initialization scheme saves on average up to 73.5% of environment interactions (compared to training a policy from scratch) across diverse quadrotor and hexarotor designs, paving the way for efficient cross-embodiment transfer in reinforcement learning.
Chinese Translation
高效训练机器人控制策略是一个主要挑战,利用通过跨具身知识转移从训练相似系统中获得的知识可以带来显著的好处。在本研究中,我们专注于使用基于库的初始化方案加速策略训练,该方案能够实现多旋翼配置之间的有效知识转移。通过利用一种物理感知的神经控制架构,该架构结合了基于强化学习的控制器和监督控制分配网络,我们实现了对先前训练策略的重用。为此,我们利用基于策略评估的相似性度量,从库中识别适合初始化的策略。我们证明了该度量与达到目标性能所需的环境交互减少之间存在相关性,因此适合用于初始化。广泛的仿真和实际实验确认我们的控制架构实现了最先进的控制性能,并且我们的初始化方案在不同的四旋翼和六旋翼设计中平均节省了高达73.5%的环境交互(与从头训练策略相比),为强化学习中的高效跨具身转移铺平了道路。
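The library-selection idea above can be sketched as: score each stored policy by rolling it out on the target embodiment, then initialize training from the best scorer. Everything below (the toy 1-D dynamics, the policy gains, the names) is an assumed stand-in for the paper's policy-evaluation-based similarity measure.

```python
def evaluate(policy, env_step, horizon=50):
    """Return of `policy` on the target system over one toy rollout."""
    total, state = 0.0, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = env_step(state, action)
        total += reward
    return total

def select_initialization(library, env_step):
    """Pick the library policy whose evaluation return is highest."""
    scores = {name: evaluate(p, env_step) for name, p in library.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical target "hexarotor": reward favors keeping the state near zero.
def env_step(state, action):
    new_state = 0.9 * state + 0.1 * action + 0.01   # drift + control effect
    return new_state, -abs(new_state)

library = {
    "quad_small": lambda s: -2.0 * s,   # aggressive stabilizer
    "quad_large": lambda s: -0.5 * s,   # sluggish
    "hex_old":    lambda s: -1.0 * s,   # moderate
}
best, scores = select_initialization(library, env_step)
print(best)   # quad_small
```

The paper's measure is tied to the reduction in environment interactions needed to reach target performance; this toy simply ranks by rollout return to show the selection mechanics.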
cs.RO / 18 / 2602.15543

Selective Perception for Robot: Task-Aware Attention in Multimodal VLA

机器人选择性感知:多模态视觉-语言-动作中的任务感知注意力
Son, Young-Chae, Lee, Jung-Woo, Choi, Yoon-Ji, Ko, Dae-Kwan, Lim, Soo-Chul
Abstract
In robotics, Vision-Language-Action (VLA) models that integrate diverse multimodal signals from multi-view inputs have emerged as an effective approach. However, most prior work adopts static fusion that processes all visual inputs uniformly, which incurs unnecessary computational overhead and allows task-irrelevant background information to act as noise. Inspired by the principles of human active perception, we propose a dynamic information fusion framework designed to maximize the efficiency and robustness of VLA models. Our approach introduces a lightweight adaptive routing architecture that analyzes the current text prompt and observations from a wrist-mounted camera in real time to predict the task-relevance of multiple camera views. By conditionally attenuating computations for views with low informational utility and selectively providing only essential visual features to the policy network, our framework achieves computation efficiency proportional to task relevance. Furthermore, to efficiently secure large-scale annotation data for router training, we established an automated labeling pipeline utilizing Vision-Language Models (VLMs) to minimize data collection and annotation costs. Experimental results in real-world robotic manipulation scenarios demonstrate that the proposed approach achieves significant improvements in both inference efficiency and control performance compared to existing VLA models, validating the effectiveness and practicality of dynamic information fusion in resource-constrained, real-time robot control environments.
Chinese Translation
在机器人技术中,视觉-语言-动作(VLA)模型通过整合来自多视角输入的多样化多模态信号,已成为一种有效的方法。然而,大多数先前的研究采用静态融合,统一处理所有视觉输入,这导致了不必要的计算开销,并允许与任务无关的背景信息充当噪声。受到人类主动感知原则的启发,我们提出了一种动态信息融合框架,旨在最大化VLA模型的效率和鲁棒性。我们的方法引入了一种轻量级自适应路由架构,实时分析当前文本提示和来自腕部相机的观察,以预测多个相机视角的任务相关性。通过有条件地减弱低信息效用视角的计算,并选择性地仅向策略网络提供必要的视觉特征,我们的框架实现了与任务相关性成比例的计算效率。此外,为了高效地获取大规模注释数据以进行路由器训练,我们建立了一个自动标注管道,利用视觉-语言模型(VLMs)来最小化数据收集和注释成本。在真实世界的机器人操作场景中的实验结果表明,与现有的VLA模型相比,所提出的方法在推理效率和控制性能上都取得了显著改善,验证了动态信息融合在资源受限的实时机器人控制环境中的有效性和实用性。
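The routing mechanism above reduces, at its core, to gating camera views by a predicted relevance score before the expensive vision backbone runs. A minimal sketch; the scorer is stubbed out with fixed numbers, and the names and threshold are illustrative assumptions.

```python
import numpy as np

def route_views(relevance_scores, views, threshold=0.3):
    """Keep only views whose predicted task-relevance clears the threshold."""
    kept = {name: feat for (name, feat), r
            in zip(views.items(), relevance_scores) if r >= threshold}
    return kept

views = {
    "wrist":    np.ones(4),        # raw features per camera view (toy)
    "overhead": np.ones(4) * 2,
    "side":     np.ones(4) * 3,
}
# In the paper, a lightweight router predicts these from the text prompt
# and the wrist-camera observation; here they are hard-coded.
scores = [0.9, 0.5, 0.1]
active = route_views(scores, views)
print(sorted(active))   # ['overhead', 'wrist'] -- the 'side' view is skipped
```

Only the surviving views' features would be fed to the policy network, which is how compute scales with task relevance.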
cs.RO / 19 / 2602.15549

VLM-DEWM: Dynamic External World Model for Verifiable and Resilient Vision-Language Planning in Manufacturing

VLM-DEWM:面向制造环境中可验证且具韧性的视觉-语言规划的动态外部世界模型
Tang, Guoqin, Jia, Qingxuan, Chen, Gang, Li, Tong, Huang, Zeyuan, Lv, Zihang, Ji, Ning
Abstract
Vision-language models (VLMs) show promise for high-level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation: they cannot persistently track out-of-view states, causing world-state drift; and (2) opaque reasoning: failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising action proposal, world belief, and causal assumption, which is validated against DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM-DEWM on multi-station assembly, large-scale facility exploration, and real-robot recovery under induced failures. Compared to baseline memory-augmented VLM systems, VLM-DEWM improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM-DEWM as a verifiable and resilient solution for long-horizon robotic operations in dynamic manufacturing environments.
Chinese Translation
视觉-语言模型(VLM)在智能制造中的高层次规划中展现出良好的前景,但其在动态工作单元中的应用面临两个关键挑战:(1)无状态操作,无法持续跟踪视野外的状态,导致世界状态漂移;(2)推理不透明,故障难以诊断,导致昂贵的盲目重试。本文提出了VLM-DEWM,一种认知架构,通过持久的、可查询的动态外部世界模型(DEWM)将VLM推理与世界状态管理解耦。每个VLM决策被结构化为一个可外部化的推理轨迹(ERT),包括行动提案、世界信念和因果假设,在执行前需与DEWM进行验证。当发生故障时,预测状态与观察状态之间的差异分析使得能够进行针对性的恢复,而不是全局重新规划。我们在多站点装配、大规模设施探索和在诱发故障下的真实机器人恢复中评估了VLM-DEWM。与基线记忆增强VLM系统相比,VLM-DEWM将状态跟踪准确率从56%提高到93%,将恢复成功率从低于5%提升至95%,并通过结构化内存显著降低计算开销。这些结果确立了VLM-DEWM作为动态制造环境中长时间机器人操作的可验证和韧性解决方案。
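The pre-execution check described above can be sketched as: compare the world beliefs attached to a reasoning trace against the external world model, and surface every contradiction so recovery can target exactly the stale assumptions. The field names and the dictionary-based DEWM below are illustrative, not the paper's schema.

```python
# Hypothetical persistent world state playing the DEWM role.
world_model = {
    "gripper": "empty",
    "part_A": "on_tray",
    "station_2": "occupied",
}

def validate_ert(ert, state):
    """Return every (key, believed, actual) triple the world model contradicts."""
    violations = []
    for key, expected in ert["world_belief"].items():
        actual = state.get(key)
        if actual != expected:
            violations.append((key, expected, actual))
    return violations

# An Externalizable Reasoning Trace with stale beliefs (toy example).
ert = {
    "action": "place part_A at station_2",
    "world_belief": {"gripper": "holding_part_A", "station_2": "free"},
}
bad = validate_ert(ert, world_model)
print(bad)
# Both beliefs are stale -> targeted recovery (re-grasp, re-query the
# station) instead of blind global replanning.
```

Discrepancy analysis after a failure follows the same pattern, but diffs predicted against observed post-execution states.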
cs.RO / 20 / 2602.15567

Constraining Streaming Flow Models for Adapting Learned Robot Trajectory Distributions

约束流式流动模型以适应已学习的机器人轨迹分布
Long, Jieting, Liu, Dechuan, Cai, Weidong, Manchester, Ian, Zhi, Weiming
Abstract
Robot motion distributions often exhibit multi-modality and require flexible generative models for accurate representation. Streaming Flow Policies (SFPs) have recently emerged as a powerful paradigm for generating robot trajectories by integrating learned velocity fields directly in action space, enabling smooth and reactive control. However, existing formulations lack mechanisms for adapting trajectories post-training to enforce safety and task-specific constraints. We propose Constraint-Aware Streaming Flow (CASF), a framework that augments streaming flow policies with constraint-dependent metrics that reshape the learned velocity field during execution. CASF models each constraint, defined in either the robot's workspace or configuration space, as a differentiable distance function that is converted into a local metric and pulled back into the robot's control space. Far from restricted regions, the resulting metric reduces to the identity; near constraint boundaries, it smoothly attenuates or redirects motion, effectively deforming the underlying flow to maintain safety. This allows trajectories to be adapted in real time, ensuring that robot actions respect joint limits, avoid collisions, and remain within feasible workspaces, while preserving the multi-modal and reactive properties of streaming flow policies. We demonstrate CASF in simulated and real-world manipulation tasks, showing that it produces constraint-satisfying trajectories that remain smooth, feasible, and dynamically consistent, outperforming standard post-hoc projection baselines.
Chinese Translation
机器人运动分布通常表现出多模态特性,需要灵活的生成模型以实现准确表示。流动策略(Streaming Flow Policies, SFPs)最近作为一种强大的范式出现,通过直接在动作空间中整合学习到的速度场来生成机器人轨迹,从而实现平滑和反应式控制。然而,现有的公式缺乏在训练后适应轨迹的机制,以强制执行安全性和任务特定的约束。我们提出了约束感知流动(Constraint-Aware Streaming Flow, CASF),这是一个通过约束依赖度量增强流动策略的框架,在执行过程中重塑学习到的速度场。CASF将每个约束(无论是在机器人的工作空间还是配置空间中定义)建模为可微分的距离函数,该函数被转换为局部度量并拉回到机器人的控制空间。在远离受限区域时,所得度量退化为恒等度量;在约束边界附近,它平滑地衰减或重定向运动,有效地变形基础流动以保持安全。这使得轨迹能够实时适应,确保机器人动作遵循关节限制,避免碰撞,并保持在可行的工作空间内,同时保留流动策略的多模态和反应特性。我们在模拟和现实世界的操作任务中演示了CASF,显示其产生满足约束的轨迹,这些轨迹保持平滑、可行且动态一致,优于标准的事后投影基线。
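The constraint-dependent metric above has a simple 1-D caricature: a differentiable distance-to-constraint defines a scalar factor that leaves the learned velocity untouched far from the boundary and smoothly attenuates it nearby. The smoothstep profile and margin below are assumptions for illustration, not the paper's actual metric construction.

```python
import numpy as np

def attenuation(d, margin=0.2):
    """1 when d >= margin; smooth cubic falloff to 0 at the boundary d = 0."""
    t = np.clip(d / margin, 0.0, 1.0)
    return t * t * (3 - 2 * t)          # smoothstep: C1-smooth, monotone

def safe_velocity(x, v_learned, obstacle=1.0, margin=0.2):
    """Reshape a learned 1-D velocity by the constraint metric factor."""
    d = obstacle - x                    # signed distance to a wall at x = 1
    return attenuation(d, margin) * v_learned

print(safe_velocity(0.0, 1.0))    # far away: velocity passes through -> 1.0
print(safe_velocity(0.95, 1.0))   # near the boundary: strongly attenuated
print(safe_velocity(1.0, 1.0))    # at the boundary: 0.0
```

In the full method the metric is matrix-valued and pulled back into control space, so motion can be redirected along the boundary rather than only slowed; this scalar version shows only the identity-far / attenuate-near behavior.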
cs.RO / 21 / 2602.15608

Grip as Needed, Glide on Demand: Ultrasonic Lubrication for Robotic Locomotion

根据需要抓握,按需滑行:用于机器人运动的超声波润滑
Atalla, Mostafa A., van Bemmel, Daan, Cummings, Jack, Breedveld, Paul, Wiertlewski, Michaël, Sakes, Aimée
Abstract
Friction is the essential mediator of terrestrial locomotion, yet in robotic systems it is almost always treated as a passive property fixed by surface materials and conditions. Here, we introduce ultrasonic lubrication as a method to actively control friction in robotic locomotion. By exciting resonant structures at ultrasonic frequencies, contact interfaces can dynamically switch between "grip" and "slip" states, enabling locomotion. We developed two friction control modules, a cylindrical design for lumen-like environments and a flat-plate design for external surfaces, and integrated them into bio-inspired systems modeled after inchworm and wasp ovipositor locomotion. Both systems achieved bidirectional locomotion with nearly perfect locomotion efficiencies that exceeded 90%. Friction characterization experiments further demonstrated substantial friction reduction across various surfaces, including rigid, soft, granular, and biological tissue interfaces, under dry and wet conditions, and on surfaces with different levels of roughness, confirming the broad applicability of ultrasonic lubrication to locomotion tasks. These findings establish ultrasonic lubrication as a viable active friction control mechanism for robotic locomotion, with the potential to reduce design complexity and improve efficiency of robotic locomotion systems.
Chinese Translation
摩擦是地面运动的基本介质,但在机器人系统中,它几乎总是被视为由表面材料和条件固定的被动属性。在此,我们引入超声波润滑作为一种主动控制机器人运动中摩擦的方法。通过在超声波频率下激励共振结构,接触界面可以动态切换“抓握”和“滑动”状态,从而实现运动。我们开发了两种摩擦控制模块,一种为适用于管腔环境的圆柱形设计,另一种为适用于外部表面的平板设计,并将它们集成到模仿尺蠖和黄蜂产卵器运动的仿生系统中。这两种系统实现了双向运动,运动效率几乎完美,超过90%。摩擦特性实验进一步证明了在干燥和湿润条件下,以及在不同粗糙度的表面上,包括刚性、柔软、颗粒状和生物组织界面,摩擦显著降低,确认了超声波润滑在运动任务中的广泛适用性。这些发现确立了超声波润滑作为一种可行的主动摩擦控制机制,用于机器人运动,具有降低设计复杂性和提高机器人运动系统效率的潜力。
cs.RO / 22 / 2602.15633

SpecFuse: A Spectral-Temporal Fusion Predictive Control Framework for UAV Landing on Oscillating Marine Platforms

SpecFuse:一种用于无人机在振荡海洋平台上着陆的频谱-时间融合预测控制框架
Liu, Haichao, Hu, Yufeng, Wang, Shuang, Guo, Kangjun, Ma, Jun, Zhou, Jinni
Abstract
Autonomous landing of Uncrewed Aerial Vehicles (UAVs) on oscillating marine platforms is severely constrained by wave-induced multi-frequency oscillations, wind disturbances, and phase lags in motion prediction. Existing methods either treat platform motion as a general random process or lack explicit modeling of wave spectral characteristics, leading to suboptimal performance under dynamic sea conditions. To address these limitations, we propose SpecFuse: a novel spectral-temporal fusion predictive control framework that integrates frequency-domain wave decomposition with time-domain recursive state estimation for high-precision 6-DoF motion forecasting of Uncrewed Surface Vehicles (USVs). The framework explicitly models dominant wave harmonics to mitigate phase lags, refining predictions in real time via IMU data without relying on complex calibration. Additionally, we design a hierarchical control architecture featuring a sampling-based HPO-RRT* algorithm for dynamic trajectory planning under non-convex constraints and a learning-augmented predictive controller that fuses data-driven disturbance compensation with optimization-based execution. Extensive validations (2,000 simulations + 8 lake experiments) show our approach achieves a 3.2 cm prediction error, 4.46 cm landing deviation, 98.7% / 87.5% success rates (simulation / real-world), and 82 ms latency on embedded hardware, outperforming state-of-the-art methods by 44%-48% in accuracy. Its robustness to wave-wind coupling disturbances supports critical maritime missions such as search and rescue and environmental monitoring. All code, experimental configurations, and datasets will be released as open-source to facilitate reproducibility.
Chinese Translation
无人驾驶飞行器(UAV)在振荡海洋平台上的自主着陆受到波浪引起的多频振荡、风干扰和运动预测中的预测相位滞后等因素的严重制约。现有方法要么将平台运动视为一般随机过程,要么缺乏对波浪频谱特征的明确建模,导致在动态海洋条件下的性能不佳。为了解决这些局限性,我们提出了SpecFuse:一种新颖的频谱-时间融合预测控制框架,该框架将频域波浪分解与时域递归状态估计相结合,以实现无人水面载具(USV)的高精度6自由度运动预测。该框架明确建模主要波浪谐波,以减轻相位滞后,通过IMU数据实时优化预测,而无需依赖复杂的校准。此外,我们设计了一个分层控制架构,采用基于采样的HPO-RRT*算法进行非凸约束下的动态轨迹规划,并结合数据驱动的干扰补偿与基于优化的执行的学习增强型预测控制器。大量验证(2000次仿真 + 8次湖泊实验)表明,我们的方法在嵌入式硬件上实现了3.2厘米的预测误差、4.46厘米的着陆偏差、98.7% / 87.5%的成功率(仿真 / 现实世界)以及82毫秒的延迟,准确性比最先进的方法提高了44%-48%。其对波浪-风耦合干扰的鲁棒性支持了关键的海洋任务,如搜救和环境监测。所有代码、实验配置和数据集将作为开源发布,以促进可重复性。
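The spectral half of the framework can be illustrated with a single-harmonic toy: estimate the dominant wave component of a heave history via FFT, then extrapolate that sinusoid forward so the forecast leads rather than lags the platform. This is a noise-free, one-harmonic sketch; the paper fuses several harmonics with a recursive time-domain estimator fed by IMU data.

```python
import numpy as np

def dominant_harmonic(signal, dt):
    """Return (amplitude, frequency, phase) of the strongest FFT component."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, dt)
    k = np.argmax(np.abs(spec[1:])) + 1        # skip the DC bin
    amp = 2 * np.abs(spec[k]) / n              # rescale to signal amplitude
    phase = np.angle(spec[k])
    return amp, freqs[k], phase

def predict(amp, freq, phase, t):
    """Extrapolate the fitted harmonic to a future time t (no phase lag)."""
    return amp * np.cos(2 * np.pi * freq * t + phase)

dt = 0.05
t = np.arange(0, 10, dt)
heave = 0.3 * np.cos(2 * np.pi * 0.4 * t + 1.0)   # 0.4 Hz swell (toy)
amp, freq, phase = dominant_harmonic(heave, dt)
print(round(predict(amp, freq, phase, 10.5), 3))  # heave forecast 0.5 s ahead
```

Because the harmonic is extrapolated analytically, the 0.5 s look-ahead has no filter-induced delay, which is the phase-lag mitigation the abstract refers to.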
cs.RO / 23 / 2602.15642

Spatially-Aware Adaptive Trajectory Optimization with Controller-Guided Feedback for Autonomous Racing

具有控制器引导反馈的空间感知自适应轨迹优化用于自主赛车
Wachter, Alexander, Willert, Alexander, Ecker, Marc-Philip, Hartl-Nesic, Christian
Abstract
We present a closed-loop framework for autonomous raceline optimization that combines NURBS-based trajectory representation, CMA-ES global trajectory optimization, and controller-guided spatial feedback. Instead of treating tracking errors as transient disturbances, our method exploits them as informative signals of local track characteristics via a Kalman-inspired spatial update. This enables the construction of an adaptive, acceleration-based constraint map that iteratively refines trajectories toward near-optimal performance under spatially varying track and vehicle behavior. In simulation, our approach achieves a 17.38% lap time reduction compared to a controller parametrized with maximum static acceleration. On real hardware, tested with different tire compounds ranging from high to low friction, we obtain a 7.60% lap time improvement without explicitly parametrizing friction. This demonstrates robustness to changing grip conditions in real-world scenarios.
Chinese Translation
我们提出了一种闭环框架,用于自主赛车线路优化,该框架结合了基于NURBS的轨迹表示、CMA-ES全局轨迹优化和控制器引导的空间反馈。我们的方法并不将跟踪误差视为瞬态干扰,而是通过受卡尔曼滤波启发的空间更新将其作为局部赛道特征的信息信号。这使得构建一个自适应的基于加速度的约束图成为可能,该约束图能够迭代地优化轨迹,以实现空间变化的赛道和车辆行为下的近似最佳性能。在仿真中,我们的方法与参数化为最大静态加速度的控制器相比,实现了17.38%的圈速提升。在真实硬件上,使用不同的轮胎配方进行测试,从高摩擦到低摩擦,我们在没有明确参数化摩擦的情况下获得了7.60%的圈速改善。这证明了我们的方法在现实场景中对抓地条件变化的鲁棒性。
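The Kalman-inspired spatial update above can be caricatured per track segment: each segment keeps a scalar estimate of usable lateral acceleration, and observed tracking error nudges it down (big slip) or up (clean pass) with a variance-weighted gain. The gains and the error-to-measurement mapping below are illustrative assumptions, not the paper's formulation.

```python
def update_limit(a_est, var, error, error_tol=0.05, meas_var=0.5):
    """One Kalman-like scalar update of a segment's acceleration limit."""
    # Interpret tracking error as a noisy measurement of the safe limit:
    # error above tolerance -> the limit should drop, and vice versa.
    a_meas = a_est * (1.0 + 0.1 * (error_tol - error) / error_tol)
    gain = var / (var + meas_var)             # standard scalar Kalman gain
    a_new = a_est + gain * (a_meas - a_est)
    var_new = (1 - gain) * var                # posterior variance shrinks
    return a_new, var_new

a, var = 8.0, 1.0                             # m/s^2 prior for one segment
a, var = update_limit(a, var, error=0.20)     # large slip at this segment
print(round(a, 2))                            # limit shrinks below 8.0
a, var = update_limit(a, var, error=0.01)     # clean pass
print(round(a, 2))                            # limit creeps back up
```

Feeding the refined per-segment limits back into the CMA-ES raceline optimization is what closes the loop in the full framework.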
cs.RO / 24 / 2602.15684

Estimating Human Muscular Fatigue in Dynamic Collaborative Robotic Tasks with Learning-Based Models

基于学习模型的动态协作机器人任务中人类肌肉疲劳的估计
Kiki, Feras, Niaz, Pouya P., Madani, Alireza, Basdogan, Cagatay
Abstract
Assessing human muscle fatigue is critical for optimizing performance and safety in physical human-robot interaction (pHRI). This work presents a data-driven framework to estimate fatigue in dynamic, cyclic pHRI using arm-mounted surface electromyography (sEMG). Subject-specific machine-learning regression models (Random Forest, XGBoost, and Linear Regression) predict the fraction of cycles to fatigue (FCF) from three frequency-domain and one time-domain EMG features, and are benchmarked against a convolutional neural network (CNN) that ingests spectrograms of filtered EMG. Framing fatigue estimation as regression (rather than classification) captures continuous progression toward fatigue, supporting earlier detection, timely intervention, and adaptive robot control. In experiments with ten participants, a collaborative robot under admittance control guided repetitive lateral (left-right) end-effector motions until muscular fatigue. Average FCF RMSE across participants was 20.8±4.3% for the CNN, 23.3±3.8% for Random Forest, 24.8±4.5% for XGBoost, and 26.9±6.1% for Linear Regression. To probe cross-task generalization, one participant additionally performed unseen vertical (up-down) and circular repetitions; models trained only on lateral data were tested directly and largely retained accuracy, indicating robustness to changes in movement direction, arm kinematics, and muscle recruitment, while Linear Regression deteriorated. Overall, the study shows that both feature-based ML and spectrogram-based DL can estimate remaining work capacity during repetitive pHRI, with the CNN delivering the lowest error and the tree-based models close behind. The reported transfer to new motion patterns suggests potential for practical fatigue monitoring without retraining for every task, improving operator protection and enabling fatigue-aware shared autonomy for safer, fatigue-adaptive pHRI control.
Chinese Translation
评估人类肌肉疲劳对于优化物理人机交互(pHRI)的性能和安全至关重要。本研究提出了一种数据驱动框架,通过臂部表面肌电图(sEMG)估计动态循环pHRI中的疲劳。针对特定受试者的机器学习回归模型(随机森林、XGBoost和线性回归)利用三个频域和一个时域的肌电特征预测疲劳周期占比(FCF),并与输入滤波肌电图谱的卷积神经网络(CNN)进行了基准比较。将疲劳估计框架设定为回归(而非分类)能够捕捉疲劳的连续进展,支持更早的检测、及时的干预和自适应机器人控制。在十名参与者的实验中,一台在导纳控制下的协作机器人引导重复的横向(左右)末端执行器运动,直到肌肉疲劳。参与者的平均FCF均方根误差(RMSE)为CNN 20.8±4.3%、随机森林23.3±3.8%、XGBoost 24.8±4.5%和线性回归26.9±6.1%。为了探讨跨任务的泛化能力,一名参与者还进行了未见的垂直(上下)和圆形重复运动;仅在横向数据上训练的模型被直接测试,且大部分保持了准确性,表明对运动方向、手臂运动学和肌肉招募变化的鲁棒性,而线性回归则表现不佳。总体而言,研究表明基于特征的机器学习和基于谱图的深度学习均可以在重复的pHRI中估计剩余工作能力,其中CNN提供了最低的误差,而树模型紧随其后。所报告的向新运动模式的迁移表明,有望在不为每个任务重新训练的情况下实现实用的疲劳监测,从而提高操作员保护并实现疲劳感知的共享自主性,以更安全的疲劳自适应pHRI控制。
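One frequency-domain feature commonly used for sEMG fatigue tracking is the median frequency of the power spectrum, which shifts downward as a muscle fatigues. The abstract does not name its exact features beyond "three frequency-domain and one time-domain", so this is a generic sketch using two pure tones as stand-ins for fresh and fatigued EMG.

```python
import numpy as np

def median_frequency(signal, fs):
    """Frequency that splits the EMG power spectrum into equal halves."""
    spec = np.abs(np.fft.rfft(signal)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    cum = np.cumsum(spec)
    return float(freqs[np.searchsorted(cum, cum[-1] / 2.0)])

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
fresh = np.sin(2 * np.pi * 120 * t)      # higher-frequency content
fatigued = np.sin(2 * np.pi * 60 * t)    # spectrum compressed downward
print(median_frequency(fresh, fs))       # 120.0
print(median_frequency(fatigued, fs))    # 60.0
```

In a regression setup like the paper's, features of this kind computed per window would form the input from which the fraction of cycles to fatigue is predicted.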
cs.RO / 25 / 2602.15721

Lifelong Scalable Multi-Agent Realistic Testbed and A Comprehensive Study on Design Choices in Lifelong AGV Fleet Management Systems

可扩展的终身多智能体真实测试平台及终身AGV车队管理系统设计选择的综合研究
Yan, Jingtian, Zhang, Yulun, Liu, Zhenting, Zhang, Han, Jiang, He, Chen, Jingkai, Smith, Stephen F., Li, Jiaoyang
Abstract
We present Lifelong Scalable Multi-Agent Realistic Testbed (LSMART), an open-source simulator to evaluate any Multi-Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automated Guided Vehicles (AGVs). MAPF aims to move a group of agents from their corresponding starting locations to their goals. Lifelong MAPF (LMAPF) is a variant of MAPF that continuously assigns new goals for agents to reach. LMAPF applications, such as autonomous warehouses, often require a centralized, lifelong system to coordinate the movement of a fleet of robots, typically AGVs. However, existing works on MAPF and LMAPF often assume simplified kinodynamic models, such as pebble motion, as well as perfect execution and communication for AGVs. Prior work presented SMART, a simulator capable of evaluating any MAPF algorithm while considering agent kinodynamics, communication delays, and execution uncertainties. However, SMART is designed for MAPF, not LMAPF. Generalizing SMART to an FMS requires many more design choices. First, an FMS parallelizes planning and execution, raising the question of when to plan. Second, given planners with varying optimality and differing agent-model assumptions, one must decide how to plan. Third, when the planner fails to return valid solutions, the system must determine how to recover. In this paper, we first present LSMART, an open-source simulator that incorporates all these considerations to evaluate any MAPF algorithm in an FMS. We then provide experiment results based on state-of-the-art methods for each design choice, offering guidance on how to effectively design centralized lifelong AGV Fleet Management Systems. LSMART is available at https://smart-mapf.github.io/lifelong-smart.
Chinese Translation
我们提出了可扩展的终身多智能体真实测试平台(Lifelong Scalable Multi-Agent Realistic Testbed, LSMART),这是一个开源模拟器,用于评估任何多智能体路径寻找(Multi-Agent Path Finding, MAPF)算法在自动导引车(Automated Guided Vehicles, AGVs)车队管理系统(Fleet Management System, FMS)中的表现。MAPF旨在将一组智能体从其对应的起始位置移动到目标位置。终身MAPF(Lifelong MAPF, LMAPF)是MAPF的一种变体,它持续为智能体分配新的目标。LMAPF的应用,如自主仓库,通常需要一个集中式的终身系统来协调一组机器人的移动,通常是AGVs。然而,现有的MAPF和LMAPF研究往往假设简化的运动动力学模型,如卵石运动(pebble motion),以及AGVs的完美执行和通信。之前的研究提出了SMART,一种能够评估任何MAPF算法的软件,同时考虑智能体的运动动力学、通信延迟和执行不确定性。然而,SMART是为MAPF设计的,而不是LMAPF。将SMART推广到FMS需要更多的设计选择。首先,FMS实现了规划与执行的并行化,这引发了何时规划的问题。其次,考虑到具有不同最优性和不同智能体模型假设的规划者,必须决定如何进行规划。第三,当规划者未能返回有效解决方案时,系统必须确定如何恢复。在本文中,我们首先介绍了LSMART,一个开源模拟器,结合了所有这些考虑因素,以评估任何MAPF算法在FMS中的表现。然后,我们提供了基于每个设计选择的最先进方法的实验结果,为如何有效设计集中式终身AGV车队管理系统提供指导。LSMART可在 https://smart-mapf.github.io/lifelong-smart 获取。
cs.RO / 26 / 2602.15733

MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction

MeshMimic:通过3D场景重建实现几何感知的人形运动学习
Zhang, Qiang, Ma, Jiahao, Liu, Peiran, Shi, Shuai, Su, Zeran, Wang, Zifan, Sun, Jingkai, Cui, Wei, Yu, Jialin, Han, Gang, Zhao, Wen, Sun, Pihai, Yin, Kangning, Wang, Jiaxu, Cao, Jiahang, Zhang, Lingfeng, Cheng, Hao, Hao, Xiaoshuai, Ji, Yiding, Liang, Junwei, Tang, Jian, Xu, Renjing, Guo, Yijie
Abstract
Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
Chinese Translation
近年来,人形运动控制取得了显著突破,深度强化学习(RL)成为实现复杂人类行为的主要催化剂。然而,人形机器人的高维度和复杂动态使得手动运动设计变得不切实际,导致对昂贵的运动捕捉(MoCap)数据的高度依赖。这些数据集不仅获取成本高,而且常常缺乏周围物理环境所需的几何上下文。因此,现有的运动合成框架往往面临运动与场景的解耦,导致在地形感知任务中出现接触滑移或网格穿透等物理不一致性。在本研究中,我们提出了MeshMimic,这一创新框架将3D场景重建与具身智能相结合,使人形机器人能够直接从视频中学习耦合的“运动-地形”交互。通过利用最先进的3D视觉模型,我们的框架精确地分割和重建人类轨迹以及地形和物体的基础3D几何结构。我们引入了一种基于运动学一致性的优化算法,从噪声视觉重建中提取高质量的运动数据,并提出了一种接触不变的重定向方法,将人类与环境交互特征转移到人形代理上。实验结果表明,MeshMimic在多样且具有挑战性的地形上实现了稳健且高度动态的性能。我们的方法证明,仅利用消费级单目传感器的低成本管道可以促进复杂物理交互的训练,为人形机器人在非结构化环境中的自主演化提供了一条可扩展的路径。
cs.RO / 27 / 2602.15767

Robot-Assisted Social Dining as a White Glove Service

机器人辅助社交用餐作为一种白手套服务
Kashyap, Atharva S, Morkute, Ugne Aleksandra, Alves-Oliveira, Patricia
Abstract
Robot-assisted feeding enables people with disabilities who require assistance eating to enjoy a meal independently and with dignity. However, existing systems have only been tested in-lab or in-home, leaving in-the-wild social dining contexts (e.g., restaurants) largely unexplored. Designing a robot for such contexts presents unique challenges, such as dynamic and unsupervised dining environments that a robot needs to account for and respond to. Through speculative participatory design with people with disabilities, supported by semi-structured interviews and a custom AI-based visual storyboarding tool, we uncovered ideal scenarios for in-the-wild social dining. Our key insight suggests that such systems should: embody the principles of a white glove service where the robot (1) supports multimodal inputs and unobtrusive outputs; (2) has contextually sensitive social behavior and prioritizes the user; (3) has expanded roles beyond feeding; (4) adapts to other relationships at the dining table. Our work has implications for in-the-wild and group contexts of robot-assisted feeding.
Chinese Translation
机器人辅助喂食使需要帮助进食的残疾人士能够独立且有尊严地享用餐食。然而,现有系统仅在实验室或家庭环境中进行测试,尚未在真实社交用餐场景(例如餐厅)中进行探索。在此类环境中设计机器人面临独特挑战,例如动态和无监督的用餐环境,机器人需要考虑并做出响应。通过与残疾人士进行思辨性参与式设计,辅以半结构化访谈和定制的基于人工智能的视觉故事板工具,我们揭示了真实社交用餐的理想场景。我们的关键见解表明,这类系统应当:体现白手套服务的原则,其中机器人(1)支持多模态输入和不干扰的输出;(2)具有上下文敏感的社交行为并优先考虑用户;(3)扩展角色超越喂食;(4)适应餐桌上的其他关系。我们的研究对机器人辅助喂食的真实和群体环境具有重要意义。
cs.RO / 28 / 2602.15813

FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy

FAST-EQA:具有全局和局部区域相关性的高效具身问答
Zhang, Haochen, Savaliya, Nirav, Siddiqui, Faizan, Sachdeva, Enna
Abstract
Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference time during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region-target hypotheses and updates them online, enabling robust handling of both single and multi-target questions without unbounded growth. To expand coverage efficiently, a global exploration policy treats narrow openings and doors as high-value frontiers, complementing local target seeking with minimal computation. Together, these components focus the agent's attention, expand scene coverage, and improve answer reliability while running substantially faster than prior approaches. On HMEQA and EXPRESS-Bench, FAST-EQA achieves state-of-the-art performance, while performing competitively on OpenEQA and MT-HM3D.
Chinese Translation
具身问答(EQA)结合了视觉场景理解、目标导向探索、空间和时间推理,且在部分可观察性下进行。一个核心挑战是将物理搜索限制在与问题相关的子空间内,同时保持对观察结果的紧凑、可操作的记忆。此外,在实际部署中,探索过程中的快速推理时间至关重要。我们提出了FAST-EQA,一个基于问题条件的框架,它(i)识别可能的视觉目标,(ii)对全局感兴趣区域进行评分以指导导航,以及(iii)在视觉记忆上采用思维链(Chain-of-Thought, CoT)推理以自信地回答。FAST-EQA维持一个有限的场景记忆,存储固定容量的区域目标假设集并在线更新,从而能够稳健地处理单一和多目标问题,而不会出现无限增长。为了有效扩展覆盖范围,全局探索策略将狭窄开口和门视为高价值前沿,补充了局部目标寻求,且计算量最小。这些组件共同聚焦于智能体的注意力,提高场景覆盖率,并提升答案的可靠性,同时运行速度显著快于先前的方法。在HMEQA和EXPRESS-Bench上,FAST-EQA达到了最先进的性能,同时在OpenEQA和MT-HM3D上表现出竞争力。
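The bounded scene memory above is essentially a fixed-capacity set of scored hypotheses where a new entry evicts the weakest one instead of growing the store. A minimal sketch using a min-heap keyed on relevance; the capacity, scores, and hypothesis names are illustrative.

```python
import heapq

class BoundedMemory:
    """Fixed-capacity hypothesis store: inserting evicts the weakest entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []            # min-heap of (score, hypothesis)

    def insert(self, score, hypothesis):
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (score, hypothesis))
        elif score > self.heap[0][0]:
            # New hypothesis outranks the current weakest: replace it.
            heapq.heapreplace(self.heap, (score, hypothesis))

    def contents(self):
        return sorted(self.heap, reverse=True)

mem = BoundedMemory(capacity=3)
for score, hyp in [(0.2, "sofa@livingroom"), (0.9, "mug@kitchen"),
                   (0.5, "lamp@bedroom"), (0.7, "sink@kitchen")]:
    mem.insert(score, hyp)
print(mem.contents())
# capacity stays 3: the weakest hypothesis (0.2) has been evicted
```

Keeping the store bounded is what prevents memory (and downstream reasoning cost) from growing with episode length.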
cs.RO / 29 / 2602.15827

Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching

感知型类人跑酷:通过运动匹配链式组合动态人类技能
Wu, Zhen, Huang, Xiaoyu, Yang, Lujie, Zhang, Yuanhang, Sreenath, Koushil, Chen, Xi, Abbeel, Pieter, Duan, Rocky, Kanazawa, Angjoo, Sferrazza, Carmelo, Shi, Guanya, Liu, C. Karen
Abstract
While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
Chinese Translation
尽管近年来类人运动的进展已实现了在多种地形上稳定行走,但捕捉高度动态人类动作的灵活性和适应性仍然是一个开放的挑战。尤其是在复杂环境中的灵活跑酷,不仅需要低层次的鲁棒性,还需要类人运动的表现力、长时间技能组合和基于感知的决策能力。本文提出了感知型类人跑酷(Perceptive Humanoid Parkour,PHP),这是一个模块化框架,使类人机器人能够自主地在具有挑战性的障碍场地中执行基于视觉的长时间跑酷。我们的方法首先利用运动匹配,将其表述为特征空间中的最近邻搜索,以将重新定向的原子人类技能组合成长时间的运动轨迹。该框架能够灵活组合复杂的技能链并实现平滑过渡,同时保持动态人类动作的优雅和流畅性。接下来,我们为这些组合动作训练运动跟踪强化学习(RL)专家策略,并使用DAgger和RL的结合将其提炼为单一的基于深度的多技能学生策略。关键是,感知与技能组合的结合使得自主的、上下文感知的决策成为可能:机器人仅使用机载深度传感器和离散的二维速度命令,选择并执行跨越、攀爬、翻越或滚落不同几何形状和高度的障碍物。我们通过在Unitree G1类人机器人上进行广泛的真实世界实验验证了我们的框架,展示了高度动态的跑酷技能,例如攀爬高达1.25米(占机器人高度的96%)的障碍物,以及对实时障碍扰动的闭环适应的长时间多障碍物穿越。
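Motion matching as nearest-neighbor search, as described above, reduces to: build a feature vector from the current state and obstacle context, then pick the library clip whose feature is closest. The 4-D feature, the tiny skill library, and the query values below are toy stand-ins for the paper's retargeted skill database.

```python
import numpy as np

# Hypothetical per-skill feature vectors, e.g. (speed, lateral vel,
# obstacle height, obstacle depth) -- purely illustrative.
library = {
    "step_over": np.array([0.1, 0.0, 0.3, 0.0]),
    "climb":     np.array([0.0, 0.2, 1.0, 0.5]),
    "vault":     np.array([0.5, 0.1, 0.8, 0.9]),
    "roll_off":  np.array([0.3, 0.4, 0.2, 0.1]),
}

def motion_match(query):
    """Return the library skill with minimum feature distance to the query."""
    return min(library, key=lambda k: np.linalg.norm(library[k] - query))

query = np.array([0.05, 0.1, 1.1, 0.45])   # tall obstacle ahead, low speed
print(motion_match(query))                  # climb
```

Chaining comes from repeating this lookup as the state evolves: each match selects the next clip frame, and blending between consecutive matches keeps transitions smooth.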
cs.RO / 30 / 2602.15828

Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation

Dex4D:任务无关的点轨迹策略用于仿真到现实的灵巧操控
Kuang, Yuxuan, Park, Sungjae, Fragkiadaki, Katerina, Tulsiani, Shubham
Abstract
Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.
Chinese Translation
学习能够完成众多日常任务的通用策略仍然是灵巧操控中的一个开放挑战。特别是,通过真实世界的遥操作收集大规模操控数据既昂贵又难以扩展。虽然在仿真中学习提供了一种可行的替代方案,但设计多个特定任务的环境和奖励以进行训练同样具有挑战性。我们提出了Dex4D,一个框架,它利用仿真来学习任务无关的灵巧技能,这些技能可以灵活地重新组合以执行多样的现实世界操控任务。具体而言,Dex4D学习了一种领域无关的3D点轨迹条件策略,能够将任何物体操控到任何所需姿态。我们在仿真中训练这种“任意姿态到任意姿态”的策略,涵盖数千个具有不同姿态配置的物体,覆盖了广泛的机器人-物体交互空间,这些交互可以在测试时组合。在部署时,该策略无需微调即可零样本迁移到现实世界任务,只需通过从生成的视频中提取的期望物体中心点轨迹进行提示。在执行过程中,Dex4D使用在线点跟踪进行闭环感知和控制。在仿真和真实机器人上的大量实验表明,我们的方法使得多样灵巧操控任务的零样本部署成为可能,并相较于先前基线取得了一致的改进。此外,我们展示了对新物体、场景布局、背景和轨迹的强泛化能力,突显了所提框架的鲁棒性和可扩展性。
计算机视觉 (Computer Vision)
43
cs.CV / 1 / 2602.15072

GRAFNet: Multiscale Retinal Processing via Guided Cortical Attention Feedback for Enhancing Medical Image Polyp Segmentation

GRAFNet:通过引导皮层注意反馈进行多尺度视网膜处理,以增强医学图像息肉分割
Fofanah, Abdul Joseph, Wen, Lian, Kamara, Alpha Alimamy, Zhang, Zhongyi, Chen, David, Sankoh, Albert Patrick
Abstract
Accurate polyp segmentation in colonoscopy is essential for cancer prevention but remains challenging due to: (1) high morphological variability (from flat to protruding lesions), (2) strong visual similarity to normal structures such as folds and vessels, and (3) the need for robust multi-scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi-scale fusion, and the absence of anatomical constraints, often leading to false positives (over-segmentation of normal structures) and false negatives (missed subtle flat lesions). We propose GRAFNet, a biologically inspired architecture that emulates the hierarchical organisation of the human visual system. GRAFNet integrates three key modules: (1) a Guided Asymmetric Attention Module (GAAM) that mimics orientation-tuned cortical neurones to emphasise polyp boundaries, (2) a MultiScale Retinal Module (MSRM) that replicates retinal ganglion cell pathways for parallel multi-feature analysis, and (3) a Guided Cortical Attention Feedback Module (GCAFM) that applies predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) that enforces spatial-semantic consistency via resolution-adaptive feedback. Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) demonstrate consistent state-of-the-art performance, with 3-8% Dice improvements and 10-20% higher generalisation over leading methods, while offering interpretable decision pathways. This work establishes a paradigm in which neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning. Code is available at https://github.com/afofanah/GRAFNet.
Chinese Translation
在结肠镜检查中,准确的息肉分割对于癌症预防至关重要,但由于以下原因仍然具有挑战性:(1) 高形态变异性(从平坦到突出的病变),(2) 与正常结构(如褶皱和血管)之间的强视觉相似性,以及 (3) 需要稳健的多尺度检测。现有的深度学习方法存在单向处理、多尺度融合弱以及缺乏解剖约束等问题,常常导致假阳性(正常结构的过度分割)和假阴性(漏检细微的平坦病变)。我们提出了GRAFNet,这是一种生物启发的架构,模拟人类视觉系统的层次组织。GRAFNet集成了三个关键模块:(1) 引导非对称注意模块(GAAM),模仿方向调谐的皮层神经元以强调息肉边界;(2) 多尺度视网膜模块(MSRM),复制视网膜神经节细胞通路以进行并行多特征分析;(3) 引导皮层注意反馈模块(GCAFM),应用预测编码进行迭代精细化。这些模块统一在一个息肉编码器-解码器模块(PEDM)中,通过分辨率自适应反馈强制空间-语义一致性。在五个公共基准(Kvasir-SEG、CVC-300、CVC-ColonDB、CVC-Clinic和PolypGen)上进行的大量实验表明,GRAFNet在性能上始终保持领先,Dice系数提高了3-8%,在泛化能力上比领先方法高出10-20%,同时提供了可解释的决策路径。这项工作建立了一个范式,其中神经计算原则弥合了人工智能准确性与临床可信推理之间的差距。代码可在 https://github.com/afofanah/GRAFNet 获取。
cs.CV / 2 / 2602.15124

Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

利用基于MLLM的检测器无关交互识别实现零样本人-物体交互检测
Xuan, Shiyu, Wang, Dongkai, Li, Zechao, Tang, Jinhui
Abstract
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detectors without retraining. The codes are publicly available at https://github.com/SY-Xuan/DA-HOI.
Chinese Translation
零样本人-物体交互(HOI)检测旨在定位图像中的人类和物体,并识别它们的交互。尽管开放词汇物体检测的进展为物体定位提供了有希望的解决方案,但由于交互的组合多样性,交互识别(IR)仍然具有挑战性。现有方法,包括两阶段方法,将IR与特定检测器紧密耦合,并依赖粗粒度的视觉-语言模型(VLM)特征,这限制了对未见交互的泛化。在本研究中,我们提出了一种解耦框架,将物体检测与IR分离,并利用多模态大型语言模型(MLLMs)进行零样本IR。我们引入了一种确定性生成方法,将IR形式化为视觉问答任务,并强制输出确定性结果,从而实现免训练的零样本IR。为了通过微调模型进一步提升性能和效率,我们设计了一种整合外观和成对空间线索的空间感知池化模块,以及一种在单次前向传播中预测所有候选交互的一次性确定性匹配方法。在HICO-DET和V-COCO上的大量实验表明,我们的方法实现了卓越的零样本性能和强大的跨数据集泛化能力,并且能够灵活地与任何物体检测器集成而无需重新训练。代码已公开在 https://github.com/SY-Xuan/DA-HOI。
cs.CV / 3 / 2602.15138

MB-DSMIL-CL-PL: Scalable Weakly Supervised Ovarian Cancer Subtype Classification and Localisation Using Contrastive and Prototype Learning with Frozen Patch Features

MB-DSMIL-CL-PL:利用冻结补丁特征、结合对比学习与原型学习的可扩展弱监督卵巢癌亚型分类与定位
Jenkins, Marcus, Mazibrada, Jasenka, Leahu, Bogdan, Mackiewicz, Michal
Abstract
The study of histopathological subtypes is valuable for the personalisation of effective treatment strategies for ovarian cancer. However, increasing diagnostic workloads present a challenge for UK pathology departments, leading to the rise in AI approaches. While traditional approaches in this field have relied on pre-computed, frozen image features, recent advances have shifted towards end-to-end feature extraction, providing an improvement in accuracy but at the expense of significantly reduced scalability during training and time-consuming experimentation. In this paper, we propose a new approach for subtype classification and localisation in ovarian cancer histopathology images using contrastive and prototype learning with pre-computed, frozen features via feature-space augmentations. Compared to DSMIL, our method achieves an improvement of 70.4\% and 15.3\% in F1 score for instance- and slide-level classification, respectively, along with AUC gains of 16.9\% for instance localisation and 2.3\% for slide classification, while maintaining the use of frozen patch features.
Chinese Translation
研究组织病理学亚型对于个性化有效的卵巢癌治疗策略具有重要价值。然而,日益增加的诊断工作负担对英国病理部门构成了挑战,促使人工智能方法的兴起。尽管该领域的传统方法依赖于预计算的冻结图像特征,但最近的进展已转向端到端特征提取,虽然提高了准确性,但在训练期间显著降低了可扩展性,并导致实验耗时。在本文中,我们提出了一种新的方法,通过特征空间增强,使用预计算的冻结特征,结合对比学习和原型学习,对卵巢癌组织病理图像进行亚型分类和定位。与DSMIL相比,我们的方法在实例级和切片级分类的F1分数上分别提高了70.4%和15.3%,同时在实例定位和切片分类上分别获得了16.9%和2.3%的AUC提升,并保持了对冻结补丁特征的使用。
cs.CV / 4 / 2602.15154

Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories

损失最为明了:通过损失轨迹检测视频中的标注错误
Alwis, Praditha, Chandra, Soumyadeep, Ravikumar, Deepak, Roy, Kaushik
Abstract
High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL)--defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
Chinese Translation
高质量的视频数据集是训练鲁棒模型在动作识别、阶段检测和事件分割等任务中的基础。然而,许多现实世界的视频数据集存在标注错误,例如*错误标记*,即片段被分配了错误的类别标签,以及*顺序混乱*,即时间序列未按照正确的进程进行。这些错误在阶段标注任务中尤其有害,因为时间一致性至关重要。我们提出了一种新颖的模型无关方法,通过分析累积样本损失(Cumulative Sample Loss, CSL)来检测标注错误——CSL被定义为在训练周期中通过保存的模型检查点时,每帧所产生的平均损失。这种逐帧损失轨迹充当了帧级可学习性的动态指纹。错误标记或顺序混乱的帧往往表现出持续较高或不规则的损失模式,因为它们在整个训练过程中对模型来说难以学习,而正确标记的帧通常会较早收敛到低损失。为了计算CSL,我们训练了一个视频分割模型,并在每个周期存储其权重。这些检查点随后用于评估测试视频中每帧的损失。持续高CSL的帧被标记为可能的标注错误候选,包括错误标记或时间对齐错误。我们的方法不需要关于标注错误的真实标签,并且可以在不同数据集上推广。对EgoPER和Cholec80的实验展示了强大的检测性能,有效识别出微妙的不一致性,如错误标记和帧顺序混乱。所提出的方法为数据集审计和提高基于视频的机器学习训练可靠性提供了强有力的工具。
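The CSL computation defined in the abstract (a frame's average loss across the checkpoints saved at each epoch, with persistently high values flagged) reduces to a small routine; the loss values and threshold below are illustrative assumptions:

```python
def cumulative_sample_loss(per_epoch_losses):
    """per_epoch_losses[e][f] is the loss of frame f under the checkpoint
    saved at epoch e; a frame's CSL is its mean loss across checkpoints."""
    n_epochs = len(per_epoch_losses)
    n_frames = len(per_epoch_losses[0])
    return [sum(per_epoch_losses[e][f] for e in range(n_epochs)) / n_epochs
            for f in range(n_frames)]

def flag_suspect_frames(csl, threshold):
    """Frames whose CSL stays high never became learnable: likely
    mislabeled or temporally disordered annotations."""
    return [f for f, v in enumerate(csl) if v > threshold]

# Toy trajectory: frames 0-1 converge, frame 2 never does (possible mislabel).
losses = [[0.9, 0.8, 1.2],   # checkpoint after epoch 1
          [0.4, 0.3, 1.1],   # checkpoint after epoch 2
          [0.1, 0.1, 1.3]]   # checkpoint after epoch 3
csl = cumulative_sample_loss(losses)
suspects = flag_suspect_frames(csl, threshold=1.0)  # → [2]
```

Correctly labeled frames converge early, so their trajectory is dominated by low losses; a mislabeled frame's trajectory never drops, which is exactly what the mean captures.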
cs.CV / 5 / 2602.15167

Distributional Deep Learning for Super-Resolution of 4D Flow MRI under Domain Shift

领域偏移下4D血流MRI超分辨率的分布型深度学习
Wen, Xiaoyi, Jiang, Fei
Abstract
Super-resolution is widely used in medical imaging to enhance low-quality data, reducing scan time and improving abnormality detection. Conventional super-resolution approaches typically rely on paired datasets of downsampled and original high resolution images, training models to reconstruct high resolution images from their artificially degraded counterparts. However, in real-world clinical settings, low resolution data often arise from acquisition mechanisms that differ significantly from simple downsampling. As a result, these inputs may lie outside the domain of the training data, leading to poor model generalization due to domain shift. To address this limitation, we propose a distributional deep learning framework that improves model robustness and domain generalization. We develop this approach for enhancing the resolution of 4D Flow MRI (4DF). This is a novel imaging modality that captures hemodynamic flow velocity and clinically relevant metrics such as vessel wall stress. These metrics are critical for assessing aneurysm rupture risk. Our model is initially trained on high resolution computational fluid dynamics (CFD) simulations and their downsampled counterparts. It is then fine-tuned on a small, harmonized dataset of paired 4D Flow MRI and CFD samples. We derive the theoretical properties of our distributional estimators and demonstrate that our framework significantly outperforms traditional deep learning approaches through real data applications. This highlights the effectiveness of distributional learning in addressing domain shift and improving super-resolution performance in clinically realistic scenarios.
Chinese Translation
超分辨率在医学成像中被广泛应用于增强低质量数据,减少扫描时间并改善异常检测。传统的超分辨率方法通常依赖于成对的下采样和原始高分辨率图像数据集,训练模型从其人工降质的对应图像中重建高分辨率图像。然而,在真实的临床环境中,低分辨率数据往往源自与简单下采样显著不同的采集机制。因此,这些输入可能超出了训练数据的领域,导致由于领域偏移而出现较差的模型泛化能力。为了解决这一限制,我们提出了一种分布型深度学习框架,以提高模型的鲁棒性和领域泛化能力。我们针对增强4D血流MRI(4DF)的分辨率开发了这一方法。这是一种新颖的成像模态,能够捕捉血流动力学流速以及血管壁应力等临床相关指标。这些指标对于评估动脉瘤破裂风险至关重要。我们的模型最初在高分辨率的计算流体动力学(CFD)模拟及其下采样对应物上进行训练,随后在一个小规模、经过协调统一的成对4D血流MRI与CFD样本数据集上进行微调。我们推导了分布型估计器的理论性质,并通过真实数据应用证明我们的框架显著优于传统的深度学习方法。这突显了分布型学习在应对领域偏移和改善临床现实场景中超分辨率性能方面的有效性。
cs.CV / 6 / 2602.15181

Time-Archival Camera Virtualization for Sports and Visual Performances

体育和视觉表演的时间归档相机虚拟化
Zhang, Yunxiao, Stone, William, Kumar, Suryansh
Abstract
Camera virtualization -- an emerging solution to novel view synthesis -- holds transformative potential for visual entertainment, live performances, and sports broadcasting by enabling the generation of photorealistic images from novel viewpoints using images from a limited set of calibrated multiple static physical cameras. Despite recent advances, achieving spatially and temporally coherent and photorealistic rendering of dynamic scenes with efficient time-archival capabilities, particularly in fast-paced sports and stage performances, remains challenging for existing approaches. Recent methods based on 3D Gaussian Splatting (3DGS) for dynamic scenes could offer real-time view-synthesis results. Yet, they are hindered by their dependence on accurate 3D point clouds from the structure-from-motion method and their inability to handle large, non-rigid, rapid motions of different subjects (e.g., flips, jumps, articulations, sudden player-to-player transitions). Moreover, independent motions of multiple subjects can break the Gaussian-tracking assumptions commonly used in 4DGS, ST-GS, and other dynamic splatting variants. This paper advocates reconsidering a neural volume rendering formulation for camera virtualization and efficient time-archival capabilities, making it useful for sports broadcasting and related applications. By modeling a dynamic scene as rigid transformations across multiple synchronized camera views at a given time, our method performs neural representation learning, providing enhanced visual rendering quality at test time. A key contribution of our approach is its support for time-archival, i.e., users can revisit any past temporal instance of a dynamic scene and can perform novel view synthesis, enabling retrospective rendering for replay, analysis, and archival of live events, a functionality absent in existing neural rendering approaches and novel view synthesis...
Chinese Translation
相机虚拟化——一种新兴的新视图合成解决方案——通过利用来自有限数量经过校准的静态物理相机的图像,从新视角生成照片级真实感图像,对视觉娱乐、现场表演和体育广播具有变革性潜力。尽管近期取得了一些进展,但对于现有方法而言,实现动态场景在空间和时间上一致且照片级真实感的渲染,并具备高效的时间归档能力,在快节奏的体育和舞台表演中尤其仍然具有挑战性。近期基于3D高斯喷溅(3D Gaussian Splatting, 3DGS)的动态场景方法能够提供实时视图合成结果,但它们受限于对运动恢复结构(structure-from-motion)方法所得的准确3D点云的依赖,以及无法处理不同主体的大幅度、非刚性、快速运动(例如翻转、跳跃、关节运动、球员之间的突然转换)。此外,多个主体的独立运动可能会破坏4DGS、ST-GS及其他动态喷溅变体中常用的高斯跟踪假设。本文提倡重新考虑一种神经体渲染公式,以实现相机虚拟化和高效的时间归档能力,使其适用于体育广播及相关应用。通过将动态场景建模为在给定时间内多个同步相机视角之间的刚性变换,我们的方法执行神经表示学习,在测试时提供增强的视觉渲染质量。我们方法的一个关键贡献是其对时间归档的支持,即用户可以重新访问动态场景的任何过去时间实例并进行新视图合成,从而实现对现场事件的回放、分析和归档的回顾性渲染,这在现有的神经渲染方法和新视图合成中是缺失的功能。
cs.CV / 7 / 2602.15257

How to Train Your Long-Context Visual Document Model

如何训练您的长上下文视觉文档模型
Veselka, Austin
Abstract
We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.
Chinese Translation
我们呈现了首个全面的大规模研究,探讨了训练长上下文视觉语言模型(长达344K上下文),目标是针对长文档的视觉问答,并测量其在长上下文文本中的迁移效果。尽管一些强大的模型,如Qwen3 VL和GLM 4.5/6V是开放权重的,但它们的训练方案和数据管道并不可重复。我们系统地研究了针对24B和32B参数模型的持续预训练、监督微调和偏好优化,借助广泛的长上下文评估和消融实验来弥补这一差距,并在MMLongBenchDoc上实现了两种参数规模的最新性能。此外,我们的关键发现包括:(i)在与评估上下文长度匹配的上下文长度上进行训练优于在更长上下文上进行训练;(ii)使用页面索引进行训练和评估为长文档性能提供了简单且高影响力的提升;(iii)我们的合成数据管道通过持续预训练和监督微调实现自我改进;(iv)我们将已知的文本到视觉长上下文迁移扩展到反向,表明视觉长上下文训练可以迁移到长上下文文本性能。我们还发布了MMLBD-C,这是MMLongBenchDoc的手动修正版本,以减少基准中的错误和低质量示例。
cs.CV / 8 / 2602.15277

Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization

通过探索-利用优化加速大规模数据集蒸馏
Alahmadi, Muhammad J., Gao, Peng, Wang, Feiyi, Xu, Dongkuan
Abstract
Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large-scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration-Exploitation Distillation (E^2D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E^2D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being 18x faster, and on ImageNet-21K, our method substantially improves accuracy while remaining 4.3x faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at https://github.com/ncsu-dk-lab.
Chinese Translation
数据集蒸馏将原始数据压缩为紧凑的合成数据集,从而减少训练时间和存储需求,同时保持模型性能,使其能够在资源有限的情况下进行部署。尽管近期基于解耦的数据蒸馏方法使得大规模数据集蒸馏成为可能,但它们仍面临效率差距:基于优化的解耦方法实现了更高的准确性,但需要大量计算,而无优化的解耦方法则高效但牺牲了准确性。为了克服这一权衡,我们提出了探索-利用蒸馏(Exploration-Exploitation Distillation, E^2D),这是一种简单实用的方法,通过高效的流程最小化冗余计算,该流程以全图初始化开始,以保持语义完整性和特征多样性。然后,它使用两阶段优化策略:探索阶段进行均匀更新并识别高损失区域,利用阶段则集中更新这些区域以加速收敛。我们在大规模基准上评估了E^2D,结果在ImageNet-1K上超越了最先进的技术,同时速度提高了18倍;在ImageNet-21K上,我们的方法显著提高了准确性,同时速度仍然快4.3倍。这些结果表明,针对性的、减少冗余的更新,而非暴力优化,弥合了大规模数据集蒸馏中准确性与效率之间的差距。代码可在 https://github.com/ncsu-dk-lab 获取。
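The two-phase strategy above can be illustrated on a toy objective; the quadratic per-region loss, learning rate, and region count below are stand-ins, not the paper's actual distillation loss:

```python
def optimize_e2d(values, targets, explore_steps, exploit_steps, lr, top_k):
    """Exploration: uniform gradient steps on every region, then rank
    regions by remaining loss. Exploitation: spend the rest of the update
    budget only on the top-k highest-loss regions."""
    n = len(values)
    for _ in range(explore_steps):            # exploration phase
        for i in range(n):
            values[i] -= lr * 2 * (values[i] - targets[i])
    loss = [(values[i] - targets[i]) ** 2 for i in range(n)]
    hot = sorted(range(n), key=lambda i: loss[i], reverse=True)[:top_k]
    for _ in range(exploit_steps):            # exploitation phase
        for i in hot:
            values[i] -= lr * 2 * (values[i] - targets[i])
    return values, hot

vals, hot = optimize_e2d([0.0, 0.0, 0.0], [1.0, 0.1, 2.0],
                         explore_steps=1, exploit_steps=5, lr=0.25, top_k=1)
# hot → [2]: the region farthest from its target receives the focused updates.
```

The point of the split is budget allocation: already-converged regions stop consuming gradient steps after the exploration phase.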
cs.CV / 9 / 2602.15278

Visual Persuasion: What Influences Decisions of Vision-Language Models?

视觉说服:什么影响视觉-语言模型的决策?
Cherep, Manuel, R, Pranav M, Maes, Pattie, Singh, Nikhil
Abstract
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
Chinese Translation
网络上充斥着图像,这些图像最初是为人类消费而创建的,现在越来越多地被使用视觉-语言模型(VLMs)的代理进行解读。这些代理在大规模上做出视觉决策,决定点击、推荐或购买什么。然而,我们对它们视觉偏好的结构知之甚少。我们通过将VLMs置于受控的基于图像的选择任务中,并系统性地扰动其输入,提出了一个研究框架。我们的关键思想是将代理的决策函数视为一种潜在的视觉效用,可以通过显示性偏好进行推断:即在经过系统编辑的图像之间进行选择。从常见图像(如产品照片)出发,我们提出了视觉提示优化的方法,改编文本优化方法,利用图像生成模型迭代地提出并应用视觉上合理的修改(例如在构图、照明或背景方面)。然后,我们评估哪些编辑增加了选择概率。通过对前沿VLMs的大规模实验,我们证明了优化后的编辑在正面对比中显著改变了选择概率。我们开发了一个自动可解释性管道来解释这些偏好,识别出驱动选择的一致视觉主题。我们认为,这种方法提供了一种实用而高效的方式来揭示视觉脆弱性,即那些否则可能只会在实际应用中被隐性发现的安全隐患,从而支持对基于图像的人工智能代理进行更主动的审计和治理。
cs.CV / 10 / 2602.15287

Consistency-Preserving Diverse Video Generation

保持一致性的多样化视频生成
Liu, Xinshuang, Li, Runfa Blark, Nguyen, Truong
Abstract
Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
Chinese Translation
文本到视频生成的成本较高,因此通常每个提示仅生成少量样本。在这种低样本情况下,最大化每个批次的价值需要高跨视频多样性。近期的方法提高了图像生成的多样性,但在视频生成中往往会降低视频内的时间一致性,并且需要通过视频解码器进行昂贵的反向传播。我们提出了一种用于流匹配视频生成器的联合采样框架,旨在提高批次多样性,同时保持时间一致性。我们的方法应用多样性驱动的更新,然后仅去除那些会降低时间一致性目标的组件。为了避免图像空间梯度,我们使用轻量级潜在空间模型计算这两个目标,从而避免视频解码和解码器反向传播。在一个最先进的文本到视频流匹配模型上的实验表明,我们的方法在多样性上可与强联合采样基线相媲美,同时显著提高了时间一致性和颜色自然性。代码将会发布。
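The abstract's "removes only the components that would decrease a temporal-consistency objective" suggests a gradient-projection step. A minimal sketch under that assumption (the paper's exact mechanism may differ):

```python
def filtered_update(div_grad, cons_grad):
    """Apply a diversity-driven update, but first strip the part of it
    that conflicts with the temporal-consistency objective: if the two
    gradients disagree (negative dot product), project the conflicting
    component of the diversity gradient out."""
    dot = sum(d * c for d, c in zip(div_grad, cons_grad))
    if dot >= 0:
        return list(div_grad)          # no conflict: keep the full update
    norm2 = sum(c * c for c in cons_grad)
    return [d - (dot / norm2) * c for d, c in zip(div_grad, cons_grad)]

# The second component fought consistency and is removed; the first survives.
safe = filtered_update([1.0, -1.0], [0.0, 1.0])  # → [1.0, 0.0]
```

After projection, the retained update is orthogonal to the consistency gradient, so to first order it no longer decreases the consistency objective.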
cs.CV / 11 / 2602.15315

Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models

基于二维基础模型的3D脑MRI无训练零样本异常检测
Le-Gia, Tai, Ahn, Jaehyun
Abstract
Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.
Chinese Translation
零样本异常检测(ZSAD)作为一种无需特定任务监督即可识别异常的方法,在医学影像中受到越来越多的关注,但大多数进展仍限于二维数据集。将ZSAD扩展到三维医学图像已被证明具有挑战性,现有方法依赖于逐切片特征和视觉-语言模型,无法捕捉体积结构。在本文中,我们提出了一种完全无训练的3D脑MRI ZSAD框架,通过聚合由二维基础模型处理的多轴切片来构建局部体积标记。这些3D补丁标记恢复了立方体空间上下文,并可与基于距离的批量级异常检测管道直接集成。该框架提供了紧凑的3D表示,便于在标准GPU上计算,且不需要微调、提示或监督。我们的结果表明,无训练的基于批量的ZSAD可以有效地从二维编码器扩展到完整的3D MRI体积,为体积异常检测提供了一种简单而稳健的方法。
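The "distance-based, batch-level anomaly detection" pipeline the framework plugs into can be sketched with a simple nearest-neighbor rule over the volumetric tokens; the 2-D token values below are illustrative placeholders:

```python
def batch_knn_anomaly_scores(tokens):
    """Score each volumetric token by its squared distance to its nearest
    neighbor among the other tokens in the batch: common (normal) patterns
    have close neighbors, rare (anomalous) ones sit far from everything."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(dist2(t, u) for j, u in enumerate(tokens) if j != i)
            for i, t in enumerate(tokens)]

scores = batch_knn_anomaly_scores([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)])
# The isolated third token receives by far the highest score.
```

No training, prompts, or labels are involved: the batch itself supplies the reference distribution, which is what makes the approach zero-shot.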
cs.CV / 12 / 2602.15318

Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Sparrow:基于文本锚定窗口注意力与视觉语义窥视的视频大语言模型推测解码
Zhang, Libo, Zhang, Zhaoning, Hong, Wangyang, Qiao, Peng, Li, Dongsheng
Abstract
Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
Chinese Translation
尽管推测解码在加速视觉语言模型(VLMs)推理中被广泛应用,但在视频大语言模型(Vid-LLMs)中应用时却面临严重的性能崩溃。草稿模型通常会陷入注意力稀释和负视觉增益的陷阱,这主要是由于键值缓存爆炸和上下文窗口不匹配。我们观察到Vid-LLMs中存在一种视觉语义内化现象,表明在深层交互过程中,关键的视觉语义被隐式编码到文本隐状态中,这使得原始视觉输入在深度推理中结构上变得冗余。为了解决这个问题,我们提出了Sparrow框架,该框架首先通过隐状态重用利用视觉感知的文本锚定窗口注意力,完全将视觉计算卸载到目标模型,并利用中间层视觉状态桥接来用富含语义的中间状态训练草稿模型,从而过滤掉低级视觉噪声。此外,引入了一种多标记预测策略,以弥合训练与推理之间的分布差异。实验表明,Sparrow在处理25k视觉标记时实现了平均2.82倍的加速,有效解决了长序列中的性能下降问题,并为实时长视频任务提供了实用的解决方案。
cs.CV / 13 / 2602.15329

EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use

EventMemAgent:基于层次事件中心记忆的在线视频理解与自适应工具使用
Wen, Siwei, Wang, Zhangcheng, Zhang, Xingjian, Huang, Lei, Wu, Wenjun
Abstract
Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often face a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to process streaming video frames within a fixed-length buffer dynamically; long-term memory structuredly archives past observations on an event-by-event basis. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to end-to-end internalize reasoning and tool-use strategies into the agent's intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: https://github.com/lingcco/EventMemAgent.
Chinese Translation
在线视频理解要求模型在潜在无限的视觉流中进行持续感知和长距离推理。其根本挑战在于流媒体输入的无限特性与多模态大型语言模型(MLLMs)有限上下文窗口之间的冲突。目前的方法主要依赖被动处理,常常面临在保持长距离上下文与捕捉复杂任务所需的细粒度细节之间的权衡。为了解决这一问题,我们提出了EventMemAgent,一个基于层次记忆模块的主动在线视频代理框架。我们的框架采用双层策略处理在线视频:短期记忆检测事件边界,并利用事件粒度的蓄水池抽样动态处理固定长度缓冲区内的流视频帧;长期记忆则结构化地按事件逐一存档过去的观察。此外,我们整合了多粒度感知工具包以实现主动的、迭代的证据捕获,并采用代理强化学习(Agentic RL)将推理和工具使用策略端到端内化为代理的内在能力。实验表明,EventMemAgent在在线视频基准测试中取得了具有竞争力的结果。代码将发布于此:https://github.com/lingcco/EventMemAgent。
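The "event-granular reservoir sampling" in the short-term memory is classic reservoir sampling applied per event: a uniform sample of fixed size drawn from an unbounded stream in O(k) memory. A sketch, with integers standing in for frames:

```python
import random

def reservoir_update(buffer, k, frame, seen_count):
    """One reservoir-sampling step for the current event: `seen_count` is
    how many frames of this event have arrived so far, including `frame`.
    Every frame ends up in the buffer with equal probability k/seen_count."""
    if len(buffer) < k:
        buffer.append(frame)
    else:
        j = random.randrange(seen_count)
        if j < k:
            buffer[j] = frame

# Stream 100 frames of one event through an 8-slot buffer.
buf = []
for n in range(1, 101):
    reservoir_update(buf, k=8, frame=n - 1, seen_count=n)
```

When an event boundary is detected, the buffer can be archived to long-term memory and the reservoir restarted for the next event, which keeps the total frame budget fixed no matter how long the stream runs.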
cs.CV / 14 / 2602.15346

Effective and Robust Multimodal Medical Image Analysis

有效且稳健的多模态医学图像分析
Dhar, Joy, Zaidi, Nayyar, Haghighat, Maryam
Abstract
Multimodal Fusion Learning (MFL), leveraging disparate data from various imaging modalities (e.g., MRI, CT, SPECT), has shown great potential for addressing medical problems such as skin cancer and brain tumor prediction. However, existing MFL methods face three key limitations: a) they often specialize in specific modalities, and overlook effective shared complementary information across diverse modalities, hence limiting their generalizability for multi-disease analysis; b) they rely on computationally expensive models, restricting their applicability in resource-limited settings; and c) they lack robustness against adversarial attacks, compromising reliability in medical AI applications. To address these limitations, we propose a novel Multi-Attention Integration Learning (MAIL) network, incorporating two key components: a) an efficient residual learning attention block for capturing refined modality-specific multi-scale patterns and b) an efficient multimodal cross-attention module for learning enriched complementary shared representations across diverse modalities. Furthermore, to ensure adversarial robustness, we extend MAIL network to design Robust-MAIL by incorporating random projection filters and modulated attention noise. Extensive evaluations on 20 public datasets show that both MAIL and Robust-MAIL outperform existing methods, achieving performance gains of up to 9.34% while reducing computational costs by up to 78.3%. These results highlight the superiority of our approaches, ensuring more reliable predictions than top competitors. Code: https://github.com/misti1203/MAIL-Robust-MAIL.
Chinese Translation
多模态融合学习(Multimodal Fusion Learning, MFL)利用来自不同成像模态(如MRI、CT、SPECT)的异构数据,在解决皮肤癌和脑肿瘤预测等医学问题上展现出巨大潜力。然而,现有的MFL方法面临三个主要限制:a)它们通常专注于特定模态,忽视了跨多种模态的有效共享互补信息,从而限制了其在多疾病分析中的普适性;b)它们依赖于计算成本高昂的模型,限制了在资源有限环境中的适用性;c)它们缺乏对抗攻击的稳健性,影响了医学人工智能应用的可靠性。为了解决这些限制,我们提出了一种新颖的多注意力集成学习(Multi-Attention Integration Learning, MAIL)网络,包含两个关键组件:a)一个高效的残差学习注意力模块,用于捕捉精细的模态特定多尺度模式;b)一个高效的多模态交叉注意力模块,用于学习跨多种模态的丰富互补共享表示。此外,为了确保对抗稳健性,我们扩展MAIL网络,设计了稳健MAIL(Robust-MAIL),通过引入随机投影滤波器和调制注意力噪声。对20个公共数据集的广泛评估表明,MAIL和Robust-MAIL均优于现有方法,性能提升高达9.34%,同时计算成本降低高达78.3%。这些结果突显了我们方法的优越性,确保了比顶尖竞争者更可靠的预测。代码链接: https://github.com/misti1203/MAIL-Robust-MAIL。
cs.CV / 15 / 2602.15349

CREMD: Crowd-Sourced Emotional Multimodal Dogs Dataset

CREMD:众包情感多模态犬类数据集
Baek, Jinho, Cao, Houwei, Blackwell, Kate
Abstract
Dog emotion recognition plays a crucial role in enhancing human-animal interactions, veterinary care, and the development of automated systems for monitoring canine well-being. However, accurately interpreting dog emotions is challenging due to the subjective nature of emotional assessments and the absence of standardized ground truth methods. We present the CREMD (Crowd-sourced Emotional Multimodal Dogs Dataset), a comprehensive dataset exploring how different presentation modes (e.g., context, audio, video) and annotator characteristics (e.g., dog ownership, gender, professional experience) influence the perception and labeling of dog emotions. The dataset consists of 923 video clips presented in three distinct modes: without context or audio, with context but no audio, and with both context and audio. We analyze annotations from diverse participants, including dog owners, professionals, and individuals with varying demographic backgrounds and experience levels, to identify factors that influence reliable dog emotion recognition. Our findings reveal several key insights: (1) while adding visual context significantly improved annotation agreement, our findings regarding audio cues are inconclusive due to design limitations (specifically, the absence of a no-context-with-audio condition and limited clean audio availability); (2) contrary to expectations, non-owners and male annotators showed higher agreement levels than dog owners and female annotators, respectively, while professionals showed higher agreement levels, aligned with our initial hypothesis; and (3) the presence of audio substantially increased annotators' confidence in identifying specific emotions, particularly anger and fear.
Chinese Translation
犬类情感识别在增强人类与动物的互动、兽医护理以及开发自动化监测犬只福祉的系统中发挥着至关重要的作用。然而,由于情感评估的主观性和缺乏标准化的真实情况方法,准确解读犬类情感具有挑战性。我们提出了CREMD(众包情感多模态犬类数据集),这是一个全面的数据集,探讨不同呈现模式(例如,背景、音频、视频)和标注者特征(例如,犬只拥有情况、性别、专业经验)如何影响犬类情感的感知和标注。该数据集包含923个视频片段,以三种不同的模式呈现:无背景或音频、带背景但无音频、以及同时带背景和音频。我们分析了来自不同参与者的标注,包括犬主、专业人士以及具有不同人口背景和经验水平的个体,以识别影响可靠犬类情感识别的因素。我们的研究结果揭示了几个关键见解:(1)虽然添加视觉背景显著提高了标注一致性,但由于设计限制(具体而言,缺乏无背景带音频的条件和有限的清晰音频可用性),我们关于音频线索的发现并不确定;(2)与预期相反,非犬主和男性标注者的标注一致性水平高于犬主和女性标注者,而专业人士的标注一致性水平更高,这与我们的初步假设一致;(3)音频的存在显著提高了标注者对识别特定情感的信心,特别是愤怒和恐惧。
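The agreement analyses above rest on chance-corrected inter-annotator statistics. The abstract does not name its agreement metric; as an illustrative sketch, Cohen's kappa for two annotators (with hypothetical toy labels) can be computed as:

```python
import numpy as np

def cohens_kappa(a, b, labels):
    """Chance-corrected agreement between two annotators' label sequences."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                                           # observed agreement
    pe = sum(np.mean(a == lab) * np.mean(b == lab) for lab in labels)  # chance agreement
    return (po - pe) / (1.0 - pe)

# Toy emotion labels from two hypothetical annotators
ann1 = ["happy", "fear", "happy", "anger", "happy", "fear"]
ann2 = ["happy", "fear", "anger", "anger", "happy", "happy"]
kappa = cohens_kappa(ann1, ann2, ["happy", "fear", "anger"])
```

Values near 0 indicate chance-level agreement, 1 indicates perfect agreement; multi-rater studies typically use the Fleiss generalization instead.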
cs.CV / 16 / 2602.15355

DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles

DAV-GSWT:用于数据高效高斯喷溅王砖的扩散主动视图采样
Fu, Rong, Wu, Jiekai, Wei, Haiyun, Jia, Yee Tan, Zhang, Wenxin, Li, Yang, Ma, Xiaowen, Wu, Wangyu, Fong, Simon
Abstract
The emergence of 3D Gaussian Splatting has fundamentally redefined the capabilities of photorealistic neural rendering by enabling high-throughput synthesis of complex environments. While procedural methods like Wang Tiles have recently been integrated to facilitate the generation of expansive landscapes, these systems typically remain constrained by a reliance on densely sampled exemplar reconstructions. We present DAV-GSWT, a data-efficient framework that leverages diffusion priors and active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations. By integrating a hierarchical uncertainty quantification mechanism with generative diffusion models, our approach autonomously identifies the most informative viewpoints while hallucinating missing structural details to ensure seamless tile transitions. Experimental results indicate that our system significantly reduces the required data volume while maintaining the visual integrity and interactive performance necessary for large-scale virtual environments.
Chinese Translation
三维高斯喷溅的出现从根本上重新定义了光逼真神经渲染的能力,使得复杂环境的高通量合成成为可能。尽管像王砖(Wang Tiles)这样的程序化方法最近被整合以促进广阔景观的生成,但这些系统通常仍然受到对密集采样示例重建的依赖。我们提出了DAV-GSWT,这是一种数据高效的框架,利用扩散先验和主动视图采样,从最少的输入观察中合成高保真高斯喷溅王砖。通过将分层不确定性量化机制与生成性扩散模型相结合,我们的方法能够自主识别最具信息量的视角,同时幻觉缺失的结构细节,以确保无缝的砖块过渡。实验结果表明,我们的系统显著减少了所需的数据量,同时保持了大规模虚拟环境所需的视觉完整性和交互性能。
cs.CV / 17 / 2602.15368

GMAIL: Generative Modality Alignment for generated Image Learning

GMAIL:生成图像学习的生成模态对齐
Mo, Shentong, Yun, Sukmin
Abstract
Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
Chinese Translation
生成模型使得合成高度逼真的图像成为可能,这为训练机器学习模型提供了丰富的数据来源。尽管这些可合成数据源具有优势,但将生成图像不加区分地作为真实图像用于训练,可能会因真实域与合成域之间的模态差异而导致模态崩溃。本文提出了一种新颖的框架,称为GMAIL,专门用于区分性地使用生成图像,明确将生成图像视为与真实图像不同的模态。我们的做法不是在像素空间中不加区分地用生成图像替代真实图像,而是通过多模态学习方法在同一潜在空间中架起这两种不同模态的桥梁。具体而言,我们首先使用跨模态对齐损失对模型进行专门的生成图像微调,然后利用该对齐模型进一步训练各种视觉-语言模型与生成图像。通过对齐这两种模态,我们的方法有效地利用了生成模型的最新进展,从而提升了生成图像学习在多种视觉-语言任务中的有效性。我们的框架可以轻松与各种视觉-语言模型结合,我们通过广泛的实验展示了其有效性。例如,我们的框架显著提高了图像描述、零样本图像检索、零样本图像分类和长描述检索任务的性能。同时,它还显示出生成数据扩展的积极趋势,并在大型多模态模型LLaVA的描述性能上有显著提升。
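The cross-modality alignment loss is not specified in the abstract; a minimal cosine-alignment sketch, assuming paired generated/real embeddings already projected into a shared latent space, might look like:

```python
import numpy as np

def alignment_loss(gen_emb, real_emb):
    """Hypothetical alignment loss: 1 - mean cosine similarity between
    paired generated- and real-image embeddings (lower = better aligned)."""
    g = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    r = real_emb / np.linalg.norm(real_emb, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(g * r, axis=1)))

real = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = np.array([[2.0, 0.0], [0.0, 3.0]])   # same directions as real, scaled
off = np.array([[0.0, 1.0], [1.0, 0.0]])       # orthogonal to real
```

Because embeddings are normalized, the loss is invariant to scale and only penalizes directional mismatch between the two modalities.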
cs.CV / 18 / 2602.15383

Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation

弥合昼夜差异:无配对图像翻译中的目标类别幻觉抑制
Li, Shuwei, Tan, Lei, Tan, Robby T.
Abstract
Day-to-night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel-level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man-made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target-class features during unpaired translation. To detect hallucination, we design a dual-head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class-specific prototypes, constructed by aggregating features of annotated target-domain objects, which act as semantic anchors for each class. Built upon a Schrodinger Bridge-based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation trajectory. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day-to-night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.
Chinese Translation
昼夜无配对图像翻译对下游任务至关重要,但由于外观变化大和缺乏直接的像素级监督,仍然面临挑战。现有方法常常引入语义幻觉,其中目标类别的物体,如交通标志和车辆,以及人造光效应被错误合成。这些幻觉显著降低了下游性能。我们提出了一种新颖的框架,在无配对翻译过程中检测和抑制目标类别特征的幻觉。为了检测幻觉,我们设计了一个双头判别器,额外执行语义分割,以识别背景区域中的幻觉内容。为了抑制这些幻觉,我们引入了类别特定的原型,这些原型通过聚合标注目标领域物体的特征构建,作为每个类别的语义锚点。基于薛定谔桥(Schrodinger Bridge)翻译模型,我们的框架执行迭代精炼,其中检测到的幻觉特征在特征空间中被明确推离类别原型,从而在翻译轨迹中保持物体语义。实验表明,我们的方法在定性和定量上均优于现有方法。在BDD100K数据集上,它在昼夜领域适应中提高了15.5%的平均精度(mAP),对于易受幻觉影响的交通信号灯等类别,提升幅度达到31.7%。
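The class-specific prototypes and the push-away refinement can be sketched roughly as follows; this is a hypothetical gradient-free version on toy features, whereas the paper operates on learned features along a Schrödinger Bridge trajectory:

```python
import numpy as np

def build_prototypes(features, labels, num_classes):
    """Aggregate annotated target-domain object features into one prototype per class."""
    protos = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(axis=0)
    return protos

def repel(feat, protos, step=0.5):
    """Move a detected hallucination feature away from its nearest class prototype."""
    nearest = protos[np.argmin(np.linalg.norm(protos - feat, axis=1))]
    direction = feat - nearest
    return feat + step * direction / (np.linalg.norm(direction) + 1e-8)

rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))
labels = np.array([0] * 5 + [1] * 5)
protos = build_prototypes(feats, labels, 2)
h = feats[0]                     # pretend this feature was flagged as hallucinated
h2 = repel(h, protos)
```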
cs.CV / 19 / 2602.15396

Efficient Generative Modeling beyond Memoryless Diffusion via Adjoint Schrödinger Bridge Matching

超越无记忆扩散的高效生成建模:伴随薛定谔桥匹配
Shin, Jeongwoo, Sul, Jinhwan, Lee, Joonseok, Choi, Jaewong, Choi, Jaemoo
Abstract
Diffusion models often yield highly curved trajectories and noisy score targets due to an uninformative, memoryless forward process that induces independent data-noise coupling. We propose Adjoint Schrödinger Bridge Matching (ASBM), a generative modeling framework that recovers optimal trajectories in high dimensions via two stages. First, we view the Schrödinger Bridge (SB) forward dynamic as a coupling construction problem and learn it through a data-to-energy sampling perspective that transports data to an energy-defined prior. Then, we learn the backward generative dynamic with a simple matching loss supervised by the induced optimal coupling. By operating in a non-memoryless regime, ASBM produces significantly straighter and more efficient sampling paths. Compared to prior works, ASBM scales to high-dimensional data with notably improved stability and efficiency. Extensive experiments on image generation show that ASBM improves fidelity with fewer sampling steps. We further showcase the effectiveness of our optimal trajectory via distillation to a one-step generator.
Chinese Translation
扩散模型常常由于无信息的无记忆前向过程而产生高度弯曲的轨迹和噪声评分目标,这种过程导致数据与噪声的独立耦合。我们提出了伴随薛定谔桥匹配(Adjoint Schrödinger Bridge Matching, ASBM),这是一种生成建模框架,通过两个阶段在高维空间中恢复最优轨迹。首先,我们将薛定谔桥(Schrödinger Bridge, SB)前向动态视为耦合构造问题,并通过数据到能量的采样视角学习它,将数据传输到能量定义的先验中。然后,我们通过一个简单的匹配损失学习反向生成动态,该损失由诱导的最优耦合进行监督。通过在非无记忆的状态下操作,ASBM 生成了显著更直且更高效的采样路径。与之前的工作相比,ASBM 能够扩展到高维数据,并显著提高了稳定性和效率。大量关于图像生成的实验表明,ASBM 在减少采样步骤的同时提高了保真度。我们进一步展示了通过蒸馏到一步生成器来验证我们最优轨迹的有效性。
cs.CV / 20 / 2602.15461

Emergent Morphing Attack Detection in Open Multi-modal Large Language Models

开放多模态大型语言模型中的新兴变形攻击检测
Ivanovska, Marija, Štruc, Vitomir
Abstract
Face morphing attacks threaten biometric verification, yet most morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, open-source multimodal large language models (MLLMs) have demonstrated strong visual-linguistic reasoning, but their potential in biometric forensics remains underexplored. In this paper, we present the first systematic zero-shot evaluation of open-source MLLMs for single-image MAD, using publicly available weights and a standardized, reproducible protocol. Across diverse morphing techniques, many MLLMs show non-trivial discriminative ability without any fine-tuning or domain adaptation, and LLaVA1.6-Mistral-7B achieves state-of-the-art performance, surpassing highly competitive task-specific MAD baselines by at least 23% in terms of equal error rate (EER). The results indicate that multimodal pretraining can implicitly encode fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity. Our findings position open-source MLLMs as reproducible, interpretable, and competitive foundations for biometric security and forensic image analysis. This emergent capability also highlights new opportunities to develop state-of-the-art MAD systems through targeted fine-tuning or lightweight adaptation, further improving accuracy and efficiency while preserving interpretability. To support future research, all code and evaluation protocols will be released upon publication.
Chinese Translation
面部变形攻击威胁生物识别验证,然而大多数变形攻击检测(MAD)系统需要特定任务的训练,并且对未见过的攻击类型泛化能力较差。同时,开源多模态大型语言模型(MLLMs)在视觉语言推理方面表现出色,但其在生物识别法医学中的潜力仍未得到充分探索。在本文中,我们首次系统性地对开源MLLMs进行单图像MAD的零样本评估,使用公开可用的权重和标准化、可重复的协议。在多种变形技术中,许多MLLMs在没有任何微调或领域适应的情况下显示出非平凡的区分能力,而LLaVA1.6-Mistral-7B在等错误率(EER)方面达到了最先进的性能,超越了高度竞争的特定任务MAD基线至少23%。结果表明,多模态预训练可以隐式编码细粒度的面部不一致性,这些不一致性表明变形伪影,从而实现零样本法医学敏感性。我们的发现将开源MLLMs定位为可重复、可解释且具有竞争力的生物识别安全和法医学图像分析基础。这一新兴能力还突显了通过针对性微调或轻量级适应开发最先进MAD系统的新机会,进一步提高准确性和效率,同时保持可解释性。为了支持未来的研究,所有代码和评估协议将在发表时发布。
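The equal error rate (EER) reported above is the operating point where the false accept and false reject rates coincide. A minimal threshold-sweep sketch on illustrative scores (not the paper's data):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER via threshold sweep. `scores`: detector outputs (higher = more
    likely morph/attack); `labels`: 1 for attacks, 0 for bona fide samples."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best = (1.0, None)
    for t in np.unique(scores):
        pred = scores >= t
        far = np.mean(pred[labels == 0])    # bona fide wrongly flagged
        frr = np.mean(~pred[labels == 1])   # attacks wrongly accepted
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]

scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.2, 0.1, 0.4]
labels = [1,   1,   1,   1,    0,   0,   0,   0]
eer = equal_error_rate(scores, labels)
```

With finite score sets FAR and FRR rarely cross exactly, so the midpoint at the threshold with the smallest gap is a common approximation.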
cs.CV / 21 / 2602.15490

RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution

RPT-SR:用于红外图像超分辨率的区域先验注意力变换器
Jin, Youngwan, Park, Incheol, Nalcakan, Yagiz, Ju, Hyeongjin, Yeo, Sanghyeop, Kim, Shiho
Abstract
General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios like surveillance and autonomous driving, which operate from fixed or nearly-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable, regional prior tokens, which act as a persistent memory for the scene's global structure, with (2) local tokens that capture the frame-specific content of the current input. By incorporating these tokens into the attention mechanism, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.
Chinese Translation
通用超分辨率模型,特别是视觉变换器,在许多应用中取得了显著成功,但在监控和自动驾驶等常见红外成像场景中表现出基本的低效,这些场景通常从固定或近乎静态的视角进行操作。这些模型未能充分利用此类场景中固有的强大、持久的空间先验,导致冗余学习和次优性能。为了解决这一问题,我们提出了用于红外图像超分辨率的区域先验注意力变换器(RPT-SR),这是一种新颖的架构,明确地将场景布局信息编码到注意力机制中。我们的核心贡献是一个双令牌框架,该框架融合了(1)可学习的区域先验令牌,作为场景全局结构的持久记忆,以及(2)捕捉当前输入帧特定内容的局部令牌。通过将这些令牌用于注意力机制,我们的模型允许先验动态调节局部重建过程。大量实验验证了我们的方法。尽管大多数先前的工作集中在单一的红外波段上,我们通过在涵盖长波(LWIR)和短波(SWIR)光谱的多样数据集上建立新的最先进性能,展示了RPT-SR的广泛适用性和多样性。
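The dual-token idea, persistent prior tokens joined with frame-specific local tokens inside attention, can be sketched as follows (single head, no learned projections, illustrative shapes only; the actual architecture is not detailed in the abstract):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prior_token_attention(local, prior):
    """Local tokens attend over both learnable prior tokens and themselves,
    so persistent scene structure can modulate per-frame reconstruction."""
    kv = np.concatenate([prior, local], axis=0)       # (P+N, d) keys/values
    attn = softmax(local @ kv.T / np.sqrt(local.shape[1]))  # (N, P+N)
    return attn @ kv, attn

rng = np.random.default_rng(1)
local = rng.normal(size=(6, 8))    # frame-specific tokens
prior = rng.normal(size=(4, 8))    # persistent regional prior tokens
out, attn = prior_token_attention(local, prior)
```

Each output row mixes current-frame content with the scene memory in proportion to the attention weights.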
cs.CV / 22 / 2602.15493

LEADER: Lightweight End-to-End Attention-Gated Dual Autoencoder for Robust Minutiae Extraction

LEADER:轻量级端到端注意力门控双自编码器用于稳健的细节提取
Cappelli, Raffaele, Ferrara, Matteo
Abstract
Minutiae extraction, a fundamental stage in fingerprint recognition, is increasingly shifting toward deep learning. However, truly end-to-end methods that eliminate separate preprocessing and postprocessing steps remain scarce. This paper introduces LEADER (Lightweight End-to-end Attention-gated Dual autoencodER), a neural network that maps raw fingerprint images to minutiae descriptors, including location, direction, and type. The proposed architecture integrates non-maximum suppression and angular decoding to enable complete end-to-end inference using only 0.9M parameters. It employs a novel "Castle-Moat-Rampart" ground-truth encoding and a dual-autoencoder structure, interconnected through an attention-gating mechanism. Experimental evaluations demonstrate state-of-the-art accuracy on plain fingerprints and robust cross-domain generalization to latent impressions. Specifically, LEADER attains a 34% higher F1-score on the NIST SD27 dataset compared to specialized latent minutiae extractors. Sample-level analysis on this challenging benchmark reveals an average rank of 2.07 among all compared methods, with LEADER securing the first-place position in 47% of the samples, more than doubling the frequency of the second-best extractor. The internal representations learned by the model align with established fingerprint domain features, such as segmentation masks, orientation fields, frequency maps, and skeletons. Inference requires 15ms on GPU and 322ms on CPU, outperforming leading commercial software in computational efficiency. The source code and pre-trained weights are publicly released to facilitate reproducibility.
Chinese Translation
细节提取是指纹识别中的一个基本阶段,正日益向深度学习转变。然而,真正消除单独预处理和后处理步骤的端到端方法仍然稀缺。本文介绍了LEADER(轻量级端到端注意力门控双自编码器),这是一种将原始指纹图像映射到细节描述符(包括位置、方向和类型)的神经网络。所提出的架构集成了非最大抑制和角度解码,以仅使用0.9M参数实现完整的端到端推理。它采用了一种新颖的“城堡-护城河-城垣”真实值编码和双自编码器结构,通过注意力门控机制相互连接。实验评估表明,在普通指纹上实现了最先进的准确性,并在潜在印记上展现出稳健的跨域泛化能力。具体而言,LEADER在NIST SD27数据集上获得了比专业潜在细节提取器高34%的F1分数。在这一具有挑战性的基准上进行的样本级分析显示,在所有比较方法中,LEADER的平均排名为2.07,并在47%的样本中获得第一名,超过了第二名提取器的频率。模型学习到的内部表示与已建立的指纹领域特征(如分割掩膜、方向场、频率图和骨架)一致。推理在GPU上需要15毫秒,在CPU上需要322毫秒,在计算效率上优于领先的商业软件。源代码和预训练权重已公开发布,以促进可重复性。
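Two of the components named above are standard and easy to sketch: sin/cos angular encoding (decoded with atan2) and non-maximum suppression over a minutia score map. A hypothetical 3x3 version, not the paper's exact formulation:

```python
import numpy as np

def encode_angle(theta):
    """Encode a minutia direction as (sin, cos): avoids the 0/2π wrap
    discontinuity that a raw-angle regression target would suffer."""
    return np.array([np.sin(theta), np.cos(theta)])

def decode_angle(sc):
    """Recover a direction in [0, 2π) from a (sin, cos) pair."""
    return float(np.arctan2(sc[0], sc[1]) % (2 * np.pi))

def nms_2d(heat, thr=0.5):
    """Keep score-map peaks that dominate their 3x3 neighborhood."""
    peaks = []
    h, w = heat.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = heat[y, x]
            if v >= thr and v == heat[y - 1:y + 2, x - 1:x + 2].max():
                peaks.append((y, x))
    return peaks

heat = np.zeros((5, 5))
heat[2, 2], heat[1, 1] = 0.9, 0.6   # one true peak, one suppressed neighbor
```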
cs.CV / 23 / 2602.15516

Semantic-Guided 3D Gaussian Splatting for Transient Object Removal

基于语义引导的3D高斯点云渲染用于瞬态物体移除
Prabakaran, Aditi, Shukla, Priyesh
Abstract
Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions relied on scene decomposition at significant memory cost or on motion-based heuristics that were vulnerable to parallax ambiguity. A semantic filtering framework was proposed for category-aware transient removal using vision-language models. CLIP similarity scores between rendered views and distractor text prompts were accumulated per-Gaussian across training iterations. Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolved parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrated consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration and comparisons with baselines validated semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
Chinese Translation
在随意的多视角捕捉中,瞬态物体会导致3D高斯点云渲染(3DGS)重建中的鬼影伪影。现有的解决方案依赖于场景分解,代价高昂的内存消耗,或基于运动的启发式方法,这些方法容易受到视差模糊的影响。我们提出了一种语义过滤框架,利用视觉-语言模型进行类别感知的瞬态物体移除。在训练迭代中,渲染视图与干扰文本提示之间的CLIP相似度分数被按高斯分布累积。超过校准阈值的高斯分布经过不透明度正则化和周期性修剪。与基于运动的方法不同,语义分类通过独立于运动模式识别物体类别来解决视差模糊问题。在RobustNeRF基准上的实验表明,在四个序列中,与传统的3DGS相比,重建质量持续改善,同时保持最低的内存开销和实时渲染性能。阈值校准和与基线的比较验证了语义引导作为在可预测的干扰类别场景中进行瞬态移除的实用策略。
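Per-Gaussian accumulation of distractor-similarity scores followed by threshold-based pruning might be sketched like this (hypothetical API; the actual method accumulates CLIP similarities of rendered views against distractor text prompts across training iterations):

```python
import numpy as np

class TransientFilter:
    """Running per-Gaussian distractor score; prune above a calibrated threshold."""

    def __init__(self, num_gaussians, threshold=0.6):
        self.sum = np.zeros(num_gaussians)
        self.count = np.zeros(num_gaussians)
        self.threshold = threshold

    def accumulate(self, visible_ids, clip_scores):
        """Add this iteration's scores for the Gaussians visible in the render."""
        self.sum[visible_ids] += clip_scores
        self.count[visible_ids] += 1

    def prune_mask(self):
        """True for Gaussians whose mean distractor score exceeds the threshold."""
        mean = np.divide(self.sum, np.maximum(self.count, 1))
        return mean > self.threshold

f = TransientFilter(5)
f.accumulate(np.array([0, 1, 2]), np.array([0.9, 0.2, 0.7]))
f.accumulate(np.array([0, 2, 3]), np.array([0.8, 0.4, 0.1]))
mask = f.prune_mask()
```

Averaging over iterations is what makes the decision robust: a single high score from one view is not enough to prune a Gaussian.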
cs.CV / 24 / 2602.15535

Advanced Acceptance Score: A Holistic Measure for Biometric Quantification

高级接受分数:生物特征量化的整体衡量标准
Verma, Aman, Srirangarajan, Seshan, Roy, Sumantra Dutta
Abstract
Quantifying biometric characteristics within hand gestures involves deriving fitness scores from a gesture- and identity-aware feature space. However, evaluating the quality of these scores remains an open question. Existing biometric capacity estimation literature relies upon error rates, but these rates do not indicate the goodness of scores. Thus, in this manuscript we present an exhaustive set of evaluation measures. We first identify the ranking order and relevance of output scores as the primary basis for evaluation. In particular, we consider both rank deviation as well as rewards for: (i) higher scores of high-ranked gestures and (ii) lower scores of low-ranked gestures. We also compensate for correspondence between trends of output and ground-truth scores. Finally, we account for disentanglement between identity features of gestures as a discounting factor. Integrating these elements with adequate weighting, we formulate the advanced acceptance score as a holistic evaluation measure. To assess the effectiveness of the proposed measure, we perform in-depth experimentation over three datasets with five state-of-the-art (SOTA) models. Results show that the optimal score selected with our measure is more appropriate than those selected by existing measures. Our proposed measure also correlates with existing measures, which further validates its reliability. We have made our code public at https://github.com/AmanVerma2307/MeasureSuite.
Chinese Translation
在手势识别中量化生物特征涉及从手势和身份感知特征空间中推导适应性分数。然而,评估这些分数的质量仍然是一个未解的问题。现有的生物特征容量估计文献依赖于错误率,但这些错误率并不能指示分数的优劣。因此,在本稿中,我们提出了一套全面的评估指标。我们首先确定输出分数的排名顺序和相关性作为评估的主要依据。特别地,我们考虑排名偏差以及对以下情况的奖励:(i) 高排名手势的高分数和 (ii) 低排名手势的低分数。我们还补偿输出趋势与真实分数之间的对应关系。最后,我们将手势的身份特征之间的解耦作为折扣因素进行考虑。通过适当的加权整合这些元素,我们将高级接受分数构建为一种整体评估指标。为了评估所提方法的有效性,我们在三个数据集上对五种最先进的(SOTA)模型进行了深入实验。结果表明,使用我们的方法选择的最佳分数比现有其他方法更为合适。此外,我们提出的指标与现有指标之间存在相关性,这进一步验证了其可靠性。我们已将我们的代码公开。
cs.CV / 25 / 2602.15539

Dynamic Training-Free Fusion of Subject and Style LoRAs

动态无训练融合主题与风格的LoRA
Cao, Qinglong, Chen, Yuntian, Ma, Chao, Yang, Xiaokang
Abstract
Recent studies have explored the combination of multiple LoRAs to simultaneously generate user-specified subjects and styles. However, most existing approaches fuse LoRA weights using static statistical heuristics that deviate from LoRA's original purpose of learning adaptive feature adjustments and ignore the randomness of sampled inputs. To address this, we propose a dynamic training-free fusion framework that operates throughout the generation process. During the forward pass, at each LoRA-applied layer, we dynamically compute the KL divergence between the base model's original features and those produced by subject and style LoRAs, respectively, and adaptively select the most appropriate weights for fusion. In the reverse denoising stage, we further refine the generation trajectory by dynamically applying gradient-based corrections derived from objective metrics such as CLIP and DINO scores, providing continuous semantic and stylistic guidance. By integrating these two complementary mechanisms-feature-level selection and metric-guided latent adjustment-across the entire diffusion timeline, our method dynamically achieves coherent subject-style synthesis without any retraining. Extensive experiments across diverse subject-style combinations demonstrate that our approach consistently outperforms state-of-the-art LoRA fusion methods both qualitatively and quantitatively.
Chinese Translation
近期研究探讨了多种LoRA的组合,以同时生成用户指定的主题和风格。然而,大多数现有方法使用静态统计启发式方法融合LoRA权重,这偏离了LoRA原本学习自适应特征调整的目的,并忽视了采样输入的随机性。为了解决这一问题,我们提出了一种动态无训练的融合框架,该框架在生成过程中持续运行。在前向传播阶段,在每个应用LoRA的层中,我们动态计算基础模型原始特征与主题和风格LoRA分别生成的特征之间的KL散度,并自适应选择最合适的权重进行融合。在反向去噪阶段,我们通过动态应用基于梯度的修正,进一步优化生成轨迹,这些修正源自于诸如CLIP和DINO分数等客观指标,提供持续的语义和风格指导。通过在整个扩散时间线上整合这两种互补机制——特征级选择和指标引导的潜在调整,我们的方法动态实现了连贯的主题-风格合成,而无需任何再训练。在多种主题-风格组合的广泛实验中,我们的方法在定性和定量上均始终优于最先进的LoRA融合方法。
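The per-layer KL-based weight selection can be sketched in a simplified form. This is a hypothetical reading of the mechanism: each branch's activations are treated as a softmax distribution, and the two LoRA deltas are weighted inversely to their KL divergence from the base model's features, so the branch that deviates less dominates at that layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions (both strictly positive)."""
    return float(np.sum(p * np.log(p / q)))

def fusion_weights(base_feat, subject_feat, style_feat):
    """Weight each LoRA inversely to its divergence from the base features."""
    p = softmax(base_feat)
    d_subj = kl(p, softmax(subject_feat))
    d_style = kl(p, softmax(style_feat))
    w_subj = d_style / (d_subj + d_style + 1e-8)
    w_style = d_subj / (d_subj + d_style + 1e-8)
    return w_subj, w_style

base = np.array([1.0, 2.0, 3.0])
subj = np.array([1.1, 2.0, 2.9])    # close to the base features
style = np.array([3.0, 0.5, 1.0])   # far from the base features
w_subj, w_style = fusion_weights(base, subj, style)
```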
cs.CV / 26 / 2602.15556

Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

揭示与增强核心视觉区域:利用内部注意力动态缓解大规模视觉语言模型中的幻觉
Lyu, Guangtao, Liu, Qi, Xu, Chenghao, Yan, Jiexi, Yang, Muli, Li, Xueting, Fang, Fen, Deng, Cheng
Abstract
LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free remedies fall short: contrastive decoding and auxiliary expert models incur several times more computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions even under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
Chinese Translation
大规模视觉语言模型(LVLMs)已实现强大的多模态推理能力,但仍然容易产生幻觉,输出与视觉输入或用户指令不一致的结果。现有的无训练方法,包括对比解码和辅助专家模型,虽然可以提高性能,但计算开销往往是其几倍,并可能引入潜在干扰,同时静态内部信号增强也常常容易受到注意力陷阱现象的影响。我们发现,LVLMs中的内部正向注意力动态(Positive Attention Dynamics, PAD)在注意力陷阱的扭曲下自然揭示了语义核心视觉区域。基于此,我们提出了正向注意力动态增强(Positive Attention Dynamics Enhancement, PADE),这是一种无训练的注意力干预方法,构建PAD图以识别语义核心视觉区域,采用每头中位绝对偏差缩放(Median Absolute Deviation Scaling)自适应控制干预强度,并利用系统-标记补偿(System-Token Compensation)来保持对复杂用户指令的关注,并支持长期输出一致性。在多个LVLMs和基准测试上的实验表明,PADE改善了视觉基础和减少了幻觉,验证了利用内部注意力动态进行可靠多模态推理的有效性。
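One plausible form of per-head Median Absolute Deviation scaling, boosting attention entries that sit far above the head's median while normalizing by the head's own dispersion, can be sketched as follows. This is a hypothetical construction for illustration; the paper's exact formulation is not given in the abstract:

```python
import numpy as np

def mad_scale(attn_head, alpha=1.0):
    """Amplify attention entries far above the head's median (candidate core
    visual regions); strength is normalized by the head's MAD so the same
    alpha adapts to heads with different dispersion."""
    med = np.median(attn_head)
    mad = np.median(np.abs(attn_head - med)) + 1e-8
    z = (attn_head - med) / mad                       # robust z-score per entry
    boosted = attn_head * (1 + alpha * np.clip(z, 0, None))
    return boosted / boosted.sum()                    # renormalize to a distribution

attn = np.array([0.02, 0.03, 0.02, 0.45, 0.03, 0.45])
out = mad_scale(attn)
```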
cs.CV / 27 / 2602.15579

Intracoronary Optical Coherence Tomography Image Processing and Vessel Classification Using Machine Learning

基于机器学习的冠状动脉内光学相干断层成像图像处理与血管分类
Lahchim, Amal, Athanasiou, Lambros
Abstract
Intracoronary Optical Coherence Tomography (OCT) enables high-resolution visualization of coronary vessel anatomy but presents challenges due to noise, imaging artifacts, and complex tissue structures. This paper proposes a fully automated pipeline for vessel segmentation and classification in OCT images using machine learning techniques. The proposed method integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction. These features are used to train Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification. Experimental results demonstrate excellent performance, achieving precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%. The proposed approach provides accurate vessel boundary detection while maintaining low computational complexity and requiring minimal manual annotation. This method offers a reliable and efficient solution for automated OCT image analysis and has potential applications in clinical decision support and real-time medical image processing.
Chinese Translation
冠状动脉内光学相干断层成像(OCT)能够高分辨率地可视化冠状血管解剖结构,但由于噪声、成像伪影和复杂的组织结构,面临挑战。本文提出了一种完全自动化的管道,用于使用机器学习技术对OCT图像中的血管进行分割和分类。所提方法集成了图像预处理、导丝伪影去除、极坐标到笛卡尔坐标的转换、无监督K均值聚类和局部特征提取。这些特征用于训练逻辑回归(Logistic Regression)和支持向量机(Support Vector Machine)分类器,以实现逐像素的血管分类。实验结果表明,该方法表现优异,精确度、召回率和F1-score值均达到1.00,总体分类准确率为99.68%。所提方法在保持低计算复杂度和最小手动标注的同时,提供了准确的血管边界检测。这种方法为自动化OCT图像分析提供了可靠且高效的解决方案,并在临床决策支持和实时医学图像处理方面具有潜在应用。
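The polar-to-Cartesian step of the pipeline is a standard remapping of the raw OCT frame (A-lines along the angle axis, depth along the radius axis) into an anatomically faithful view. A nearest-neighbour sketch on a synthetic frame:

```python
import numpy as np

def polar_to_cartesian(polar_img, out_size=33):
    """Nearest-neighbour remapping of an OCT frame stored as
    (angle, radius) rows into a square Cartesian view."""
    n_theta, n_r = polar_img.shape
    out = np.zeros((out_size, out_size))
    c = (out_size - 1) / 2.0
    for y in range(out_size):
        for x in range(out_size):
            dx, dy = x - c, y - c
            r = np.hypot(dx, dy) / c * (n_r - 1)
            if r > n_r - 1:                 # pixel lies outside the scanned disc
                continue
            t = (np.arctan2(dy, dx) % (2 * np.pi)) / (2 * np.pi) * n_theta
            out[y, x] = polar_img[int(t) % n_theta, int(round(r))]
    return out

# Synthetic polar frame whose value equals its radial bin index
polar = np.tile(np.arange(8.0), (16, 1))    # 16 angles x 8 radii
cart = polar_to_cartesian(polar)
```

Production pipelines would use bilinear interpolation (e.g. OpenCV's warpPolar) rather than nearest-neighbour lookup, but the coordinate mapping is the same.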
cs.CV / 28 / 2602.15584

An Industrial Dataset for Scene Acquisitions and Functional Schematics Alignment

用于场景采集和功能原理图对齐的工业数据集
Armangeon, Flavien, Ehret, Thibaud, Meinhardt-Llopis, Enric, von Gioi, Rafael Grompone, Thibault, Guillaume, Petit, Marc, Facciolo, Gabriele
Abstract
Aligning functional schematics with 2D and 3D scene acquisitions is crucial for building digital twins, especially for old industrial facilities that lack native digital models. Current manual alignment using images and LiDAR data does not scale due to tediousness and complexity of industrial sites. Inconsistencies between schematics and reality, and the scarcity of public industrial datasets, make the problem both challenging and underexplored. This paper introduces IRIS-v2, a comprehensive dataset to support further research. It includes images, point clouds, 2D annotated boxes and segmentation masks, a CAD model, 3D pipe routing information, and the P&ID (Piping and Instrumentation Diagram). The alignment is experimented on a practical case study, aiming at reducing the time required for this task by combining segmentation and graph matching.
Chinese Translation
将功能原理图与二维和三维场景采集对齐对于构建数字双胞胎至关重要,特别是对于缺乏原生数字模型的旧工业设施。目前,使用图像和激光雷达(LiDAR)数据进行的手动对齐由于工业场所的繁琐性和复杂性而难以扩展。原理图与现实之间的不一致性以及公共工业数据集的稀缺,使得这一问题既具有挑战性又未被充分探索。本文介绍了IRIS-v2,这是一个全面的数据集,以支持进一步的研究。该数据集包括图像、点云、二维标注框和分割掩码、CAD模型、三维管道布置信息以及P&ID(管道和仪表图)。在一个实际案例研究中进行了对齐实验,旨在通过结合分割和图匹配来减少该任务所需的时间。
cs.CV / 29 / 2602.15650

Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation

概念增强的多模态检索增强生成:迈向可解释和准确的放射学报告生成
Salmè, Marco, Siciliano, Federico, Silvestri, Fabrizio, Soda, Paolo, Sicilia, Rosa, Guarrasi, Valerio
Abstract
Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods targeting factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.
Chinese Translation
通过视觉-语言模型(VLMs)进行放射学报告生成(RRG)有望减轻文档负担,提高报告一致性,并加速临床工作流程。然而,由于缺乏可解释性以及产生与影像证据不符的幻觉发现的倾向,其临床应用仍然受到限制。现有研究通常将可解释性和准确性视为独立目标,基于概念的可解释性技术主要关注透明度,而检索增强生成(RAG)方法则通过外部检索针对事实基础。我们提出了概念增强的多模态检索增强生成(CEMRAG),这是一个统一框架,将视觉表征分解为可解释的临床概念,并将其与多模态RAG集成。这种方法利用丰富的上下文提示来改善RRG,提高了可解释性和事实准确性。在MIMIC-CXR和IU X-Ray上进行的实验,涵盖多种VLM架构、训练方案和检索配置,显示在临床准确性指标和标准自然语言处理(NLP)度量上,相较于传统RAG和仅基于概念的基线,均有一致的改善。这些结果挑战了可解释性与性能之间的假定权衡,表明透明的视觉概念可以增强而非妨碍医疗VLM中的诊断准确性。我们的模块化设计将可解释性分解为视觉透明度和结构化语言模型条件,为临床可信的AI辅助放射学提供了一条原则性路径。
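The retrieval half of a multimodal RAG system reduces to nearest-neighbour search in an embedding space; the retrieved reports are then concatenated into the VLM prompt. A minimal cosine top-k sketch with toy embeddings (the concept-decomposition half is paper-specific and not reproduced here):

```python
import numpy as np

def retrieve_top_k(query, corpus, k=2):
    """Return indices and similarities of the k corpus embeddings
    most cosine-similar to the query embedding."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

corpus = np.array([[1.0, 0.0, 0.0],     # toy report embeddings
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])      # toy study embedding
idx, sims = retrieve_top_k(query, corpus)
```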
cs.CV / 30 / 2602.15656

A Novel Public Dataset for Strawberry (Fragaria x ananassa) Ripeness Detection and Comparative Evaluation of YOLO-Based Models

一种新颖的草莓(Fragaria x ananassa)成熟度检测公共数据集及基于YOLO模型的比较评估
Yurdakul, Mustafa, Bastug, Zeynep Sena, Gok, Ali Emre, Taşdemir, Sakir
Abstract
The strawberry (Fragaria x ananassa), known worldwide for its economic value and nutritional richness, is a widely cultivated fruit. Determining the correct ripeness level during the harvest period is crucial for both preventing losses for producers and ensuring consumers receive a quality product. However, traditional methods, i.e., visual assessments alone, can be subjective and have a high margin of error. Therefore, computer-assisted systems are needed. However, the scarcity of comprehensive datasets accessible to everyone in the literature makes it difficult to compare studies in this field. In this study, a new and publicly available strawberry ripeness dataset, consisting of 566 images and 1,201 labeled objects, prepared under variable light and environmental conditions in two different greenhouses in Turkey, is presented to the literature. Comparative tests conducted on the data set using YOLOv8, YOLOv9, and YOLO11-based models showed that the highest precision value was 90.94% in the YOLOv9c model, while the highest recall value was 83.74% in the YOLO11s model. In terms of the general performance criterion mAP@50, YOLOv8s was the best performing model with a success rate of 86.09%. The results show that small and medium-sized models work more balanced and efficiently on this type of dataset, while also establishing a fundamental reference point for smart agriculture applications.
Chinese Translation
草莓(Fragaria x ananassa)因其经济价值和营养丰富而闻名于世,是一种广泛种植的水果。在收获期间确定正确的成熟度水平对防止生产者损失和确保消费者获得优质产品至关重要。然而,传统方法,即仅依赖视觉评估,可能存在主观性和较高的误差范围。因此,需要计算机辅助系统。然而,文献中可供所有人使用的综合数据集稀缺,使得在该领域进行研究比较变得困难。本研究提出了一个新的公开草莓成熟度数据集,该数据集包含566张图像和1201个标注对象,数据是在土耳其两个不同温室中,在不同光照和环境条件下准备的。对该数据集进行的比较测试使用了基于YOLOv8、YOLOv9和YOLO11模型的算法,结果显示YOLOv9c模型的最高精确度为90.94%,而YOLO11s模型的最高召回率为83.74%。在总体性能标准mAP@50方面,YOLOv8s是表现最佳的模型,成功率为86.09%。结果表明,小型和中型模型在此类数据集上工作更为平衡和高效,同时为智能农业应用建立了基本的参考点。
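The precision/recall figures above come from IoU-based matching of predicted boxes to ground truth. A minimal sketch of IoU and greedy one-to-one matching at a 0.5 threshold (toy boxes, not the dataset's annotations):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes."""
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= iou_thr:
                matched.add(i)
                tp += 1
                break
    return tp / max(len(preds), 1), tp / max(len(gts), 1)

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]   # one good match, one false positive
prec, rec = precision_recall(preds, gts)
```

mAP@50 additionally sweeps a confidence threshold per class and averages the area under the resulting precision-recall curves.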
cs.CV / 31 / 2602.15660

Bayesian Optimization for Design Parameters of 3D Image Data Analysis

用于三维图像数据分析设计参数的贝叶斯优化
Exler, David, Gómez, Joaquin Eduardo Urrutia, Krüger, Martin, Schliephake, Maike, Jbeily, John, Vitacolonna, Mario, Rudolf, Rüdiger, Reischl, Markus
Abstract
Deep learning-based segmentation and classification are crucial to large-scale biomedical imaging, particularly for 3D data, where manual analysis is impractical. Although many methods exist, selecting suitable models and tuning parameters remains a major bottleneck in practice. Hence, we introduce the 3D data Analysis Optimization Pipeline, a method designed to facilitate the design and parameterization of segmentation and classification using two Bayesian Optimization stages. First, the pipeline selects a segmentation model and optimizes postprocessing parameters using a domain-adapted synthetic benchmark dataset. To ensure a concise evaluation of segmentation performance, we introduce a segmentation quality metric that serves as the objective function. Second, the pipeline optimizes design choices of a classifier, such as encoder and classifier head architectures, incorporation of prior knowledge, and pretraining strategies. To reduce manual annotation effort, this stage includes an assisted class-annotation workflow that extracts predicted instances from the segmentation results and sequentially presents them to the operator, eliminating the need for manual tracking. In four case studies, the 3D data Analysis Optimization Pipeline efficiently identifies effective model and parameter configurations for individual datasets.
Chinese Translation
基于深度学习的分割和分类对于大规模生物医学成像至关重要,尤其是在三维数据中,手动分析是不切实际的。尽管存在许多方法,但选择合适的模型和调整参数仍然是实践中的主要瓶颈。因此,我们引入了三维数据分析优化管道(3D Data Analysis Optimization Pipeline),该方法旨在通过两个贝叶斯优化阶段促进分割和分类的设计与参数化。首先,管道使用领域适应的合成基准数据集选择分割模型并优化后处理参数。为了确保对分割性能的简明评估,我们引入了一种分割质量指标,作为目标函数。其次,管道优化分类器的设计选择,例如编码器和分类器头架构、先验知识的整合以及预训练策略。为了减少手动标注的工作量,此阶段包括一个辅助类标注工作流程,该工作流程从分割结果中提取预测实例,并将其顺序呈现给操作员,从而消除了手动跟踪的需要。在四个案例研究中,三维数据分析优化管道有效识别出适合各个数据集的有效模型和参数配置。
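The two-stage structure described above can be sketched in a few lines. This is a hypothetical illustration only: the objective functions, parameter names, and the random-search loop standing in for the paper's Bayesian Optimization stages are all assumptions, not the pipeline's actual implementation.

```python
import random

def segmentation_quality(threshold, min_size):
    # Hypothetical stage-1 objective: stands in for the paper's segmentation
    # quality metric evaluated on the domain-adapted benchmark dataset.
    return -((threshold - 0.6) ** 2 + (min_size / 100 - 0.3) ** 2)

def classifier_score(config):
    # Hypothetical stage-2 objective over discrete design choices.
    table = {("resnet", True): 0.91, ("resnet", False): 0.85,
             ("vit", True): 0.88, ("vit", False): 0.80}
    return table[(config["encoder"], config["pretrained"])]

def optimize(objective, sampler, n_trials=50, seed=0):
    # Random search stands in for each Bayesian Optimization stage;
    # the propose -> evaluate -> keep-best loop has the same interface.
    rng = random.Random(seed)
    best_cfg, best_val = None, float("-inf")
    for _ in range(n_trials):
        cfg = sampler(rng)
        val = objective(cfg)
        if val > best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Stage 1: postprocessing parameters of the selected segmentation model.
stage1_cfg, _ = optimize(
    lambda c: segmentation_quality(**c),
    lambda rng: {"threshold": rng.uniform(0, 1), "min_size": rng.randint(0, 100)})

# Stage 2: classifier design choices, run after stage 1 is fixed.
stage2_cfg, stage2_val = optimize(
    classifier_score,
    lambda rng: {"encoder": rng.choice(["resnet", "vit"]),
                 "pretrained": rng.choice([True, False])})
print(stage1_cfg, stage2_cfg, stage2_val)
```

In a real deployment each `optimize` call would be a surrogate-model-driven BO loop (e.g. Gaussian-process based) rather than random sampling; only the two-stage decomposition matches the paper.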
cs.CV / 32 / 2602.15712

Criteria-first, semantics-later: reproducible structure discovery in image-based sciences

先标准,后语义:基于图像的科学中的可重复结构发现
Bumberger, Jan
Abstract
Across the natural and life sciences, images have become a primary measurement modality, yet the dominant analytic paradigm remains semantics-first. Structure is recovered by predicting or enforcing domain-specific labels. This paradigm fails systematically under the conditions that make image-based science most valuable, including open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring in which domain ontologies and associated label sets drift culturally, institutionally, and ecologically. A deductive inversion is proposed in the form of criteria-first and semantics-later. A unified framework for criteria-first structure discovery is introduced. It separates criterion-defined, semantics-free structure extraction from downstream semantic mapping into domain ontologies or vocabularies and provides a domain-general scaffold for reproducible analysis across image-based sciences. Reproducible science requires that the first analytic layer perform criterion-driven, semantics-free structure discovery, yielding stable partitions, structural fields, or hierarchies defined by explicit optimality criteria rather than local domain ontologies. Semantics is not discarded; it is relocated downstream as an explicit mapping from the discovered structural product to a domain ontology or vocabulary, enabling plural interpretations and explicit crosswalks without rewriting upstream extraction. Grounded in cybernetics, observation-as-distinction, and information theory's separation of information from meaning, the argument is supported by cross-domain evidence showing that criteria-first components recur whenever labels do not scale. Finally, consequences are outlined for validation beyond class accuracy and for treating structural products as FAIR, AI-ready digital objects for long-term monitoring and digital twins.
Chinese Translation
在自然科学和生命科学中,图像已成为主要的测量方式,但主导的分析范式仍然是语义优先:结构通过预测或强制领域特定标签来恢复。这一范式恰恰在使基于图像的科学最有价值的条件下系统性失败,包括开放式科学发现、跨传感器和跨地点的可比性,以及领域本体和相关标签集在文化、制度和生态上不断漂移的长期监测。我们提出了一种演绎式的反转,形式为先标准、后语义,并引入了一个统一的先标准结构发现框架。它将由标准定义的、无语义的结构提取与下游到领域本体或词汇的语义映射分开,为基于图像的科学提供了一个领域通用的可重复分析支架。可重复的科学要求第一个分析层执行标准驱动的、无语义的结构发现,产生由明确的最优性标准(而非局部领域本体)定义的稳定分区、结构场或层次。语义并未被抛弃,而是作为从所发现的结构产品到领域本体或词汇的明确映射被重新安置到下游,使多重解释和明确的交叉映射成为可能,而无需重写上游提取。该论点基于控制论、观察即区分以及信息论中信息与意义的分离,并得到跨领域证据的支持:每当标签无法扩展时,先标准的组件就会反复出现。最后,概述了其对超越类别准确率的验证的影响,以及将结构产品视为符合FAIR原则、AI就绪的数字对象以用于长期监测和数字孪生的意义。
cs.CV / 33 / 2602.15720

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

ToaSt:高效ViT的令牌通道选择与结构化剪枝
Moon, Hyunchan, Park, Cheonjun, Waslander, Steven L.
Abstract
Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60\% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52\% accuracy (+1.64\%) with 39.4\% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
Chinese Translation
视觉变换器(ViTs)在各种视觉任务中取得了显著成功,但其部署常常受到高昂计算成本的制约。虽然结构化权重剪枝和令牌压缩已成为有前景的解决方案,但它们分别面临着较长的再训练时间和导致优化挑战的全局传播问题。我们提出了ToaSt,这是一个解耦框架,针对不同的ViT组件应用专门的策略。我们对多头自注意力模块应用了耦合的头级结构化剪枝,利用注意力操作的特性来增强鲁棒性。对于前馈网络(占FLOPs的60%以上),我们引入了令牌通道选择(Token Channel Selection, TCS),在提高压缩比的同时避免全局传播问题。我们的分析表明,TCS在选择过程中有效过滤了冗余噪声。对包括DeiT、ViT-MAE和Swin Transformer在内的九个不同模型的广泛评估表明,ToaSt在准确性和效率之间实现了优越的权衡,始终优于现有基线。在ViT-MAE-Huge上,ToaSt实现了88.52%的准确率(+1.64%),并减少了39.4%的FLOPs。ToaSt在下游任务中有效迁移,在COCO目标检测中实现了52.2的mAP,相较于51.9。代码和模型将在接受后发布。
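The core idea of selecting a per-token subset of FFN hidden channels can be sketched as follows. This is a hypothetical, unlearned stand-in: function names, the top-k-by-magnitude rule, and the nested-list weight layout are assumptions for illustration, not the paper's TCS module.

```python
def ffn_with_tcs(x, w1, w2, keep_ratio=0.5):
    # x: one token's input vector; w1 (hidden x in) and w2 (out x hidden)
    # are the FFN weight matrices, stored as nested lists.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]  # ReLU
    k = max(1, int(len(hidden) * keep_ratio))
    # Keep only the top-k hidden channels for this token...
    kept = sorted(range(len(hidden)), key=lambda i: -abs(hidden[i]))[:k]
    # ...so the second projection touches k channels instead of all of them,
    # cutting FLOPs in the FFN without touching other tokens or layers.
    return [sum(row[i] * hidden[i] for i in kept) for row in w2]
```

Because the selection is local to each token and each FFN, dropping channels here avoids the global-propagation issue the abstract attributes to whole-token compression.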
cs.CV / 34 / 2602.15724

Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation

学习检索可导航候选者以实现高效的视觉与语言导航
Gu, Shutian, Huang, Chengkai, Wang, Ruoyu, Yao, Lina
Abstract
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency. These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
Chinese Translation
视觉与语言导航(VLN)要求代理遵循自然语言指令,在先前未见过的环境中导航。近期的方法因大型语言模型(LLMs)的灵活性和推理能力,越来越多地将其用作高层导航器。然而,基于提示的LLM导航常常面临决策效率低下的问题,因为模型必须在每一步都从头解释指令,并在嘈杂冗长的可导航候选中进行推理。本文提出了一种检索增强框架,在不修改或微调底层语言模型的情况下,提高基于LLM的VLN的效率和稳定性。我们的方法在两个互补层次上引入检索。在情节(episode)层面,指令级嵌入检索器选择语义相似的成功导航轨迹作为上下文示例,为指令接地(grounding)提供任务特定的先验。在步骤层面,通过模仿学习得到的候选检索器在LLM推理之前剪除无关的可导航方向,从而减少动作歧义和提示复杂性。这两个检索模块都是轻量级、模块化的,并且独立于LLM进行训练。我们在Room-to-Room(R2R)基准上评估了我们的方法。实验结果表明,在已见和未见环境中,成功率(Success Rate)、Oracle成功率和SPL均有一致的提升。消融研究进一步表明,指令级示例检索和候选剪枝对全局指导和逐步决策效率提供了互补的好处。这些结果表明,检索增强的决策支持是增强基于LLM的视觉与语言导航的一种有效且可扩展的策略。
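The two retrieval levels can be sketched as plain similarity search plus candidate pruning. Everything here is a hypothetical stand-in: the embeddings, the `memory` layout, and the scoring callable replace the paper's trained retrievers.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve_exemplars(instr_emb, memory, k=2):
    # Episode level: return the trajectories of the k past successful
    # episodes whose instruction embeddings are most similar to the query;
    # these become in-context exemplars for the LLM navigator.
    ranked = sorted(memory, key=lambda m: -cosine(instr_emb, m["emb"]))
    return [m["trajectory"] for m in ranked[:k]]

def prune_candidates(candidates, score, keep=2):
    # Step level: keep only the navigable directions the (imitation-learned)
    # candidate retriever scores highest, shrinking the LLM prompt.
    return sorted(candidates, key=score, reverse=True)[:keep]
```

A real system would obtain `instr_emb` from a text encoder and `score` from the trained step-level retriever; the LLM itself is untouched.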
cs.CV / 35 / 2602.15727

Spanning the Visual Analogy Space with a Weight Basis of LoRAs

利用LoRA权重基础跨越视觉类比空间
Manor, Hila, Gal, Rinon, Maron, Haggai, Michaeli, Tomer, Chechik, Gal
Abstract
Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}$, $\mathbf{a}'$, $\mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are in https://research.nvidia.com/labs/par/lorweb
Chinese Translation
视觉类比学习通过示范而非文本描述实现图像操作,使用户能够指定难以用语言表达的复杂变换。给定一个三元组 $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$,目标是生成 $\mathbf{b}'$,使得 $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$。最近的方法通过使用单个低秩适配(Low-Rank Adaptation, LoRA)模块将文本到图像模型适配于此任务,但它们面临一个根本性的限制:试图在固定的适配模块中捕捉多样的视觉变换空间,限制了泛化能力。受近期研究(表明受限领域中的LoRA可以张成有意义、可插值的语义空间)的启发,我们提出了LoRWeB,这是一种新颖的方法,通过动态组合学习到的变换原语,在推理时为每个类比任务专门化模型,非正式地说,即在“LoRA空间”中选择一个点。我们引入了两个关键组件:(1)一组可学习的LoRA模块基,用以张成不同视觉变换的空间;(2)一个轻量级编码器,根据输入的类比对动态选择和加权这些基LoRA。全面的评估表明,我们的方法实现了最先进的性能,并显著改善了对未见视觉变换的泛化能力。我们的发现表明,LoRA基分解是灵活视觉操作的一个有前景的方向。代码和数据可在 https://research.nvidia.com/labs/par/lorweb 获取。
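The "weighted combination of basis LoRAs" can be made concrete with a few lines of matrix arithmetic. This is a sketch under stated assumptions: the nested-list layout and the fixed example weights stand in for the lightweight encoder's predictions; it is not LoRWeB's implementation.

```python
def compose_lora(basis, weights):
    # basis: list of (A, B) low-rank factor pairs (A: r x d_in, B: d_out x r);
    # weights: per-basis scalars, in LoRWeB predicted by a lightweight
    # encoder from the analogy pair (a, a') -- passed in directly here.
    # Effective update applied to the frozen backbone weight:
    #   delta_W = sum_i w_i * (B_i @ A_i)
    rows, cols = len(basis[0][1]), len(basis[0][0][0])
    delta = [[0.0] * cols for _ in range(rows)]
    for (A, B), w in zip(basis, weights):
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += w * sum(B[i][k] * A[k][j] for k in range(len(A)))
    return delta
```

Varying `weights` moves the effective adapter continuously through the span of the basis, which is what lets one trained basis cover transformations none of its members encodes alone.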
cs.CV / 36 / 2602.15734

Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding

基于语言和几何的稀疏体素表示用于整体场景理解
Wu, Guile, Huang, David, Liu, Bingbing, Bai, Dongfeng
Abstract
Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
Chinese Translation
现有的3D开放词汇场景理解方法主要强调从2D基础模型中提取语言特征到3D特征场,但在很大程度上忽视了场景外观、语义和几何之间的协同作用。因此,场景理解往往偏离场景的基础几何结构,并与重建过程脱节。在本研究中,我们提出了一种新颖的方法,利用基于语言和几何的稀疏体素表示,在统一框架内全面建模外观、语义和几何。具体而言,我们使用3D稀疏体素作为基本元素,并采用外观场、密度场、特征场和置信场来整体表示3D场景。为了促进外观、密度和特征场之间的协同作用,我们构建了一个特征调制模块,并将语言特征从2D基础模型提取到我们的3D场景模型中。此外,我们将几何蒸馏集成到特征场蒸馏中,通过深度相关正则化和模式一致性正则化将几何知识从几何基础模型转移到我们的3D场景表示中。这些组件共同协作,在统一框架内协同建模3D场景的外观、语义和几何。大量实验表明,我们的方法在整体场景理解和重建方面相比于最先进的方法实现了更优的整体性能。
cs.CV / 37 / 2602.15755

RaCo: Ranking and Covariance for Practical Learned Keypoints

RaCo:用于实际学习关键点的排序与协方差
Shenoi, Abhiram, Lindenberger, Philipp, Sarlin, Paul-Edouard, Pollefeys, Marc
Abstract
This paper introduces RaCo, a lightweight neural network designed to learn robust and versatile keypoints suitable for a variety of 3D computer vision tasks. The model integrates three key components: the repeatable keypoint detector, a differentiable ranker to maximize matches with a limited number of keypoints, and a covariance estimator to quantify spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo operates without the need for covisible image pairs. It achieves strong rotational robustness through extensive data augmentation, even without the use of computationally expensive equivariant network architectures. The method is evaluated on several challenging datasets, where it demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, particularly under large in-plane rotations. Ultimately, RaCo provides an effective and simple strategy to independently estimate keypoint ranking and metric covariance without additional labels, detecting interpretable and repeatable interest points. The code is available at https://github.com/cvg/RaCo.
Chinese Translation
本文介绍了RaCo,一种轻量级神经网络,旨在学习适用于多种3D计算机视觉任务的鲁棒且多功能的关键点。该模型集成了三个关键组件:可重复的关键点检测器、一个可微分的排序器,用于在有限数量的关键点中最大化匹配,以及一个协方差估计器,用于量化度量尺度下的空间不确定性。RaCo仅在透视图像裁剪上进行训练,无需共视图像对。通过广泛的数据增强,RaCo实现了强大的旋转鲁棒性,即使在不使用计算成本高昂的等变网络架构的情况下。该方法在多个具有挑战性的数据集上进行了评估,展示了在关键点重复性和双视图匹配方面的最先进性能,特别是在大范围平面内旋转下。最终,RaCo提供了一种有效且简单的策略,能够独立估计关键点排序和度量协方差,而无需额外标签,检测可解释且可重复的兴趣点。代码可在 https://github.com/cvg/RaCo 获取。
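What a "metric covariance" for a keypoint encodes can be illustrated with the classical moment-based construction below. Note the hedge in the comments: RaCo learns its covariance estimates; this computed version only shows the quantity being estimated.

```python
def patch_covariance(patch):
    # Treat a local detector score patch (nested lists, rows indexed by y)
    # as an unnormalized density and return the 2x2 spatial covariance of
    # the keypoint location. RaCo *learns* covariances with an estimator
    # head; this classical construction only illustrates the target.
    total = float(sum(sum(row) for row in patch))
    mx = sum(x * v for row in patch for x, v in enumerate(row)) / total
    my = sum(y * v for y, row in enumerate(patch) for v in row) / total
    cxx = cyy = cxy = 0.0
    for y, row in enumerate(patch):
        for x, v in enumerate(row):
            cxx += v * (x - mx) ** 2
            cyy += v * (y - my) ** 2
            cxy += v * (x - mx) * (y - my)
    return [[cxx / total, cxy / total], [cxy / total, cyy / total]]
```

A sharp, isotropic peak yields a small diagonal covariance (confident localization in both axes); an elongated ridge yields a large variance along the ridge direction, which downstream solvers can use to down-weight that keypoint.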
cs.CV / 38 / 2602.15772

Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

理解与生成:多模态模型中的优化困境导航
Ye, Sen, Xu, Mengde, Gu, Shuyang, He, Di, Wang, Liwei, Hu, Han
Abstract
Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyzed this trade-off and identify the primary cause might be the potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we successfully mitigate the optimization dilemma, achieved stronger generation results and improved understanding ability which are related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
Chinese Translation
当前多模态模型的研究面临一个关键挑战,即增强生成能力往往以理解能力为代价,反之亦然。我们分析了这一权衡,并确定主要原因可能是生成与理解之间的潜在冲突,这在模型内部形成了竞争动态。为了解决这一问题,我们提出了Reason-Reflect-Refine (R3) 框架。该创新算法将单步生成任务重新构架为“生成-理解-再生成”的多步过程。通过在生成过程中明确利用模型的理解能力,我们成功缓解了优化困境,实现了更强的生成结果和与生成过程相关的理解能力的提升。这为设计下一代统一多模态模型提供了宝贵的见解。代码可在 https://github.com/sen-ye/R3 获取。
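The "generate-understand-regenerate" loop can be sketched as a small control flow. The callables here are hypothetical stand-ins for the unified model's two capabilities, not the paper's API.

```python
def reason_reflect_refine(prompt, generate, understand, max_rounds=3):
    # R3 re-frames one-shot generation as generate -> understand -> regenerate.
    # `generate` and `understand` are stand-in callables for the multimodal
    # model's generation and understanding capabilities.
    image = generate(prompt, feedback=None)
    for _ in range(max_rounds):
        critique = understand(prompt, image)  # does the output match the prompt?
        if critique is None:                  # nothing to fix: stop early
            return image
        image = generate(prompt, feedback=critique)
    return image
```

The point of the structure is that the understanding pass produces an explicit critique that conditions the next generation pass, so the two capabilities cooperate instead of competing during training.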
cs.CV / 39 / 2602.15775

NeRFscopy: Neural Radiance Fields for in-vivo Time-Varying Tissues from Endoscopy

NeRFscopy:用于内窥镜下活体时变组织的神经辐射场
Salort-Benejam, Laura, Agudo, Antonio
Abstract
Endoscopy is essential in medical imaging, used for diagnosis, prognosis and treatment. Developing a robust dynamic 3D reconstruction pipeline for endoscopic videos could enhance visualization, improve diagnostic accuracy, aid in treatment planning, and guide surgery procedures. However, challenges arise due to the deformable nature of the tissues, the use of monocular cameras, illumination changes, occlusions and unknown camera trajectories. Inspired by neural rendering, we introduce NeRFscopy, a self-supervised pipeline for novel view synthesis and 3D reconstruction of deformable endoscopic tissues from a monocular video. NeRFscopy includes a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. In addition, the color images are efficiently exploited by introducing sophisticated terms to learn a 3D implicit model without assuming any template or pre-trained model, solely from data. NeRFscopy achieves accurate results in terms of novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.
Chinese Translation
内窥镜在医学成像中至关重要,广泛用于诊断、预后和治疗。为内窥镜视频开发一个稳健的动态三维重建管道,可以增强可视化效果,提高诊断准确性,辅助治疗规划,并指导手术过程。然而,由于组织的可变形特性、单目相机的使用、光照变化、遮挡以及未知的相机轨迹等因素,这一任务面临诸多挑战。受神经渲染的启发,我们提出了NeRFscopy,这是一个自监督管道,用于从单目视频中对可变形的内窥镜组织进行新视角合成和三维重建。NeRFscopy包括一个具有规范辐射场(canonical radiance field)的可变形模型,以及一个由SE(3)变换参数化的时间相关变形场。此外,通过引入精心设计的损失项高效利用彩色图像,仅从数据中学习三维隐式模型,而不假设任何模板或预训练模型。NeRFscopy在新视角合成方面取得了准确的结果,在各种具有挑战性的内窥镜场景中超越了竞争方法。
cs.CV / 40 / 2602.15782

Meteorological data and Sky Images meets Neural Models for Photovoltaic Power Forecasting

气象数据与天空图像结合神经模型用于光伏发电预测
Montoya-Espinagosa, Ines, Agudo, Antonio
Abstract
Due to the rise in the use of renewable energies as an alternative to traditional ones, and especially solar energy, there is increasing interest in studying how to address photovoltaic forecasting in the face of the challenge of variability in photovoltaic energy production, using different methodologies. This work develops a hybrid approach for short- and long-term forecasting based on two studies with the same purpose. A multimodal approach that combines images of the sky and photovoltaic energy history with meteorological data is proposed. The main goal is to improve the accuracy of ramp event prediction, increase the robustness of forecasts in cloudy conditions, and extend capabilities beyond nowcasting, to support more efficient operation of the power grid and better management of solar variability. Deep neural models are used for both nowcasting and forecasting solutions, incorporating individual and multiple meteorological variables, as well as an analytical solar position. The results demonstrate that the inclusion of meteorological data, particularly the surface long-wave radiation downwards and the combination of wind and solar position, significantly improves predictions in both nowcasting and forecasting tasks, especially on cloudy days. This study highlights the importance of integrating diverse data sources to improve the reliability and interpretability of solar energy prediction models.
Chinese Translation
随着可再生能源(尤其是太阳能)作为传统能源替代品的使用日益增加,人们越来越关注如何采用不同的方法应对光伏发电量波动性带来的预测挑战。本研究基于两项目标相同的研究,开发了一种用于短期和长期预测的混合方法,提出了一种将天空图像和光伏发电历史与气象数据相结合的多模态方法。主要目标是提高爬坡事件(ramp event)预测的准确性,增强多云条件下预测的稳健性,并将能力扩展到即时预测(nowcasting)之外,以支持电网更高效的运行和更好地管理太阳能的波动性。即时预测和预测方案均采用深度神经模型,结合了单个和多个气象变量以及解析计算的太阳位置。结果表明,纳入气象数据,特别是地表长波下行辐射以及风与太阳位置的组合,显著改善了即时预测和预测任务中的预测效果,尤其是在多云天气条件下。本研究强调了整合多样化数据源对提高太阳能预测模型可靠性和可解释性的重要性。
cs.CV / 41 / 2602.15783

Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers

基于上下文的可扩展图变换器皮肤癌上皮细胞分类
Sancéré, Lucas, Moreau, Noémie, Bozek, Katarzyna
Abstract
Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult to differentiate for image-based approaches. We first compared image-based and graph-based methods on a single WSI. Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of $85.2 \pm 1.5$ ($\pm$ standard error) and $85.1 \pm 2.5$ in 3-fold cross-validation, respectively, whereas the best image-based method reached $81.2 \pm 3.0$. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features as well as the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to train on several WSIs from several patients. To address the computational constraints of image-based models, we extracted four $2560 \times 2560$ pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of $83.6 \pm 1.9$ (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached $78.1 \pm 0.5$.
Chinese Translation
癌症患者的全切片图像(WSIs)包含丰富的信息,可用于医学诊断或跟踪治疗进展。为了自动化分析,已经开发了许多基于卷积神经网络和视觉变换器的深度学习方法,并在分割和分类任务中取得了良好的表现。然而,由于WSIs的庞大尺寸和复杂的细胞组织,这些模型依赖于基于图块的表示,丧失了重要的组织级上下文。我们提出在全WSI细胞图上使用可扩展的图变换器进行分类。我们在一个具有挑战性的任务上评估了该方法:在皮肤鳞状细胞癌(cSCC)中分类健康与肿瘤上皮细胞,这两种细胞类型表现出非常相似的形态,因此基于图像的方法难以区分。我们首先在单个WSI上比较了基于图像和基于图的方法。图变换器模型SGFormer和DIFFormer在3折交叉验证中分别达到了$85.2 \pm 1.5$($\pm$为标准误差)和$85.1 \pm 2.5$的平衡准确率,而最佳的基于图像的方法达到了$81.2 \pm 3.0$。通过评估几种节点特征配置,我们发现最具信息性的表示结合了形态学和纹理特征以及非上皮细胞的细胞类别,突显了周围细胞上下文的重要性。随后,我们将工作扩展到对来自多位患者的多个WSI进行训练。为了解决基于图像模型的计算约束,我们从每个图像中提取了四个$2560 \times 2560$像素的图块并将其转换为图。在这种情况下,DIFFormer达到了$83.6 \pm 1.9$的平衡准确率(3折交叉验证),而最先进的基于图像的模型CellViT256达到了$78.1 \pm 0.5$。
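Building the cell graph that a Graph Transformer consumes can be sketched with a simple k-nearest-neighbour construction. The dict layout and the k-NN rule are illustrative assumptions; the paper does not specify this exact construction.

```python
def build_cell_graph(cells, k=2):
    # cells: list of {"xy": (x, y), "feat": [...]} dicts; in the paper the
    # features would hold morphology/texture statistics plus the classes of
    # neighbouring non-epithelial cells (the most informative configuration).
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    edges = set()
    for i, c in enumerate(cells):
        # Connect each cell to its k nearest neighbours (undirected edges),
        # yielding the slide-level graph a Graph Transformer attends over.
        nearest = sorted((j for j in range(len(cells)) if j != i),
                         key=lambda j: d2(c["xy"], cells[j]["xy"]))[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return [c["feat"] for c in cells], sorted(edges)
```

Because nodes are cells rather than pixels, a full WSI compresses to a graph small enough for scalable Graph Transformers such as SGFormer or DIFFormer.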
cs.CV / 42 / 2602.15811

Task-Agnostic Continual Learning for Chest Radiograph Classification

任务无关的持续学习用于胸部X光分类
Kavitha, Muthu Subash, Zafar, Anas, Muneer, Amgad, Wu, Jia
Abstract
Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously ob- served data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiograph classification, in which heterogeneous chest X-ray datasets arrive sequentially and task identifiers are unavailable at inference. We propose a continual adapter-based routing learning strategy for Chest X-rays (CARL-XRay) that maintains a fixed high-capacity backbone and incrementally allocates lightweight task-specific adapters and classifier heads. A latent task selector operates on task-adapted features and leverages both current and historical context preserved through compact prototypes and feature-level experience replay. This design supports stable task identification and adaptation across sequential updates while avoiding raw-image storage. Experiments on large-scale public chest radiograph datasets demonstrate robust performance retention and reliable task-aware inference under continual dataset ingestion. CARL-XRay outperforms joint training under task-unknown deployment, achieving higher routing accuracy (75.0\% vs.\ 62.5\%), while maintaining competitive diagnostic performance with AUROC of 0.74 in the oracle setting with ground-truth task identity and 0.75 under task-unknown inference, using significantly fewer trainable parameters. Finally, the proposed framework provides a practical alternative to joint training and repeated full retraining in continual clinical deployment.
Chinese Translation
胸部X光分类器的临床部署需要能够在新数据集可用时进行更新的模型,而无需在先前观察到的数据上重新训练,也不降低已验证的性能。我们首次研究了胸部X光分类的任务增量持续学习设置,其中异构的胸部X光数据集顺序到达,并且任务标识符在推理时不可获得。我们提出了一种基于持续适配器的路由学习策略(CARL-XRay),该策略保持固定的高容量主干网络,并逐步分配轻量级的任务特定适配器和分类头。一个潜在任务选择器在任务适配后的特征上操作,并利用通过紧凑原型和特征级经验回放保留的当前和历史上下文。该设计支持在顺序更新中稳定的任务识别和适配,同时避免存储原始图像。在大规模公共胸部X光数据集上的实验表明,在持续摄取数据集的情况下,模型具有稳健的性能保持和可靠的任务感知推理能力。CARL-XRay在任务未知部署下优于联合训练,达到更高的路由准确率(75.0% 对比 62.5%),同时保持有竞争力的诊断性能:在具有真实任务身份的oracle设置中AUROC为0.74,在任务未知推理下为0.75,且使用的可训练参数显著更少。最后,所提出的框架为持续临床部署中的联合训练和重复完全重训练提供了一种实用的替代方案。
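The route-then-adapt inference path can be sketched as follows. The per-task dict, the nearest-prototype rule, and the callables are hypothetical simplifications: CARL-XRay's latent selector is learned, not a plain distance rule.

```python
def route_and_classify(feat, tasks):
    # tasks: {task_id: {"prototype": [...], "adapter": fn, "head": fn}} --
    # one compact prototype, one lightweight adapter, and one classifier
    # head per ingested dataset, on top of a frozen backbone.
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    # Task id is unknown at inference: route by nearest stored prototype.
    task_id = min(tasks, key=lambda t: d2(feat, tasks[t]["prototype"]))
    adapted = tasks[task_id]["adapter"](feat)   # task-specific adapter
    return task_id, tasks[task_id]["head"](adapted)
```

Adding a new dataset only appends a prototype/adapter/head triple; nothing previously validated is retrained, which is the point of the continual setup.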
cs.CV / 43 / 2602.15819

VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation

视频素描生成器:视频模型先验促进多样化的顺序素描生成
Ren, Hui, Alaluf, Yuval, Tal, Omer Bar, Schwing, Alexander, Torralba, Antonio, Vinker, Yael
Abstract
Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
Chinese Translation
素描本质上是一个顺序过程,笔画以有意义的顺序绘制,以探索和完善创意。然而,大多数生成模型将素描视为静态图像,忽视了创作性绘画背后的时间结构。我们提出了一种数据高效的顺序素描生成方法,将预训练的文本到视频扩散模型适配用于生成素描过程。我们的关键见解是,大型语言模型和视频扩散模型为此任务提供了互补的优势:LLMs提供语义规划和笔画排序,而视频扩散模型则作为强大的渲染器,生成高质量、时间上连贯的视觉效果。为此,我们将素描表示为短视频,其中笔画在文本指定的排序指令引导下,在空白画布上逐步绘制。我们引入了一种两阶段微调策略,将笔画排序的学习与素描外观的学习解耦:笔画排序通过具有受控时间结构的合成形状组合进行学习,而视觉外观则从仅七个手工创作的素描过程中蒸馏得到,这些过程同时捕捉了全局绘制顺序和单个笔画的连续形成。尽管人工绘制的素描数据极为有限,我们的方法仍能生成高质量的顺序素描,紧密遵循文本指定的排序,同时展现丰富的视觉细节。我们还通过笔刷风格条件和自回归素描生成等扩展展示了方法的灵活性,实现了额外的可控性以及交互式协作绘画。
人工智能 (Artificial Intelligence)
28
cs.AI / 1 / 2602.15067

Attention-gated U-Net model for semantic segmentation of brain tumors and feature extraction for survival prognosis

基于注意力门控U-Net模型的脑肿瘤语义分割及生存预后特征提取
Pate, Rut, Rajput, Snehal, Raval, Mehul S., Kapdi, Rupal A., Roy, Mohendra
Abstract
Gliomas, among the most common primary brain tumors, vary widely in aggressiveness, prognosis, and histology, making treatment challenging due to complex and time-intensive surgical interventions. This study presents an Attention-Gated Recurrent Residual U-Net (R2U-Net) based Triplanar (2.5D) model for improved brain tumor segmentation. The proposed model enhances feature representation and segmentation accuracy by integrating residual, recurrent, and triplanar architectures while maintaining computational efficiency, potentially aiding in better treatment planning. The proposed method achieves a Dice Similarity Score (DSC) of 0.900 for Whole Tumor (WT) segmentation on the BraTS2021 validation set, demonstrating performance comparable to leading models. Additionally, the triplanar network extracts 64 features per planar model for survival days prediction, which are reduced to 28 using an Artificial Neural Network (ANN). This approach achieves an accuracy of 45.71%, a Mean Squared Error (MSE) of 108,318.128, and a Spearman Rank Correlation Coefficient (SRC) of 0.338 on the test dataset.
Chinese Translation
胶质瘤是最常见的原发性脑肿瘤之一,其侵袭性、预后和组织学差异较大,使得治疗面临复杂且耗时的外科干预挑战。本研究提出了一种基于注意力门控递归残差U-Net(R2U-Net)的三平面(2.5D)模型,以改善脑肿瘤的分割效果。所提模型通过整合残差、递归和三平面架构,增强了特征表示和分割精度,同时保持计算效率,可能有助于更好的治疗规划。该方法在BraTS2021验证集上实现了整体肿瘤(WT)分割的Dice相似度评分(DSC)为0.900,显示出与领先模型相当的性能。此外,该三平面网络为生存天数预测提取了每个平面模型64个特征,通过人工神经网络(ANN)将其减少至28个。该方法在测试数据集上达到了45.71%的准确率,均方误差(MSE)为108,318.128,斯皮尔曼等级相关系数(SRC)为0.338。
cs.AI / 2 / 2602.15112

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

ResearchGym:在真实世界AI研究中评估语言模型代理
Garikaparthi, Aniketh, Patwardhan, Manasi, Cohan, Arman
Abstract
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability--reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%) by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. Yet in a single run, the agent surpasses the solution of an ICML 2025 Spotlight task, indicating that frontier agents can occasionally reach state-of-the-art performance, but do so unreliably. We additionally evaluate proprietary agent scaffolds including Claude Code (Opus-4.5) and Codex (GPT-5.2) which display a similar gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
Chinese Translation
我们介绍了ResearchGym,这是一个用于在端到端研究任务上评估AI代理的基准和执行环境。为此,我们重新利用了来自ICML、ICLR和ACL的五篇口头报告(oral)和聚光灯(spotlight)论文。从每篇论文的代码库中,我们保留了数据集、评估工具和基线实现,但不提供论文所提出的方法。由此形成了五个容器化任务环境,总共包含39个子任务。在每个环境中,代理必须提出新假设,进行实验,并尝试在论文的指标上超越强大的人类基线。在对一个由GPT-5驱动的代理的受控评估中,我们观察到明显的能力-可靠性差距。该代理在15次评估中仅有1次(6.7%)超过代码库提供的基线,提升幅度为11.5%,并且平均仅完成26.5%的子任务。我们识别出反复出现的长程失败模式,包括急躁、糟糕的时间和资源管理、对弱假设的过度自信、协调并行实验的困难以及上下文长度的硬性限制。然而,在某一次运行中,该代理超越了一个ICML 2025聚光灯任务的解决方案,这表明前沿代理偶尔能够达到最先进的性能,但并不可靠。我们还评估了包括Claude Code(Opus-4.5)和Codex(GPT-5.2)在内的专有代理框架,它们显示出类似的差距。ResearchGym为自主代理在闭环研究中的系统评估和分析提供了基础设施。
cs.AI / 3 / 2602.15143

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

通过追踪重写保护语言模型免受未经授权的蒸馏
Ma, Xinhang, Yeoh, William, Zhang, Ning, Vorobeychik, Yevgeniy
Abstract
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) \emph{anti-distillation}, or degrading the training usefulness of query responses, and (2) \emph{API watermarking}, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
Chinese Translation
知识蒸馏是一种广泛采用的技术,用于将大型语言模型(LLMs)的能力转移到更小、更高效的学生模型中。然而,未经授权的知识蒸馏不公平地利用了开发前沿模型所投入的巨大努力和成本。我们研究了修改教师生成的推理追踪的方法,以实现两个阻止未经授权蒸馏的目标:(1)\emph{反蒸馏},即降低查询响应的训练有效性;(2)\emph{API水印},即在学生模型中嵌入可验证的签名。我们介绍了几种动态重写教师推理输出的方法,同时保持答案的正确性和语义的连贯性。其中两种方法利用了大型语言模型的重写能力,其余方法则使用基于梯度的技术。我们的实验表明,一种简单的基于指令的重写方法在保持甚至提高教师性能的同时,达到了强大的反蒸馏效果。此外,我们还表明,我们的重写方法能够实现高度可靠的水印检测,几乎没有误报。
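The rewrite-while-preserving-correctness constraint can be sketched as a guard around a rewriting call. The callables below (`rewrite`, `solve`) are hypothetical stand-ins for an LLM rewriter and an answer checker; they are not the paper's pipeline.

```python
def protect_trace(question, trace, answer, rewrite, solve):
    # Sketch of instruction-based trace rewriting: perturb the reasoning
    # trace so it is less useful as distillation training data, then keep
    # the rewrite only if the final answer is preserved; otherwise fall
    # back to the original trace.
    candidate = rewrite(trace)
    if solve(question, candidate) == answer:  # answer-correctness guard
        return candidate
    return trace
```

A watermarking variant would make `rewrite` embed a detectable signature rather than (or in addition to) degrading training usefulness, with the same correctness guard.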
cs.AI / 4 / 2602.15156

Panini: Continual Learning in Token Space via Structured Memory

Panini:通过结构化记忆在标记空间中的持续学习
Rajesh, Shreyas, Holur, Pavan, Turali, Mehmet Yigit, Duan, Chenda, Roychowdhury, Vwani
Abstract
Language models are increasingly used to reason over content they were not trained on, such as new documents, evolving knowledge, and user-specific data. A common approach is retrieval-augmented generation (RAG), which stores verbatim documents externally (as chunks) and retrieves only a relevant subset at inference time for an LLM to reason over. However, this results in inefficient usage of test-time compute (LLM repeatedly reasons over the same documents); moreover, chunk retrieval can inject irrelevant context that increases unsupported generation. We propose a human-like non-parametric continual learning framework, where the base model remains fixed, and learning occurs by integrating each new experience into an external semantic memory state that accumulates and consolidates itself continually. We present Panini, which realizes this by representing documents as Generative Semantic Workspaces (GSW) -- an entity- and event-aware network of question-answer (QA) pairs, sufficient for an LLM to reconstruct the experienced situations and mine latent knowledge via reasoning-grounded inference chains on the network. Given a query, Panini only traverses the continually-updated GSW (not the verbatim documents or chunks), and retrieves the most likely inference chains. Across six QA benchmarks, Panini achieves the highest average performance, 5%-7% higher than other competitive baselines, while using 2-30x fewer answer-context tokens, supports fully open-source pipelines, and reduces unsupported answers on curated unanswerable queries. The results show that efficient and accurate structuring of experiences at write time -- as achieved by the GSW framework -- yields both efficiency and reliability gains at read time. Code is available at https://github.com/roychowdhuryresearch/gsw-memory.
Chinese Translation
语言模型越来越多地用于推理其未经过训练的内容,例如新文档、不断发展的知识和用户特定数据。一种常见的方法是检索增强生成(RAG),它将逐字文档外部存储(作为块),并在推理时仅检索相关子集供大型语言模型(LLM)进行推理。然而,这导致了测试时计算资源的低效使用(LLM重复对相同文档进行推理);此外,块检索可能会引入无关的上下文,从而增加不支持的生成。我们提出了一种类人非参数持续学习框架,其中基础模型保持固定,学习通过将每个新经验整合到一个外部语义记忆状态中进行,该状态不断累积和巩固。我们提出了Panini,它通过将文档表示为生成语义工作空间(Generative Semantic Workspaces, GSW)来实现这一点——一个感知实体和事件的问答(Question-Answer, QA)对网络,足以让LLM重构经历的情境并通过基于推理的推断链在网络上挖掘潜在知识。给定查询,Panini仅遍历不断更新的GSW(而不是逐字文档或块),并检索最可能的推理链。在六个问答基准测试中,Panini实现了最高的平均性能,比其他竞争基线高出5%-7%,同时使用的答案上下文标记减少了2-30倍,支持完全开源的管道,并减少了在策划的不可回答查询上的不支持答案。结果表明,在写入时高效且准确地结构化经验——如GSW框架所实现的——在读取时带来了效率和可靠性的提升。代码可在 https://github.com/roychowdhuryresearch/gsw-memory 获取。
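The read-time behavior described above, traversing only the memory graph rather than re-reading documents, can be illustrated with a toy traversal. This is a minimal sketch under our own assumptions: the real GSW is an entity- and event-aware QA network built by an LLM, whereas here it is a hand-written dictionary, and retrieval is plain BFS rather than the paper's reasoning-grounded chain mining:

```python
from collections import deque

# Hypothetical miniature GSW: nodes are entities, edges carry QA pairs.
gsw = {
    "Marie": [("Where does Marie work?", "at the clinic", "clinic")],
    "clinic": [("Who runs the clinic?", "Dr. Osei", "Dr. Osei")],
    "Dr. Osei": [],
}

def inference_chain(start, target):
    """Return the QA chain linking two entities, traversing only the memory graph."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, chain = queue.popleft()
        if node == target:
            return chain
        for question, answer, nxt in gsw.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, chain + [(question, answer)]))
    return None

print(inference_chain("Marie", "Dr. Osei"))
```

Because only the compact QA chain (not the verbatim documents) reaches the LLM at answer time, this is where the abstract's 2-30x token savings would come from.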
cs.AI / 5 / 2602.15158

da Costa and Tarski meet Goguen and Carnap: a novel approach for ontological heterogeneity based on consequence systems

达·科斯塔与塔尔斯基相遇戈根与卡尔纳普:基于推论系统的本体异质性新方法
Rocha, Gabriel
Abstract
This paper presents a novel approach for ontological heterogeneity that draws heavily from Carnapian-Goguenism, as presented by Kutz, Mossakowski and L\"ucke (2010). The approach is provisionally designated da Costian-Tarskianism, named after da Costa's Principle of Tolerance in Mathematics and after Alfred Tarski's work on the concept of a consequence operator. The approach is based on the machinery of consequence systems, as developed by Carnielli et al. (2008) and Citkin and Muravitsky (2022), and it introduces the idea of an extended consequence system, which is a consequence system extended with ontological axioms. The paper also defines the concept of an extended development graph, which is a graph structure that allows ontologies to be related via morphisms of extended consequence systems, and additionally via other operations such as fibring and splitting. Finally, we discuss the implications of this approach for the field of applied ontology and suggest directions for future research.
Chinese Translation
本文提出了一种基于卡尔纳普-戈根主义的本体异质性新方法,该方法受到Kutz、Mossakowski和Lücke(2010)研究的启发。该方法暂时被称为达·科斯塔-塔尔斯基主义,以达·科斯塔在数学中的容忍原则和阿尔弗雷德·塔尔斯基关于推论算子概念的研究命名。该方法基于推论系统的机制,该机制由Carnielli等(2008)和Citkin与Muravitsky(2022)发展而来,并引入了扩展推论系统的概念,即在本体公理的基础上扩展的推论系统。本文还定义了扩展发展图的概念,这是一种图结构,允许通过扩展推论系统的态射以及其他操作(如纤维化和分裂)将本体关联起来。最后,我们讨论了该方法对应用本体领域的影响,并提出了未来研究的方向。
cs.AI / 6 / 2602.15173

Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs

注意(DH)差距!推理与对话型大语言模型之间的风险选择对比
Ge, Luise, Zhang, Yongyan, Vorobeychik, Yevgeniy
Abstract
The use of large language models either as decision support systems, or in agentic workflows, is rapidly transforming the digital ecosystem. However, the understanding of LLM decision-making under uncertainty remains limited. We initiate a comparative study of LLM risky choices along two dimensions: (1) prospect representation (explicit vs. experience based) and (2) decision rationale (explanation). Our study, which involves 20 frontier and open LLMs, is complemented by a matched human subjects experiment, which provides one reference point, while an expected payoff maximizing rational agent model provides another. We find that LLMs cluster into two categories: reasoning models (RMs) and conversational models (CMs). RMs tend towards rational behavior, are insensitive to the order of prospects, gain/loss framing, and explanations, and behave similarly whether prospects are explicit or presented via experience history. CMs are significantly less rational, slightly more human-like, sensitive to prospect ordering, framing, and explanation, and exhibit a large description-history gap. Paired comparisons of open LLMs suggest that a key factor differentiating RMs and CMs is training for mathematical reasoning.
Chinese Translation
大型语言模型作为决策支持系统或在自主工作流程中的应用,正在迅速改变数字生态系统。然而,对于大语言模型在不确定性下的决策过程的理解仍然有限。我们启动了一项关于大语言模型风险选择的比较研究,沿着两个维度进行: (1) 前景表示(显式与基于经验)和 (2) 决策理由(解释)。我们的研究涉及20个前沿和开放的大语言模型,并辅以一个匹配的人类受试者实验,提供了一个参考点,而一个期望收益最大化的理性代理模型则提供了另一个参考。我们发现,大语言模型聚集成两类:推理模型(RMs)和对话模型(CMs)。推理模型倾向于理性行为,对前景的顺序、收益/损失框架和解释不敏感,无论前景是显式呈现还是通过经验历史呈现,其行为相似。对话模型则明显不那么理性,稍微更具人类特征,对前景顺序、框架和解释敏感,并表现出显著的描述-历史差距。对开放大语言模型的配对比较表明,区分推理模型和对话模型的一个关键因素是数学推理的训练。
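The "expected payoff maximizing rational agent" reference point used in the study is easy to make concrete. A minimal sketch, with illustrative prospect values (the study's actual gambles are not given in the abstract):

```python
def expected_payoff(prospect):
    """Expected value of a prospect given as (probability, payoff) pairs."""
    return sum(p * v for p, v in prospect)

def rational_choice(prospects):
    """Index of the prospect the rational reference agent picks."""
    return max(range(len(prospects)), key=lambda i: expected_payoff(prospects[i]))

# Option 0: a sure $30; option 1: 80% chance of $40 (EV = $32).
prospects = [[(1.0, 30.0)], [(0.8, 40.0), (0.2, 0.0)]]
print(rational_choice(prospects))  # 1: the risky option has the higher EV
```

A risk-averse human (or, per the abstract, a conversational model) would often prefer the sure option 0 here; the gap between that choice and the EV-maximizing one is what the comparison measures.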
cs.AI / 7 / 2602.15212

Secure and Energy-Efficient Wireless Agentic AI Networks

安全且节能的无线自主智能网络
Song, Yuanyan, Wang, Kezhi, Xu, Xinmian
Abstract
In this paper, we introduce a secure wireless agentic AI network comprising one supervisor AI agent and multiple other AI agents to provision quality of service (QoS) for users' reasoning tasks while ensuring confidentiality of private knowledge and reasoning outcomes. Specifically, the supervisor AI agent can dynamically assign other AI agents to participate in cooperative reasoning, while the unselected AI agents act as friendly jammers to degrade the eavesdropper's interception performance. To extend the service duration of AI agents, an energy minimization problem is formulated that jointly optimizes AI agent selection, base station (BS) beamforming, and AI agent transmission power, subject to latency and reasoning accuracy constraints. To address the formulated problem, we propose two resource allocation schemes, ASC and LAW, which first decompose it into three sub-problems. Specifically, ASC optimizes each sub-problem iteratively using the proposed alternating direction method of multipliers (ADMM)-based algorithm, semi-definite relaxation (SDR), and successive convex approximation (SCA), while LAW tackles each sub-problem using the proposed large language model (LLM) optimizer within an agentic workflow. The experimental results show that the proposed solutions can reduce network energy consumption by up to 59.1% compared to other benchmark schemes. Furthermore, the proposed schemes are validated using a practical agentic AI system based on Qwen, demonstrating satisfactory reasoning accuracy across various public benchmarks.
Chinese Translation
在本文中,我们介绍了一种安全的无线自主智能网络,该网络由一个监督AI代理和多个其他AI代理组成,以为用户的推理任务提供服务质量(QoS),同时确保私人知识和推理结果的机密性。具体而言,监督AI代理可以动态分配其他AI代理参与协作推理,而未被选择的AI代理则充当友好的干扰者,以降低窃听者的拦截性能。为了延长AI代理的服务时间,我们提出了一个能量最小化问题,该问题联合优化AI代理选择、基站(BS)波束成形和AI代理传输功率,同时满足延迟和推理准确性约束。为了解决该问题,我们提出了两种资源分配方案,ASC和LAW,首先将其分解为三个子问题。具体而言,ASC使用提出的基于交替方向乘子法(ADMM)算法、半正定松弛(SDR)和连续凸近似(SCA)迭代优化每个子问题,而LAW则在自主工作流中使用提出的大型语言模型(LLM)优化器处理每个子问题。实验结果表明,所提出的解决方案相比其他基准方案可以将网络能耗降低多达59.1%。此外,所提出的方案还通过基于Qwen的实际自主AI系统进行了验证,展示了在各种公共基准测试中令人满意的推理准确性。
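The joint optimization the abstract describes can be written schematically as follows. The notation is our own generic rendering, not the paper's exact formulation: $\mathbf{a}$ is the binary agent-selection vector, $\mathbf{w}$ the BS beamforming, $\mathbf{p}$ the agent transmit powers, $E_k$ per-agent energy, $T$ end-to-end latency, and $A$ reasoning accuracy:

```latex
\min_{\mathbf{a},\,\mathbf{w},\,\mathbf{p}} \;\; \sum_{k} a_k \, E_k(\mathbf{w}, p_k)
\quad \text{s.t.} \quad
T(\mathbf{a}, \mathbf{w}, \mathbf{p}) \le T^{\max}, \qquad
A(\mathbf{a}) \ge A^{\min}, \qquad
a_k \in \{0, 1\} \;\; \forall k.
```

The coupling between the binary selection and the continuous beamforming/power variables is what motivates the decomposition into three sub-problems handled by ASC (via ADMM, SDR, SCA) or LAW (via an LLM optimizer).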
cs.AI / 8 / 2602.15248

Predicting Invoice Dilution in Supply Chain Finance with Leakage Free Two Stage XGBoost, KAN (Kolmogorov Arnold Networks), and Ensemble Models

利用无泄漏的两阶段XGBoost、KAN(Kolmogorov Arnold Networks)和集成模型预测供应链金融中的发票稀释
Koptev, Pavel, Kumar, Vishnu, Malkov, Konstantin, Shapiro, George, Vikhanov, Yury
Abstract
Invoice or payment dilution, the gap between the approved invoice amount and the actual collection, is a significant source of non-credit risk and margin loss in supply chain finance. Traditionally, this risk is managed through the buyer's irrevocable payment undertaking (IPU), which commits to full payment without deductions. However, IPUs can hinder supply chain finance adoption, particularly among sub-investment-grade buyers. Newer, data-driven methods use real-time dynamic credit limits, projecting dilution for each buyer-supplier pair in real time. This paper introduces an AI and machine learning framework and evaluates how it can supplement a deterministic algorithm to predict invoice dilution, using an extensive production dataset across nine key transaction fields.
Chinese Translation
发票或付款稀释是批准的发票金额与实际收款之间的差距,是供应链金融中非信用风险和利润损失的重要来源。传统上,这种风险通过买方的不可撤销付款承诺(IPU)来管理,IPU承诺全额付款而不作扣除。然而,IPU可能会阻碍供应链金融的采用,特别是在投资级以下的买方中。较新的数据驱动方法使用实时动态信用限额,实时预测每个买方-供应商对的稀释情况。本文介绍了一种人工智能和机器学习框架,并评估其如何补充确定性算法,以利用涵盖九个关键交易字段的大规模生产数据集预测发票稀释。
cs.AI / 9 / 2602.15270

Enhancing Diversity and Feasibility: Joint Population Synthesis from Multi-source Data Using Generative Models

增强多样性与可行性:基于生成模型的多源数据联合人口合成
Abbasi, Farbod, Patterson, Zachary, Farooq, Bilal
Abstract
Generating realistic synthetic populations is essential for agent-based models (ABM) in transportation and urban planning. Current methods face two major limitations. First, many rely on a single dataset or follow a sequential data fusion and generation process, which means they fail to capture the complex interplay between features. Second, these approaches struggle with sampling zeros (valid but unobserved attribute combinations) and structural zeros (infeasible combinations due to logical constraints), which reduce the diversity and feasibility of the generated data. This study proposes a novel method to simultaneously integrate and synthesize multi-source datasets using a Wasserstein Generative Adversarial Network (WGAN) with gradient penalty. This joint learning method improves both the diversity and feasibility of synthetic data by defining a regularization term (inverse gradient penalty) for the generator loss function. For the evaluation, we implement a unified evaluation metric for similarity, and place special emphasis on measuring diversity and feasibility through recall, precision, and the F1 score. Results show that the proposed joint approach outperforms the sequential baseline, with recall increasing by 7\% and precision by 15\%. Additionally, the regularization term further improves diversity and feasibility, reflected in a 10\% increase in recall and 1\% in precision. We assess similarity distributions using a five-metric score. The joint approach performs better overall, and reaches a score of 88.1 compared to 84.6 for the sequential method. Since synthetic populations serve as a key input for ABM, this multi-source generative approach has the potential to significantly enhance the accuracy and reliability of ABM.
Chinese Translation
生成现实的合成人口对于交通和城市规划中的基于代理的模型(ABM)至关重要。目前的方法面临两个主要限制。首先,许多方法依赖于单一数据集或遵循顺序数据融合和生成过程,这意味着它们未能捕捉特征之间的复杂相互作用。其次,这些方法在处理零采样(有效但未观察到的属性组合)和结构零(由于逻辑约束而不可行的组合)方面存在困难,这降低了生成数据的多样性和可行性。本研究提出了一种新方法,通过使用带有梯度惩罚的Wasserstein生成对抗网络(WGAN)同时整合和合成多源数据集。这种联合学习方法通过为生成器损失函数定义正则化项(逆梯度惩罚),改善了合成数据的多样性和可行性。为了评估,我们实施了统一的相似性评估指标,并特别强调通过召回率、精确率和F1分数来测量多样性和可行性。结果表明,所提出的联合方法优于顺序基线,召回率提高了7\%,精确率提高了15\%。此外,正则化项进一步改善了多样性和可行性,召回率提高了10\%,精确率提高了1\%。我们使用五个指标评分评估相似性分布。总体而言,联合方法表现更佳,得分为88.1,而顺序方法为84.6。由于合成人口作为ABM的关键输入,这种多源生成方法有潜力显著提高ABM的准确性和可靠性。
cs.AI / 10 / 2602.15274

When Remembering and Planning are Worth it: Navigating under Change

记忆与规划的价值:在变化中导航
Madani, Omid, Burns, J. Brian, Eghbali, Reza, Dean, Thomas L.
Abstract
We explore how different types and uses of memory can aid spatial navigation in changing uncertain environments. In the simple foraging task we study, every day, our agent has to find its way from its home, through barriers, to food. Moreover, the world is non-stationary: from day to day, the location of the barriers and food may change, and the agent's sensing such as its location information is uncertain and very limited. Any model construction, such as a map, and use, such as planning, needs to be robust against these challenges, and if any learning is to be useful, it needs to be adequately fast. We look at a range of strategies, from simple to sophisticated, with various uses of memory and learning. We find that an architecture that can incorporate multiple strategies is required to handle (sub)tasks of a different nature, in particular for exploration and search, when food location is not known, and for planning a good path to a remembered (likely) food location. An agent that utilizes non-stationary probability learning techniques to keep updating its (episodic) memories and that uses those memories to build maps and plan on the fly (imperfect maps, i.e. noisy and limited to the agent's experience) can be increasingly and substantially more efficient than the simpler (minimal-memory) agents, as the task difficulties such as distance to goal are raised, as long as the uncertainty, from localization and change, is not too large.
Chinese Translation
我们探讨了不同类型和用途的记忆如何在不断变化的不确定环境中帮助空间导航。在我们研究的简单觅食任务中,代理每天都必须从家中出发,穿越障碍物,找到食物。此外,世界是非平稳的:从一天到另一天,障碍物和食物的位置可能会变化,代理的感知信息(如位置信息)是不确定且非常有限的。任何模型构建(如地图)和使用(如规划)都需要对这些挑战具有鲁棒性,如果任何学习要有用,则需要足够快速。我们考察了一系列策略,从简单到复杂,涉及不同的记忆和学习方式。我们发现,需要一种能够整合多种策略的架构,以处理不同性质的(子)任务,特别是在食物位置未知时的探索和搜索,以及为记住的(可能的)食物位置规划良好路径。当代理利用非平稳概率学习技术不断更新其(情节)记忆,并利用这些记忆构建地图并即时规划(不完美的地图,即噪声和仅限于代理的经验)时,随着任务难度(如目标距离)的提高,其效率可以显著高于简单的(最小记忆)代理,只要来自定位和变化的不确定性不是过于巨大。
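One ingredient the abstract names, non-stationary probability learning over episodic memories, can be sketched as a recency-weighted belief update, so that memories of food locations decay as the world changes. The learning rate and scenario below are our own illustration, not the paper's agent:

```python
def update(belief, observed_food, alpha=0.3):
    """Move the belief toward today's observation; recent days dominate older ones."""
    return (1 - alpha) * belief + alpha * (1.0 if observed_food else 0.0)

# Belief that a remembered location holds food, over four foraging days.
belief = 0.5
for day_has_food in [True, True, True, False]:
    belief = update(belief, day_has_food)

print(round(belief, 3))  # ≈ 0.58: three food days raised it, one empty day pulled it down
```

Under stationarity one would instead average all observations equally; the exponential weighting is what lets the agent track day-to-day change in barrier and food locations.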
cs.AI / 11 / 2602.15294

EAA: Automating materials characterization with vision language model agents

EAA:利用视觉语言模型代理自动化材料表征
Du, Ming, Luo, Yanqi, Banerjee, Srutarshi, Wojcik, Michael, Popovic, Jelena, Cherukara, Mathew J.
Abstract
We present Experiment Automation Agents (EAA), a vision-language-model-driven agentic system designed to automate complex experimental microscopy workflows. EAA integrates multimodal reasoning, tool-augmented action, and optional long-term memory to support both autonomous procedures and interactive user-guided measurements. Built on a flexible task-manager architecture, the system enables workflows ranging from fully agent-driven automation to logic-defined routines that embed localized LLM queries. EAA further provides a modern tool ecosystem with two-way compatibility for Model Context Protocol (MCP), allowing instrument-control tools to be consumed or served across applications. We demonstrate EAA at an imaging beamline at the Advanced Photon Source, including automated zone plate focusing, natural language-described feature search, and interactive data acquisition. These results illustrate how vision-capable agents can enhance beamline efficiency, reduce operational burden, and lower the expertise barrier for users.
Chinese Translation
我们提出了实验自动化代理(EAA),这是一个基于视觉语言模型的代理系统,旨在自动化复杂的实验显微镜工作流程。EAA集成了多模态推理、工具增强的操作和可选的长期记忆,以支持自主程序和交互式用户引导的测量。该系统基于灵活的任务管理架构,能够实现从完全代理驱动的自动化到嵌入局部大语言模型(LLM)查询的逻辑定义例程的各种工作流程。EAA进一步提供了一个现代工具生态系统,具有与模型上下文协议(MCP)的双向兼容性,允许仪器控制工具在不同应用之间进行消费或服务。我们在先进光子源的成像光束线展示了EAA,包括自动化的区板聚焦、自然语言描述的特征搜索和交互式数据采集。这些结果展示了具备视觉能力的代理如何提高光束线效率、减少操作负担,并降低用户的专业知识门槛。
cs.AI / 12 / 2602.15298

X-MAP: eXplainable Misclassification Analysis and Profiling for Spam and Phishing Detection

X-MAP:面向垃圾邮件和网络钓鱼检测的可解释误分类分析与画像
Zhang, Qi, Chen, Dian, Kaplan, Lance M., Jøsang, Audun, Jeong, Dong Hyun, Chen, Feng, Cho, Jin-Hee
Abstract
Misclassifications in spam and phishing detection are very harmful, as false negatives expose users to attacks while false positives degrade trust. Existing uncertainty-based detectors can flag potential errors, but can be deceived and offer limited interpretability. This paper presents X-MAP, an eXplainable Misclassification Analysis and Profiling framework that reveals topic-level semantic patterns behind model failures. X-MAP combines SHAP-based feature attributions with non-negative matrix factorization to build interpretable topic profiles for reliably classified spam/phishing and legitimate messages, and measures each message's deviation from these profiles using Jensen-Shannon divergence. Experiments on SMS and phishing datasets show that misclassified messages exhibit at least two times larger divergence than correctly classified ones. As a detector, X-MAP achieves up to 0.98 AUROC and lowers the false-rejection rate at 95% TRR to 0.089 on positive predictions. When used as a repair layer on base detectors, it recovers up to 97% of falsely rejected correct predictions with moderate leakage. These results demonstrate X-MAP's effectiveness and interpretability for improving spam and phishing detection.
Chinese Translation
在垃圾邮件和网络钓鱼检测中,误分类是非常有害的,因为假阴性使用户暴露于攻击之中,而假阳性则降低了信任度。现有的基于不确定性的检测器可以标记潜在错误,但可能会受到欺骗,并且提供的可解释性有限。本文提出了X-MAP,一个可解释的误分类分析与画像框架,揭示了模型失败背后的主题级语义模式。X-MAP结合了基于SHAP的特征归因与非负矩阵分解,为被可靠分类的垃圾邮件/网络钓鱼消息和合法消息构建可解释的主题画像,并使用詹森-香农散度测量每条消息与这些画像的偏差。在SMS和网络钓鱼数据集上的实验表明,误分类消息的散度至少比正确分类的消息大两倍。作为检测器,X-MAP在正预测上达到了高达0.98的AUROC,并将95% TRR下的假拒绝率降低至0.089。当作为基础检测器的修复层使用时,它能够恢复高达97%的被错误拒绝的正确预测,且泄漏程度适中。这些结果证明了X-MAP在改进垃圾邮件和网络钓鱼检测方面的有效性和可解释性。
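The profiling-and-deviation step can be made concrete with a pure-Python Jensen-Shannon divergence between topic distributions. The three-topic profile and message vectors below are invented for illustration; X-MAP's real profiles come from SHAP attributions factorized with NMF:

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence in bits (terms with p_i = 0 contribute 0)."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic profile of reliably classified spam vs. two new messages.
spam_profile = [0.60, 0.30, 0.10]
typical_msg  = [0.55, 0.35, 0.10]  # close to the profile: likely classified correctly
deviant_msg  = [0.05, 0.15, 0.80]  # far from the profile: flag as possible misclassification

d_typical = js_divergence(typical_msg, spam_profile)
d_deviant = js_divergence(deviant_msg, spam_profile)
print(d_typical < d_deviant)  # True: larger deviation marks the suspect message
```

Thresholding this divergence is what lets the framework act as a misclassification detector and, downstream, as a repair layer.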
cs.AI / 13 / 2602.15325

AgriWorld: A World Tools Protocol Framework for Verifiable Agricultural Reasoning with Code-Executing LLM Agents

AgriWorld:一个用于可验证农业推理的世界工具协议框架,结合代码执行的LLM代理
Zhang, Zhixing, Zhang, Jesen, Liu, Hao, Lv, Qinhan, Yang, Jing, Cai, Kaitong, Wang, Keze
Abstract
Foundation models for agriculture are increasingly trained on massive spatiotemporal data (e.g., multi-spectral remote sensing, soil grids, and field-level management logs) and achieve strong performance on forecasting and monitoring. However, these models lack language-based reasoning and interactive capabilities, limiting their usefulness in real-world agronomic workflows. Meanwhile, large language models (LLMs) excel at interpreting and generating text, but cannot directly reason over high-dimensional, heterogeneous agricultural datasets. We bridge this gap with an agentic framework for agricultural science. It provides a Python execution environment, AgriWorld, exposing unified tools for geospatial queries over field parcels, remote-sensing time-series analytics, crop growth simulation, and task-specific predictors (e.g., yield, stress, and disease risk). On top of this environment, we design a multi-turn LLM agent, Agro-Reflective, that iteratively writes code, observes execution results, and refines its analysis via an execute-observe-refine loop. We introduce AgroBench, with scalable data generation for diverse agricultural QA spanning lookups, forecasting, anomaly detection, and counterfactual "what-if" analysis. In experiments, the agent outperforms text-only and direct tool-use baselines, validating execution-driven reflection for reliable agricultural reasoning.
Chinese Translation
农业基础模型越来越多地在大规模时空数据(例如,多光谱遥感、土壤网格和田间管理日志)上进行训练,并在预测和监测方面取得了良好的表现。然而,这些模型缺乏基于语言的推理和交互能力,限制了它们在实际农业工作流程中的实用性。与此同时,大型语言模型(LLMs)在文本解释和生成方面表现出色,但无法直接对高维异构农业数据集进行推理。我们通过一个农业科学的代理框架来弥补这一差距。该框架提供了一个Python执行环境AgriWorld,暴露出统一的工具用于对田块进行地理空间查询、遥感时间序列分析、作物生长模拟以及特定任务的预测(例如,产量、压力和疾病风险)。在此环境之上,我们设计了一个多轮LLM代理Agro-Reflective,它通过执行-观察-精炼循环迭代地编写代码、观察执行结果并完善其分析。我们引入了AgroBench,提供可扩展的数据生成,涵盖多样化的农业问答,包括查找、预测、异常检测和反事实“如果”分析。实验结果超越了仅使用文本和直接工具使用的基线,验证了基于执行驱动的反思在可靠农业推理中的有效性。
cs.AI / 14 / 2602.15384

World-Model-Augmented Web Agents with Action Correction

增强世界模型的网络代理与动作修正
Shen, Zhouzhou, Hu, Xueyu, Li, Xiyun, Fang, Tianqing, Li, Juncheng, Zhang, Shengyu
Abstract
Web agents based on large language models have demonstrated promising capability in automating web tasks. However, current web agents struggle to reason out sensible actions due to the limitations of predicting environment changes, and might not possess comprehensive awareness of execution risks, prematurely performing risky actions that cause losses and lead to task failure. To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement. To overcome the cognitive isolation of individual models, we introduce a multi-agent collaboration process that enables an action model to consult a world model as a web-environment expert for strategic guidance; the action model then grounds these suggestions into executable actions, leveraging prior knowledge of environmental state transition dynamics to enhance candidate action proposal. To achieve risk-aware resilient task execution, we introduce a two-stage deduction chain. A world model, specialized in environmental state transitions, simulates action outcomes, which a judge model then scrutinizes to trigger action corrective feedback when necessary. Experiments show that WAC achieves absolute gains of 1.8% on VisualWebArena and 1.3% on Online-Mind2Web.
Chinese Translation
基于大型语言模型的网络代理在自动化网络任务方面展现了良好的能力。然而,当前的网络代理在推理合理的行动方面存在困难,这主要是由于预测环境变化的局限性,并且可能缺乏对执行风险的全面认识,导致其过早地执行风险较高的行动,从而造成损失并导致任务失败。为了解决这些挑战,我们提出了WAC,一个集成了模型协作、后果模拟和反馈驱动的行动精炼的网络代理。为了克服单个模型的认知孤立,我们引入了一种多代理协作过程,使得行动模型能够咨询作为网络环境专家的世界模型以获取战略指导;然后,行动模型将这些建议转化为可执行的行动,利用对环境状态转移动态的先前知识来增强候选行动的提议。为了实现风险意识的弹性任务执行,我们引入了一个两阶段的推理链。一个专注于环境状态转移的世界模型模拟行动结果,随后由一个判断模型进行审查,以在必要时触发行动修正反馈。实验表明,WAC在VisualWebArena上实现了1.8%的绝对增益,在Online-Mind2Web上实现了1.3%的绝对增益。
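The two-stage deduction chain (propose an action, simulate its consequence with a world model, scrutinize with a judge, and correct on rejection) can be sketched with rule-based stubs standing in for the three models. The actions, outcomes, and rejection rule below are hypothetical; in WAC all three components are LLMs:

```python
def action_model(task, feedback=None):
    """Propose an action; revise it if the judge rejected the previous attempt."""
    return "click_buy_now" if feedback is None else "add_to_cart"

def world_model(action):
    """Simulate the environment-state consequence of an action."""
    return {"click_buy_now": "order placed, irreversible payment",
            "add_to_cart": "item saved, nothing billed"}[action]

def judge_model(outcome):
    """Approve only outcomes without irreversible side effects."""
    return "payment" not in outcome

def execute_with_correction(task, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        action = action_model(task, feedback)
        outcome = world_model(action)
        if judge_model(outcome):
            return action
        feedback = f"rejected: {outcome}"  # corrective feedback to the action model
    return None

print(execute_with_correction("save this item to buy later"))
```

The first proposal is risky and gets vetoed before execution; the corrected second proposal passes, which is the "risk-aware resilient execution" the abstract describes.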
cs.AI / 15 / 2602.15391

Improving LLM Reliability through Hybrid Abstention and Adaptive Detection

通过混合弃权和自适应检测提高大型语言模型的可靠性
Sharma, Ankit, Tapas, Nachiket, Patra, Jyotiprakash
Abstract
Large Language Models (LLMs) deployed in production environments face a fundamental safety-utility trade-off: strict filtering mechanisms prevent harmful outputs but often block benign queries, while relaxed controls risk unsafe content generation. Conventional guardrails based on static rules or fixed confidence thresholds are typically context-insensitive and computationally expensive, resulting in high latency and degraded user experience. To address these limitations, we introduce an adaptive abstention system that dynamically adjusts safety thresholds based on real-time contextual signals such as domain and user history. The proposed framework integrates a multi-dimensional detection architecture composed of five parallel detectors, combined through a hierarchical cascade mechanism to optimize both speed and precision. The cascade design reduces unnecessary computation by progressively filtering queries, achieving substantial latency improvements compared to non-cascaded models and external guardrail systems. Extensive evaluation on mixed and domain-specific workloads demonstrates significant reductions in false positives, particularly in sensitive domains such as medical advice and creative writing. The system maintains high safety precision and near-perfect recall under strict operating modes. Overall, our context-aware abstention framework effectively balances safety and utility while preserving performance, offering a scalable solution for reliable LLM deployment.
Chinese Translation
在生产环境中部署的大型语言模型(LLMs)面临着基本的安全性与效用之间的权衡:严格的过滤机制可以防止有害输出,但往往会阻止良性查询,而放松的控制则存在生成不安全内容的风险。基于静态规则或固定置信度阈值的传统防护措施通常对上下文不敏感且计算成本高,导致高延迟和用户体验下降。为了解决这些局限性,我们提出了一种自适应弃权系统,该系统根据实时上下文信号(如领域和用户历史)动态调整安全阈值。所提出的框架集成了一个多维检测架构,由五个并行检测器组成,通过分层级联机制结合,以优化速度和精度。级联设计通过逐步过滤查询减少不必要的计算,与非级联模型和外部防护系统相比,实现了显著的延迟改善。在混合和领域特定工作负载上的广泛评估显示,特别是在医疗建议和创意写作等敏感领域,假阳性显著减少。该系统在严格操作模式下保持高安全精度和近乎完美的召回率。总体而言,我们的上下文感知弃权框架有效地平衡了安全性和效用,同时保持了性能,为可靠的大型语言模型部署提供了可扩展的解决方案。
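A hierarchical cascade with context-dependent thresholds might look like the following sketch. The two stub detectors, domains, scores, and cutoff values are illustrative stand-ins, not the paper's calibrated five-detector architecture:

```python
def cheap_keyword_detector(query):
    """Fast first-stage risk score (stub)."""
    if "attack" in query:
        return 0.9
    if "bypass" in query:
        return 0.5
    return 0.1

def expensive_semantic_detector(query):
    """Slow second-stage risk score, only run when the first stage is uncertain (stub)."""
    return 0.8 if "bypass" in query else 0.2

def threshold_for(domain):
    """Context-adaptive cutoff: relaxed in creative writing, tightened in medical advice."""
    return {"creative_writing": 0.95, "medical": 0.6}.get(domain, 0.8)

def should_abstain(query, domain):
    cutoff = threshold_for(domain)
    score = cheap_keyword_detector(query)
    if score >= cutoff:   # early exit: clearly risky, skip the expensive stage
        return True
    if score <= 0.2:      # early exit: clearly benign
        return False
    return expensive_semantic_detector(query) >= cutoff

print(should_abstain("how to attack a problem", "creative_writing"))  # False
print(should_abstain("bypass safety checks", "medical"))              # True
```

The early exits are where the cascade saves latency; the per-domain cutoff is what reduces false positives on benign creative-writing queries while staying strict for medical advice.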
cs.AI / 16 / 2602.15403

Common Belief Revisited

重新审视共同信念
Ågotnes, Thomas
Abstract
Contrary to common belief, common belief is not KD4. If individual belief is KD45, common belief does indeed lose the 5 property and keep the D and 4 properties -- and it has none of the other commonly considered properties of knowledge and belief. But it has another property: $C(C\phi \rightarrow \phi)$ -- corresponding to so-called shift-reflexivity (reflexivity one step ahead). This observation begs the question: is KD4 extended with this axiom a complete characterisation of common belief in the KD45 case? If not, what \emph{is} the logic of common belief? In this paper we show that the answer to the first question is ``no'': there is one additional axiom, and, furthermore, it relies on the number of agents. We show that the result is a complete characterisation of common belief, settling the open problem.
Chinese Translation
与普遍看法相反,共同信念并不是 KD4。如果个体信念是 KD45,则共同信念确实失去了 5 性质,而保留了 D 和 4 性质——并且它不具有其他任何通常被考虑的知识与信念性质。但它具有另一个性质:$C(C\phi \rightarrow \phi)$——对应于所谓的移位自反性(向前一步的自反性)。这一观察引发了一个问题:在 KD45 的情况下,以该公理扩展 KD4 是否是共同信念的完整刻画?如果不是,那么共同信念的逻辑究竟是什么?在本文中,我们表明第一个问题的答案是否定的:还需要一个额外的公理,而且该公理依赖于代理的数量。我们证明了这一结果是共同信念的完整刻画,从而解决了这一悬而未决的问题。
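Written out, the candidate axiomatisation discussed in the abstract is KD4 for the common-belief operator $C$ plus shift-reflexivity; the paper shows one further, agent-count-dependent axiom is still needed, which the abstract does not spell out:

```latex
\begin{aligned}
\text{(D)} \quad & C\phi \rightarrow \neg C \neg \phi \\
\text{(4)} \quad & C\phi \rightarrow C C \phi \\
\text{(shift-reflexivity)} \quad & C(C\phi \rightarrow \phi)
\end{aligned}
```

Shift-reflexivity is strictly weaker than reflexivity ($C\phi \rightarrow \phi$): common belief need not be true, but it is commonly believed that whatever is commonly believed is true.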
cs.AI / 17 / 2602.15531

GenAI-LA: Generative AI and Learning Analytics Workshop (LAK 2026), April 27--May 1, 2026, Bergen, Norway

GenAI-LA:生成性人工智能与学习分析研讨会(LAK 2026),2026年4月27日至5月1日,挪威卑尔根
Irigoyen, Javier, Daza, Roberto, Morales, Aythami, Fierrez, Julian, Jurado, Francisco, Ortigosa, Alvaro, Tolosana, Ruben
Abstract
This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
Chinese Translation
本研究介绍了EduEVAL-DB,这是一个基于教师角色的数据集,旨在支持自动教学评估者和人工智能辅导员对教学解释的评估和培训。该数据集包含854个解释,对应于来自ScienceQA基准的139个问题,涵盖K-12年级的科学、语言和社会科学。每个问题提供一个人类教师的解释,并由大型语言模型(LLM)模拟的教师角色生成六个解释。这些角色受到真实教育实践中观察到的教学风格和不足之处的启发,通过提示工程实现。我们进一步提出了一种与既定教育标准对齐的教学风险评估标准,具体化了五个互补的风险维度:事实正确性、解释深度与完整性、关注点与相关性、学生水平适宜性以及意识形态偏见。所有解释均通过专家教师审核的半自动化过程进行二元风险标签注释。最后,我们展示了初步验证实验,以评估EduEVAL-DB的评估适用性。我们将一种最先进的教育导向模型(Gemini 2.5 Pro)与轻量级本地Llama 3.1 8B模型进行基准测试,并考察在EduEVAL-DB上进行监督微调是否支持可部署在消费者硬件上的模型进行教学风险检测。
cs.AI / 18 / 2602.15532

Quantifying construct validity in large language model evaluations

量化大型语言模型评估中的构念效度
Kearns, Ryan Othniel
Abstract
The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.
Chinese Translation
大型语言模型(LLM)社区常常将基准结果报告为与模型的整体能力同义。然而,基准测试可能存在扭曲性能的问题,例如测试集污染和标注者错误。我们如何知道一个基准是否是我们想要测量的某种能力的可靠指标?这个问题涉及LLM基准的构念效度,并且在我们建模和预测LLM性能时,需要将基准结果与能力分开。社会科学家和计算机科学家都提出了正式模型——潜在因子模型和缩放法则——用于识别基准分数背后的能力。然而,这两种技术在构念效度方面都不令人满意。潜在因子模型忽视了缩放法则,因此它们提取的能力往往代理模型规模。缩放法则忽视了测量误差,因此它们提取的能力既不可解释又过拟合于观察到的基准。本文提出了结构化能力模型,这是第一个从大量LLM基准结果中提取可解释和可推广能力的模型。我在OpenLLM排行榜的大量结果样本上拟合了该模型及其两个替代模型。结构化能力在简约拟合指标上优于潜在因子模型,并且在分布外基准预测方面表现出比缩放法则更好的效果。这些改进之所以可能,是因为现有方法都未能以适当的方式将模型规模与能力分开。模型规模应当像缩放法则那样影响能力,而这些能力应当像潜在因子模型那样影响观察结果,考虑到测量误差。通过结合这两种见解,结构化能力在量化LLM评估中的构念效度方面展示了更好的解释力和预测力。
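The abstract's combination of the two modelling traditions can be written schematically. The notation is ours, not the thesis's: $N_m$ is the scale of model $m$, $f_m$ its latent capability vector, $s_{b,m}$ its score on benchmark $b$, and $\lambda_b$ the benchmark's loadings:

```latex
\underbrace{f_m = g(\log N_m) + \delta_m}_{\text{scale informs capabilities (scaling-law link)}}
\qquad
\underbrace{s_{b,m} = \lambda_b^{\top} f_m + \varepsilon_{b,m}}_{\text{capabilities inform scores up to measurement error (factor-model link)}}
```

Pure scaling laws drop the $\varepsilon_{b,m}$ measurement-error term; pure latent factor models drop the $g(\log N_m)$ scale link, so the extracted factors end up proxying model size.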
cs.AI / 19 / 2602.15553

RUVA: Personalized Transparent On-Device Graph Reasoning

RUVA:个性化透明的设备端图推理
Conte, Gabriele, Mattiace, Alessio, Carmosino, Gianni, Aghilar, Potito, Servedio, Giovanni, Musicco, Francesco, Anelli, Vito Walter, Di Noia, Tommaso, Donini, Francesco Maria
Abstract
The Personal AI landscape is currently dominated by "Black Box" Retrieval-Augmented Generation. While standard vector databases offer statistical matching, they suffer from a fundamental lack of accountability: when an AI hallucinates or retrieves sensitive data, the user cannot inspect the cause nor correct the error. Worse, "deleting" a concept from a vector space is mathematically imprecise, leaving behind probabilistic "ghosts" that violate true privacy. We propose Ruva, the first "Glass Box" architecture designed for Human-in-the-Loop Memory Curation. Ruva grounds Personal AI in a Personal Knowledge Graph, enabling users to inspect what the AI knows and to perform precise redaction of specific facts. By shifting the paradigm from Vector Matching to Graph Reasoning, Ruva ensures the "Right to be Forgotten." Users are the editors of their own lives; Ruva hands them the pen. The project and the demo video are available at http://sisinf00.poliba.it/ruva/.
Chinese Translation
个人人工智能领域目前被“黑箱”检索增强生成技术所主导。虽然标准向量数据库提供统计匹配,但它们在根本上缺乏问责制:当人工智能产生幻觉或检索敏感数据时,用户无法检查原因或纠正错误。更糟糕的是,从向量空间“删除”一个概念在数学上是不精确的,留下了违反真实隐私的概率性“幽灵”。我们提出了Ruva,这是第一个为人机协同记忆管理设计的“玻璃箱”架构。Ruva将个人人工智能建立在个人知识图谱之上,使用户能够检查人工智能所知道的内容,并对特定事实进行精确删除(redaction)。通过将范式从向量匹配转变为图推理,Ruva确保了“被遗忘权”。用户是自己生活的编辑;Ruva将笔交给他们。项目和演示视频可在http://sisinf00.poliba.it/ruva/获取。
cs.AI / 20 / 2602.15580

How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning

视觉如何转化为语言:多模态推理的层级信息论分析
Wu, Hongxuan, Zhang, Yukun, Zhou, Xueqing
Abstract
When a multimodal Transformer answers a visual question, is the prediction driven by visual evidence, linguistic reasoning, or genuinely fused cross-modal computation -- and how does this structure evolve across layers? We address this question with a layer-wise framework based on Partial Information Decomposition (PID) that decomposes the predictive information at each Transformer layer into redundant, vision-unique, language-unique, and synergistic components. To make PID tractable for high-dimensional neural representations, we introduce \emph{PID Flow}, a pipeline combining dimensionality reduction, normalizing-flow Gaussianization, and closed-form Gaussian PID estimation. Applying this framework to LLaVA-1.5-7B and LLaVA-1.6-7B across six GQA reasoning tasks, we uncover a consistent \emph{modal transduction} pattern: visual-unique information peaks early and decays with depth, language-unique information surges in late layers to account for roughly 82\% of the final prediction, and cross-modal synergy remains below 2\%. This trajectory is highly stable across model variants (layer-wise correlations $>$0.96) yet strongly task-dependent, with semantic redundancy governing the detailed information fingerprint. To establish causality, we perform targeted Image$\rightarrow$Question attention knockouts and show that disrupting the primary transduction pathway induces predictable increases in trapped visual-unique information, compensatory synergy, and total information cost -- effects that are strongest in vision-dependent tasks and weakest in high-redundancy tasks. Together, these results provide an information-theoretic, causal account of how vision becomes language in multimodal Transformers, and offer quantitative guidance for identifying architectural bottlenecks where modality-specific information is lost.
Chinese Translation
当一个多模态Transformer回答视觉问题时,预测是由视觉证据、语言推理驱动,还是由真正融合的跨模态计算驱动——这种结构在各层之间是如何演变的?我们通过基于部分信息分解(Partial Information Decomposition, PID)的层级框架来解决这个问题,该框架将每个Transformer层的预测信息分解为冗余、视觉独特、语言独特和协同组件。为了使PID在高维神经表示中可行,我们引入了PID Flow,这是一个结合了降维、归一化流高斯化和闭式高斯PID估计的管道。将该框架应用于LLaVA-1.5-7B和LLaVA-1.6-7B在六个GQA推理任务中,我们发现了一种一致的“模态转导”模式:视觉独特信息在早期达到峰值并随着深度衰减,语言独特信息在后期层中激增,约占最终预测的82%,而跨模态协同保持在2%以下。这一轨迹在模型变体之间高度稳定(层级相关性$>$0.96),但强烈依赖于任务,语义冗余主导着详细的信息指纹。为了建立因果关系,我们进行了针对性的图像→问题注意力击穿实验,结果表明,破坏主要的转导路径会导致被困的视觉独特信息、补偿性协同和总信息成本的可预测增加——这些效应在视觉依赖任务中最强,在高冗余任务中最弱。综合这些结果,我们提供了一个信息论的因果解释,阐明了在多模态Transformer中视觉如何转化为语言,并为识别模态特定信息丢失的架构瓶颈提供了定量指导。
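The closed-form Gaussian PID step can be illustrated with a small numerical sketch. The paper does not state which redundancy measure its estimator uses, so this sketch assumes the minimum-mutual-information (MMI) redundancy, which is closed-form for jointly Gaussian variables; the toy covariance and the index arguments are invented for illustration.

```python
import numpy as np

def gaussian_mi(cov, ix, iy):
    """I(X;Y) for jointly Gaussian variables, computed from the joint
    covariance via log-determinants. ix, iy index the X and Y blocks."""
    sx = cov[np.ix_(ix, ix)]
    sy = cov[np.ix_(iy, iy)]
    sxy = cov[np.ix_(ix + iy, ix + iy)]
    return 0.5 * np.log(np.linalg.det(sx) * np.linalg.det(sy) / np.linalg.det(sxy))

def gaussian_pid(cov, iv, il, iy):
    """Decompose I(vision, language ; prediction) into redundant / unique /
    synergistic parts using the MMI redundancy (an assumption here)."""
    i_v = gaussian_mi(cov, iv, iy)
    i_l = gaussian_mi(cov, il, iy)
    i_vl = gaussian_mi(cov, iv + il, iy)
    red = min(i_v, i_l)
    return {"redundant": red,
            "vision_unique": i_v - red,
            "language_unique": i_l - red,
            "synergy": i_vl - i_v - i_l + red}

# Toy joint covariance over (vision feature, language feature, prediction):
# the prediction correlates mostly with the language feature.
cov = np.array([[1.0, 0.3, 0.2],
                [0.3, 1.0, 0.7],
                [0.2, 0.7, 1.0]])
parts = gaussian_pid(cov, iv=[0], il=[1], iy=[2])
```

On this toy covariance the language-unique term dominates, mirroring the late-layer pattern the paper reports.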
cs.AI / 21 / 2602.15635

On inferring cumulative constraints

推断累积约束
Sidorov, Konstantin
Abstract
Cumulative constraints are central in scheduling with constraint programming, yet propagation is typically performed per constraint, missing multi-resource interactions and causing severe slowdowns on some benchmarks. I present a preprocessing method for inferring additional cumulative constraints that capture such interactions without search-time probing. This approach interprets cumulative constraints as linear inequalities over occupancy vectors and generates valid inequalities by (i) discovering covers, the sets of tasks that cannot run in parallel, (ii) strengthening the cover inequalities for the discovered sets with lifting, and (iii) injecting the resulting constraints back into the scheduling problem instance. Experiments on standard RCPSP and RCPSP/max test suites show that these inferred constraints improve search performance and tighten objective bounds on favorable instances, while incurring little degradation on unfavorable ones. Additionally, these experiments discover 25 new lower bounds and five new best solutions; eight of the lower bounds are obtained directly from the inferred constraints.
Chinese Translation
累积约束在约束编程的调度中至关重要,然而通常是针对每个约束进行传播,这忽略了多资源之间的交互,并在某些基准测试中导致严重的性能下降。本文提出了一种预处理方法,用于推断额外的累积约束,以捕捉这种交互,而无需在搜索时间内进行探测。该方法将累积约束解释为占用向量上的线性不等式,并通过以下方式生成有效的不等式:(i) 发现覆盖集,即不能并行运行的任务集合,(ii) 通过提升加强发现集的覆盖不等式,以及 (iii) 将生成的约束注入回调度问题实例中。在标准的 RCPSP 和 RCPSP/max 测试套件上的实验表明,这些推断出的约束改善了搜索性能,并在有利实例上收紧了目标界限,同时对不利实例的影响很小。此外,这些实验发现了 25 个新的下界和五个新的最佳解;其中八个下界是直接从推断出的约束中获得的。
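The cover-discovery step (i) and re-injection step (iii) can be sketched as follows. The brute-force enumeration, the `max_size` cutoff, and the unit-demand re-encoding are simplifying assumptions of this sketch; the lifting step (ii) is omitted.

```python
from itertools import combinations

def minimal_covers(demands, capacity, max_size=4):
    """Enumerate minimal covers: task sets whose combined resource demand
    exceeds capacity (so they can never all run in parallel) and that lose
    that property when any task is removed. Checking removal of the
    smallest-demand task suffices for minimality."""
    tasks = sorted(demands)
    covers = []
    for k in range(2, max_size + 1):
        for combo in combinations(tasks, k):
            total = sum(demands[t] for t in combo)
            if total <= capacity:
                continue
            if total - min(demands[t] for t in combo) <= capacity:
                covers.append(set(combo))
    return covers

def cover_to_cumulative(cover):
    """Re-encode a cover as an inferred cumulative constraint: each task in
    the cover gets unit demand, and at most len(cover)-1 may overlap."""
    return {"tasks": sorted(cover), "demand": 1, "capacity": len(cover) - 1}

demands = {"a": 3, "b": 3, "c": 2, "d": 1}
covers = minimal_covers(demands, capacity=5)
inferred = [cover_to_cumulative(c) for c in covers]
```

With capacity 5, tasks a and b alone already form a cover (3 + 3 > 5), so the inferred constraint says they can never overlap.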
cs.AI / 22 / 2602.15645

CARE Drive: A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving

CARE Drive:评估自动驾驶中视觉语言模型的理由响应能力的框架
Suryana, Lucas Elbert, Bierenga, Farah, van Buuren, Sanne, Kooij, Pepijn, Tulleners, Elsefien, Scari, Federico, Calvert, Simeon, van Arem, Bart, Zgonnikov, Arkady
Abstract
Foundation models, including vision language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural language explanations. However, existing evaluation methods primarily assess outcome-based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human-relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason-responsive decision making or merely post-hoc rationalizations. This limitation is especially significant in safety-critical domains because it can create false confidence. To address this gap, we propose CARE Drive, Context-Aware Reasons Evaluation for Driving, a model-agnostic framework for evaluating reason-responsiveness in vision language models applied to automated driving. CARE Drive compares baseline and reason-augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two-stage evaluation process. Prompt calibration ensures stable outputs. Systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE Drive in a cyclist-overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert-recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason-responsiveness in foundation models can be systematically evaluated without modifying model parameters.
Chinese Translation
基础模型,包括视觉语言模型,越来越多地应用于自动驾驶,以解释场景、推荐行动和生成自然语言解释。然而,现有的评估方法主要评估基于结果的性能,如安全性和轨迹准确性,而未能确定模型决策是否反映人类相关的考虑。因此,目前尚不清楚这些模型生成的解释是否对应于真正的理由响应决策,或仅仅是事后合理化。这一局限性在安全关键领域尤为重要,因为它可能导致虚假的信心。为了解决这一问题,我们提出了CARE Drive(Context Aware Reasons Evaluation for Driving),这是一个与模型无关的框架,用于评估应用于自动驾驶的视觉语言模型的理由响应能力。CARE Drive在受控的上下文变化下比较基线模型和增强理由模型的决策,以评估人类理由是否对决策行为产生因果影响。该框架采用两阶段评估过程。提示校准确保输出稳定。系统的上下文扰动则测量决策对人类理由(如安全边际、社会压力和效率约束)的敏感性。我们在一个涉及竞争规范考虑的骑行者超车场景中展示了CARE Drive。结果表明,明确的人类理由显著影响模型决策,提高了与专家推荐行为的一致性。然而,响应能力在不同上下文因素之间存在差异,表明对不同类型理由的敏感性不均。这些发现提供了实证证据,表明基础模型中的理由响应能力可以在不修改模型参数的情况下进行系统评估。
cs.AI / 23 / 2602.15669

PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra

PERSONA:通过激活向量代数实现动态和组合推理时个性控制
Feng, Xiachong, Zhao, Liang, Zhong, Weihong, Huang, Yichong, Gu, Yuxuan, Kong, Lingpeng, Feng, Xiaocheng, Qin, Bing
Abstract
Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model's representation space that support algebraic operations. The framework operates through three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.
Chinese Translation
当前大语言模型中的个性控制方法依赖于静态提示或昂贵的微调,未能捕捉人类特质的动态和组合特性。我们提出了PERSONA,一个无训练的框架,通过直接操控激活空间中的个性向量,实现微调级别的性能。我们的关键见解是,个性特征在模型的表示空间中表现为可提取的、近似正交的方向,支持代数运算。该框架通过三个阶段运作:Persona-Base通过对比激活分析提取正交特征向量;Persona-Algebra通过向量算术(标量乘法用于强度,向量加法用于组合,向量减法用于抑制)实现精确控制;Persona-Flow通过在推理过程中动态组合这些向量,实现上下文感知的适应。在PersonalityBench上,我们的方法实现了9.60的平均得分,几乎匹配监督微调的上限9.61,而无需任何梯度更新。在我们提出的Persona-Evolve基准测试中,针对动态个性适应,我们在不同模型系列中实现了高达91%的胜率。这些结果提供了证据,表明大语言模型个性的某些方面在数学上是可处理的,为可解释和高效的行为控制开辟了新的方向。
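A minimal sketch of the three algebraic operations, assuming contrastive mean-difference extraction and Gram-Schmidt orthogonalization (the paper's exact extraction procedure may differ); all activations here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def trait_vector(pos_acts, neg_acts):
    """Persona-Base (sketch): a trait direction as the normalized difference
    of mean activations between trait-positive and trait-negative prompts."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def orthogonalize(vectors):
    """Gram-Schmidt, so traits occupy approximately orthogonal directions."""
    basis = []
    for v in vectors:
        for b in basis:
            v = v - (v @ b) * b
        basis.append(v / np.linalg.norm(v))
    return basis

def steer(hidden, traits, weights):
    """Persona-Algebra (sketch): scalar multiplication for intensity,
    addition for composition, negative weights for suppression."""
    out = hidden.copy()
    for v, w in zip(traits, weights):
        out = out + w * v
    return out

d = 16  # toy hidden size
v_extra = trait_vector(rng.normal(1.0, 1.0, (64, d)), rng.normal(-1.0, 1.0, (64, d)))
v_agree = trait_vector(rng.normal(0.0, 1.0, (64, d)), rng.normal(0.5, 1.0, (64, d)))
basis = orthogonalize([v_extra, v_agree])
hidden = rng.normal(size=d)
# Amplify trait 1 (weight 1.5), suppress trait 2 (weight -0.5).
steered = steer(hidden, basis, weights=[1.5, -0.5])
```

Because the basis is orthonormal, the steering weights translate directly into projection shifts along each trait direction.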
cs.AI / 24 / 2602.15725

Recursive Concept Evolution for Compositional Reasoning in Large Language Models

递归概念演化在大型语言模型中的组合推理
Chaudhry, Sarim
Abstract
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE. Existing methods improve reasoning by expanding token-level search through chain-of-thought prompting, self-consistency, or reinforcement learning, but they leave the model's latent representation space fixed. When the required abstraction is not already encoded in this space, performance collapses. We propose Recursive Concept Evolution (RCE), a framework that enables pretrained language models to modify their internal representation geometry during inference. RCE introduces dynamically generated low-rank concept subspaces that are spawned when representational inadequacy is detected, selected through a minimum description length criterion, merged when synergistic, and consolidated via constrained optimization to preserve stability. This process allows the model to construct new abstractions rather than recombining existing ones. We integrate RCE with Mistral-7B and evaluate it across compositional reasoning benchmarks. RCE yields 12-18 point gains on ARC-AGI-2, 8-14 point improvements on GPQA and BBH, and consistent reductions in depth-induced error on MATH and HLE.
Chinese Translation
大型语言模型在许多复杂推理任务中表现出色,但在需要组合推理的基准测试中(包括 ARC-AGI-2、GPQA、MATH、BBH 和 HLE),其准确性急剧下降。现有方法通过链式思维提示、自我一致性或强化学习扩展了基于标记的搜索,从而改善推理,但它们使模型的潜在表示空间保持不变。当所需的抽象未在该空间中编码时,性能会崩溃。我们提出了递归概念演化(Recursive Concept Evolution, RCE),这是一个框架,使预训练语言模型能够在推理过程中修改其内部表示几何结构。RCE 引入动态生成的低秩概念子空间,当检测到表示不足时生成,通过最小描述长度标准进行选择,在协同作用时合并,并通过约束优化进行巩固以保持稳定性。这个过程使模型能够构建新的抽象,而不是重新组合现有的抽象。我们将 RCE 与 Mistral-7B 集成,并在组合推理基准上进行了评估。RCE 在 ARC-AGI-2 上获得了 12-18 分的提升,在 GPQA 和 BBH 上提高了 8-14 分,并在 MATH 和 HLE 上持续减少了深度引起的错误。
cs.AI / 25 / 2602.15776

GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems

GlobeDiff:用于多智能体系统部分可观测性的状态扩散过程
Yang, Yiqin, Yang, Xu, Jiang, Yuhua, Mu, Ni, Hu, Hao, Xie, Runpeng, Zhang, Ziyou, Li, Siyuan, Ni, Yuan-Hua, Zhao, Qianchuan, Xu, Bo
Abstract
In the realm of multi-agent systems, the challenge of \emph{partial observability} is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose the Global State Diffusion Algorithm (GlobeDiff) to infer the global state based on local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.
Chinese Translation
在多智能体系统领域,“部分可观测性”的挑战是有效协调和决策的关键障碍。现有的方法,如信念状态估计和智能体间通信,往往效果不佳。基于信念的方法受到其专注于过去经验的限制,未能充分利用全局信息,而通信方法通常缺乏有效利用其提供的辅助信息的稳健模型。为了解决这一问题,我们提出了全球状态扩散算法(Global State Diffusion Algorithm,GlobeDiff),以根据局部观测推断全局状态。通过将状态推断过程表述为多模态扩散过程,GlobeDiff克服了状态估计中的歧义,同时以高保真度推断全局状态。我们证明了GlobeDiff在单模态和多模态分布下的估计误差是有界的。大量实验结果表明,GlobeDiff实现了优越的性能,能够准确推断全局状态。
cs.AI / 26 / 2602.15785

This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

本研究未涉及人类受试者:验证大型语言模型(LLMs)模拟作为行为证据
Hullman, Jessica, Broska, David, Sun, Huaman, Shaw, Aaron
Abstract
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
Chinese Translation
越来越多的文献使用大型语言模型(LLMs)作为合成参与者,以在社会科学实验中生成成本效益高且几乎即时的反应。然而,目前对于何时此类模拟能够支持对人类行为的有效推断的指导有限。我们对比了两种获取因果效应有效估计的策略,并阐明了每种策略在探索性研究与确认性研究中适用的假设。启发式方法试图通过提示工程、模型微调和其他旨在减少LLM引起的不准确性的修复策略,建立模拟和观察到的人类行为是可互换的。尽管对许多探索性任务有用,启发式方法缺乏确认性研究通常所需的正式统计保证。相反,统计校准将辅助人类数据与统计调整相结合,以考虑观察到的反应与模拟反应之间的差异。在明确的假设下,统计校准保持有效性,并提供比仅依赖人类参与者的实验更低成本的因果效应更精确的估计。然而,这两种方法的潜力取决于LLMs对相关人群的近似程度。当研究人员狭隘地关注用LLMs替代人类参与者时,我们考虑了被忽视的机会。
cs.AI / 27 / 2602.15791

Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings

在大型语言模型编码下增强建筑语义保留的人工智能模型训练
Jang, Suhyung, Lee, Ghang, Lee, Jaekun, Lee, Hyunjun
Abstract
Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry. Conventional encoding methods (e.g., one-hot) often fail to convey the nuanced relationships among closely related subtypes, limiting AI's semantic comprehension. To address this limitation, this study proposes a novel training approach that employs large language model (LLM) embeddings (e.g., OpenAI GPT and Meta LLaMA) as encodings to preserve finer distinctions in building semantics. We evaluated the proposed method by training GraphSAGE models to classify 42 building object subtypes across five high-rise residential building information models (BIMs). Various embedding dimensions were tested, including original high-dimensional LLM embeddings (1,536, 3,072, or 4,096) and 1,024-dimensional compacted embeddings generated via the Matryoshka representation model. Experimental results demonstrated that LLM encodings outperformed the conventional one-hot baseline, with the llama-3 (compacted) embedding achieving a weighted average F1-score of 0.8766, compared to 0.8475 for one-hot encoding. The results underscore the promise of leveraging LLM-based encodings to enhance AI's ability to interpret complex, domain-specific building semantics. As the capabilities of LLMs and dimensionality reduction techniques continue to evolve, this approach holds considerable potential for broad application in semantic elaboration tasks throughout the AECO industry.
Chinese Translation
准确表示建筑语义,包括通用对象类型和特定子类型,对于建筑、工程、施工和运营(AECO)行业中有效的人工智能模型训练至关重要。传统编码方法(例如,独热编码)往往无法传达密切相关子类型之间的细微关系,限制了人工智能的语义理解。为了解决这一局限性,本研究提出了一种新颖的训练方法,采用大型语言模型(LLM)嵌入(例如,OpenAI GPT 和 Meta LLaMA)作为编码,以保留建筑语义中的更细微差别。我们通过训练 GraphSAGE 模型对五个高层住宅建筑信息模型(BIM)中的42个建筑对象子类型进行分类来评估所提方法。测试了多种嵌入维度,包括原始高维 LLM 嵌入(1,536、3,072 或 4,096)和通过 Matryoshka 表示模型生成的1,024维紧凑嵌入。实验结果表明,LLM 编码优于传统的独热编码基线,其中 llama-3(紧凑型)嵌入的加权平均 F1 分数为 0.8766,而独热编码为 0.8475。这些结果强调了利用基于 LLM 的编码来增强人工智能解读复杂领域特定建筑语义能力的前景。随着 LLM 和降维技术的不断发展,该方法在 AECO 行业的语义细化任务中具有广泛应用的潜力。
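The compaction step can be sketched as prefix truncation plus re-normalization, which is how Matryoshka-style embeddings are typically shortened; whether the authors re-normalize after truncation is an assumption of this sketch.

```python
import numpy as np

def compact(embedding, dim=1024):
    """Matryoshka-style compaction (sketch): keep the leading `dim`
    coordinates of the LLM embedding and re-normalize, so closely related
    subtypes remain close in the smaller space."""
    e = np.asarray(embedding, dtype=float)[:dim]
    return e / np.linalg.norm(e)

rng = np.random.default_rng(7)
full = rng.normal(size=4096)            # stand-in for a llama-3 class-name embedding
node_feature = compact(full, dim=1024)  # feature vector for one GraphSAGE node
```

Each BIM object's subtype-name embedding, compacted this way, would serve as the node's input feature in place of a one-hot vector.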
cs.AI / 28 / 2602.15816

Developing AI Agents with Simulated Data: Why, what, and how?

利用模拟数据开发人工智能代理:为什么、什么以及如何?
Liu, Xiaoran, David, Istvan
Abstract
As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse synthetic data. This chapter introduces the reader to the key concepts, benefits, and challenges of simulation-based synthetic data generation for AI training purposes, and to a reference framework to describe, design, and analyze digital twin-based AI simulation solutions.
Chinese Translation
由于数据量和质量不足仍然是现代亚符号人工智能(subsymbolic AI)普及的主要障碍,因此合成数据生成技术的需求日益增加。模拟提供了一种适宜的、系统的方法来生成多样化的合成数据。本章向读者介绍了基于模拟的合成数据生成在人工智能训练中的关键概念、优势和挑战,以及一个参考框架,用于描述、设计和分析基于数字孪生(digital twin)的人工智能模拟解决方案。
计算语言学 (Computation and Language)
37
cs.CL / 1 / 2602.15034

EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research

EduResearchBench:一个用于全生命周期教育研究的分层原子任务分解基准
Yue, Houping, Di, Zixiang, Jiang, Mei, Li, Bingdong, Hao, Hao, Song, Yu, Jiang, Bo, Zhou, Aimin
Abstract
While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus lack the fine-grained assessments required to reflect complex academic research workflows. To fill this gap, we introduce EduResearchBench, the first comprehensive evaluation platform dedicated to educational academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework, which decomposes an end-to-end research workflow into six specialized research modules (e.g., Quantitative Analysis, Qualitative Research, and Policy Research) spanning 24 fine-grained atomic tasks. This taxonomy enables an automated evaluation pipeline that mitigates a key limitation of holistic scoring, where aggregate scores often obscure specific capability bottlenecks, and instead provides fine-grained, diagnostic feedback on concrete deficiencies. Moreover, recognizing the high cognitive load inherent in scholarly writing, we propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation. Leveraging 55K raw academic samples, we curate 11K high-quality instruction pairs to train EduWrite, a specialized educational scholarly writing model. Experiments show that EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that in vertical domains, data quality density and hierarchically staged training curricula are more decisive than parameter scale.
Chinese Translation
尽管大型语言模型(LLMs)正在重塑社会科学人工智能(AI4SS)的范式,但在学术写作中严格评估其能力仍然是一个重大挑战。现有基准主要强调单次生成和整体生成,因此缺乏反映复杂学术研究工作流程所需的细粒度评估。为填补这一空白,我们推出了EduResearchBench,这是第一个专门用于教育学术写作的综合评估平台。EduResearchBench基于我们的分层原子任务分解(HATD)框架,该框架将端到端研究工作流程分解为六个专门的研究模块(例如,定量分析、定性研究和政策研究),涵盖24个细粒度原子任务。这一分类法使得自动评估流程成为可能,缓解了整体评分的一个关键局限性,即汇总分数往往掩盖特定能力瓶颈,而是提供了关于具体缺陷的细粒度、诊断性反馈。此外,鉴于学术写作中固有的高认知负荷,我们提出了一种课程学习策略,逐步从基础技能构建到复杂的方法论推理和论证能力。利用55K份原始学术样本,我们整理出11K高质量指令对,以训练EduWrite,一个专门的教育学术写作模型。实验表明,EduWrite(30B)在多个核心指标上显著优于更大的通用模型(72B),证明在垂直领域中,数据质量密度和分层阶段训练课程比参数规模更具决定性。
cs.CL / 2 / 2602.15038

Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

Indic-TunedLens:解读印度语言中的多语言模型
Panchal, Mihir, Varshney, Deeksha, Mamta, Ekbal, Asif
Abstract
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English centric representation spaces, making cross lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/AnonymousAccountACL/IndicTunedLens. Our code is available at https://github.com/AnonymousAccountACL/IndicTunedLens.
Chinese Translation
多语言大型语言模型(LLMs)在印度等语言多样性丰富的地区越来越多地被部署,但大多数可解释性工具仍然以英语为中心。先前的研究表明,LLMs通常在以英语为中心的表示空间中运作,这使得跨语言可解释性成为一个紧迫的问题。我们提出了Indic-TunedLens,一个专门针对印度语言的新型可解释性框架,它学习共享的仿射变换。与标准的Logit Lens直接解码中间激活不同,Indic-TunedLens为每个目标语言调整隐藏状态,将其与目标输出分布对齐,以实现对模型表示的更忠实解码。我们在10种印度语言上使用MMLU基准评估我们的框架,发现它在可解释性方法上显著优于现有的最先进(SOTA)方法,尤其是在形态丰富、资源稀缺的语言中。我们的结果为多语言变换器的层级语义编码提供了重要的见解。我们的模型可在https://huggingface.co/spaces/AnonymousAccountACL/IndicTunedLens获取。我们的代码可在https://github.com/AnonymousAccountACL/IndicTunedLens获取。
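The core tuned-lens idea the framework builds on can be sketched as a learned affine map applied to a mid-layer hidden state before the frozen unembedding; the toy dimensions, identity initialization, and random weights are illustrative, and the shared-across-languages aspect of Indic-TunedLens is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 8, 5                     # toy hidden size and vocabulary
W_U = rng.normal(size=(vocab, d))   # frozen unembedding matrix

def lens_decode(hidden, A, b):
    """Tuned-lens-style probe (sketch): a learned affine map aligns a
    mid-layer hidden state with the final-layer distribution before the
    frozen unembedding is applied; softmax yields token probabilities."""
    aligned = A @ hidden + b
    logits = W_U @ aligned
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

A = np.eye(d)        # identity init: reduces to the plain Logit Lens
b = np.zeros(d)      # A and b would be trained per layer / target language
hidden = rng.normal(size=d)
probs = lens_decode(hidden, A, b)
```

Training adjusts `A` and `b` (per target language, in the Indic-TunedLens setting) so the decoded distribution matches the model's final output.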
cs.CL / 3 / 2602.15139

CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding

CGRA-DeBERTa 概念引导残差增强变换器用于伊斯兰神学理解
Hussain, Tahir, Khan, Saddam Hussain
Abstract
Accurate QA over classical Islamic texts remains challenging due to domain-specific semantics, long-context dependencies, and concept-sensitive reasoning. Therefore, a new CGRA-DeBERTa, a concept-guided residual domain-augmentation transformer framework, is proposed that enhances theological QA over Hadith corpora. CGRA-DeBERTa builds on a customized DeBERTa transformer backbone with lightweight LoRA-based adaptations and a residual concept-aware gating mechanism. The customized DeBERTa embedding block learns global and positional context, while Concept-Guided Residual Blocks incorporate theological priors from a curated Islamic Concept Dictionary of 12 core terms. Moreover, the Concept Gating Mechanism selectively amplifies semantically critical tokens via importance-weighted attention, applying differential scaling from 1.04 to 3.00. This design preserves contextual integrity, strengthens domain-specific semantic representations, and enables accurate, efficient span extraction while maintaining computational efficiency. This paper reports the results of training CGRA using a specially constructed dataset of 42,591 QA pairs from the text of Sahih al-Bukhari and Sahih Muslim. While BERT achieved an EM score of 75.87 and DeBERTa one of 89.77, our model scored 97.85 and thus surpassed them by 8.08 on an absolute scale, all while adding approximately 8 inference overhead due to parameter-efficient gating. The qualitative evaluation noted better extraction, discrimination, and theological precision. This study presents Hadith QA systems that are efficient, interpretable, and accurate, and that at scale provide educational materials with necessary theological nuance.
Chinese Translation
由于领域特定的语义、长上下文依赖和概念敏感推理,针对经典伊斯兰文本的准确问答(QA)仍然具有挑战性。因此,提出了一种新的 CGRA-DeBERTa 概念引导残差领域增强变换器框架,该框架增强了对圣训(Hadith)语料库的神学问答。CGRA-DeBERTa 基于定制的 DeBERTa 变换器骨干,结合基于 LoRA 的轻量级适配和残差概念感知门控机制。定制的 DeBERTa 嵌入块学习全局和位置上下文,而概念引导残差块则结合了来自精心编纂的伊斯兰概念词典中 12 个核心术语的神学先验。此外,概念门控机制通过重要性加权注意力选择性地放大语义关键标记,应用从 1.04 到 3.00 的差异缩放。该设计保持了上下文的完整性,加强了领域特定的语义表示,并在保持计算效率的同时,实现了准确、高效的跨度提取。本文报告了使用从《萨希赫·布哈里》(Sahih al-Bukhari)和《萨希赫·穆斯林》(Sahih Muslim)文本中构建的 42591 对问答对的特别数据集训练 CGRA 的结果。虽然 BERT 的 EM 得分为 75.87,DeBERTa 的得分为 89.77,但我们的模型得分为 97.85,因此在绝对尺度上超越了它们 8.08,同时由于参数高效的门控,增加了大约 8 的推理开销。定性评估显示了更好的提取和区分能力以及神学精确性。本研究展示了高效、可解释且准确的圣训问答系统,并能够为教育材料提供必要的神学细微差别。
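The concept gating mechanism can be sketched as importance-weighted scaling of dictionary-term embeddings within the paper's stated 1.04-3.00 range; the dictionary entries and weights below are hypothetical, not the paper's 12 actual terms.

```python
import numpy as np

# Hypothetical concept dictionary: term -> importance weight.
CONCEPT_WEIGHTS = {"hadith": 3.0, "sunnah": 2.1, "isnad": 1.04}

def concept_gate(tokens, embeddings, lo=1.04, hi=3.0):
    """Concept gating (sketch): amplify the embeddings of dictionary terms
    by an importance weight clipped to the [lo, hi] scaling range; all
    other tokens pass through unchanged."""
    gated = embeddings.copy()
    for i, tok in enumerate(tokens):
        w = CONCEPT_WEIGHTS.get(tok.lower())
        if w is not None:
            gated[i] *= float(np.clip(w, lo, hi))
    return gated

tokens = ["the", "hadith", "was", "narrated"]
embeddings = np.ones((4, 6))  # toy token embeddings
gated = concept_gate(tokens, embeddings)
```

Only the row for "hadith" is amplified; non-concept tokens keep their original representation, preserving contextual integrity.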
cs.CL / 4 / 2602.15190

AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking

AIC CTU@AVerImaTeC:用于图像-文本事实核查的双检索器RAG
Ullrich, Herbert, Drchal, Jan
Abstract
In this paper, we present our 3rd place system in the AVerImaTeC shared task, which combines our last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just $0.013 on average using GPT5.1 via the OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules - a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1 - which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme's running costs and directions for further improvement.
Chinese Translation
在本文中,我们展示了我们在AVerImaTeC共享任务中获得第三名的系统,该系统将我们去年的检索增强生成(RAG)管道与反向图像搜索(RIS)模块相结合。尽管系统设计简单,但我们的系统在每次事实核查中仅使用一次多模态大语言模型(LLM)调用,平均成本仅为0.013美元,且使用的是通过OpenAI Batch API的GPT5.1。我们的系统也易于复现和调整,仅由三个解耦模块组成——一个基于相似性搜索的文本检索模块、一个基于API访问的RIS的图像检索模块,以及一个使用GPT5.1的生成模块,因此我们建议将其作为进一步实验的可访问起点。我们发布了其代码和提示,以及我们的向量存储和对该方案运行成本的见解以及进一步改进的方向。
cs.CL / 5 / 2602.15197

OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

OpaqueToolsBench:通过交互学习工具行为的细微差别
Hallinan, Skyler, Venkatesh, Thejas, Ren, Xiang, Karimireddy, Sai Praneeth, Paranjape, Ashwin, Zhang, Yuhao, Hessel, Jack
Abstract
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
Chinese Translation
工具调用对于大型语言模型(LLM)代理完成现实世界任务至关重要。尽管大多数现有基准假设工具简单且文档完善,但现实世界中的工具(例如,一般的“搜索”API)往往是不透明的,缺乏明确的最佳实践或失败模式。LLM代理能否通过交互并随后改善文档,在不透明工具的环境中提高其性能?为此,我们创建了OpaqueToolsBench,这是一个由三个不同任务导向环境组成的基准:一般函数调用、互动国际象棋游戏和长轨迹代理搜索。每个环境提供未明确指定的工具,模型必须学习有效使用这些工具以完成任务。OpaqueToolsBench上的结果表明,当工具不透明时,现有的自动文档工具的方法既昂贵又不可靠。为了解决这个问题,我们提出了一个简单的框架ToolObserver,通过观察工具调用轨迹中的执行反馈,迭代地完善工具文档。我们的方案在OpaqueToolsBench的各个数据集上优于现有方法,即使在相对困难的设置中。此外,对于测试时的工具探索设置,我们的方法也高效,消耗的总令牌数比最佳基线少3.5-7.5倍。
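The ToolObserver loop can be sketched as: call the tool, record outcomes, and fold observed failures back into the documentation. The `summarize` function stands in for an LLM call, and the toy `search` tool and its failure mode are invented for illustration.

```python
def refine_docs(tool, doc, tasks, summarize, max_rounds=3):
    """ToolObserver-style loop (sketch): run the tool on tasks, collect
    (task, status, detail) traces, and ask a summarizer to fold observed
    failure modes back into the documentation until calls succeed."""
    for _ in range(max_rounds):
        traces = []
        for task in tasks:
            try:
                result = tool(task, doc)
                traces.append((task, "ok", result))
            except Exception as err:
                traces.append((task, "error", str(err)))
        if all(status == "ok" for _, status, _ in traces):
            break  # documentation is now sufficient for these tasks
        doc = summarize(doc, traces)
    return doc

# Toy opaque tool: fails on multi-word queries unless the doc warns about quoting.
def search(task, doc):
    if "quote the query" not in doc and " " in task:
        raise ValueError("unquoted multi-word query")
    return f"results for {task!r}"

# Stand-in for the LLM summarizer: append a best practice when errors appear.
def summarize(doc, traces):
    if any(status == "error" for _, status, _ in traces):
        return doc + " Best practice: quote the query."
    return doc

doc = refine_docs(search, "search(q): general search API.", ["rag survey", "pid"], summarize)
```

After one round of observed failures, the documentation gains the missing best practice and subsequent calls succeed.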
cs.CL / 6 / 2602.15312

Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement

从文本中提取消费者洞察:一种基于大型语言模型的情感与评估测量方法
Ludwig, Stephan, Danaher, Peter J., Yang, Xiaohao, Lin, Yu-Ting, Abedin, Ehsan, Grewal, Dhruv, Du, Lan
Abstract
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice. This study introduces the Linguistic eXtractor (LX), a fine-tuned, large language model trained on consumer-authored text that also has been labeled with consumers' self-reported ratings of 16 consumption-related emotions and four evaluation constructs: trust, commitment, recommendation, and sentiment. LX consistently outperforms leading models, including GPT-4 Turbo, RoBERTa, and DeepSeek, achieving 81% macro-F1 accuracy on open-ended survey responses and greater than 95% accuracy on third-party-annotated Amazon and Yelp reviews. An application of LX to online retail data, using seemingly unrelated regression, affirms that review-expressed emotions predict product ratings, which in turn predict purchase behavior. Most emotional effects are mediated by product ratings, though some emotions, such as discontent and peacefulness, influence purchase directly, indicating that emotional tone provides meaningful signals beyond star ratings. To support its use, a no-code, cost-free, LX web application is available, enabling scalable analyses of consumer-authored text. In establishing a new methodological foundation for consumer perception measurement, this research demonstrates new methods for leveraging large language models to advance marketing research and practice, thereby achieving validated detection of marketing constructs from consumer data.
Chinese Translation
准确测量消费者在非结构化文本中的情感和评估仍然是市场研究和实践中的核心挑战。本研究介绍了语言提取器(Linguistic eXtractor, LX),这是一种经过微调的大型语言模型,训练于消费者创作的文本,并标注了消费者自我报告的16种与消费相关的情感及四个评估构念:信任、承诺、推荐和情感。LX在开放式调查问卷的回答中始终优于领先模型,包括GPT-4 Turbo、RoBERTa和DeepSeek,达到了81%的宏观F1准确率,并在第三方标注的亚马逊和Yelp评论中实现了超过95%的准确率。LX在在线零售数据中的应用,采用看似无关的回归分析,确认了评论中表达的情感能够预测产品评分,而产品评分又能预测购买行为。大多数情感效应通过产品评分中介,尽管一些情感,如不满和宁静,直接影响购买,表明情感基调提供了超越星级评分的有意义信号。为了支持其使用,提供了一个无代码、无成本的LX网络应用程序,使得对消费者创作文本的可扩展分析成为可能。在为消费者感知测量建立新的方法论基础的过程中,本研究展示了利用大型语言模型推动市场研究和实践的新方法,从而实现了对消费者数据中市场构念的有效检测。
cs.CL / 7 / 2602.15313

Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory

Mnemis:在层次图上进行双路检索以实现长期大语言模型记忆
Tang, Zihao, Yu, Xin, Xiao, Ziyu, Wen, Zengxuan, Li, Zelin, Zhou, Jiaxi, Wang, Hualei, Wang, Haohua, Huang, Haizhen, Deng, Weiwei, Sun, Feng, Zhang, Qi
Abstract
AI Memory, specifically how models organize and retrieve historical messages, becomes increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning or comprehensive coverage of all relevant information. In this work, we propose Mnemis, a novel memory framework that integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection. Mnemis organizes memory into a base graph for similarity retrieval and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies. By combining the complementary strengths of both retrieval routes, Mnemis retrieves memory items that are both semantically and structurally relevant. Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
Chinese Translation
人工智能记忆,特别是模型如何组织和检索历史消息,变得对大型语言模型(LLMs)越来越重要。然而,现有的方法(如RAG和Graph-RAG)主要通过基于相似性的机制来检索记忆。尽管这种系统1风格的检索效率较高,但在需要全局推理或全面覆盖所有相关信息的场景中表现不佳。在本研究中,我们提出了Mnemis,这是一种新颖的记忆框架,结合了系统1的相似性搜索和一种称为全局选择(Global Selection)的互补系统2机制。Mnemis将记忆组织为一个基础图以进行相似性检索,以及一个层次图以实现对语义层次的自上而下的深思熟虑遍历。通过结合两条检索路径的互补优势,Mnemis能够检索到在语义和结构上都相关的记忆项。在长期记忆基准测试中,Mnemis在所有比较方法中实现了最先进的性能,在使用GPT-4.1-mini时,LoCoMo得分为93.9,LongMemEval-S得分为91.6。
cs.CL / 8 / 2602.15353

NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering

NeuroSymActive:具有主动探索的可微分神经符号推理用于知识图谱问答
Fu, Rong, Li, Yang, Zhang, Zeyu, Wu, Jiekai, Liu, Yaohua, Cao, Shuaishuai, Zeng, Yangchen, Zhang, Yuhang, Du, Xiaojing, Zhao, Chuang, Cui, Kangning, Fong, Simon
Abstract
Large pretrained language models and neural reasoning systems have advanced many natural language tasks, yet they remain challenged by knowledge-intensive queries that require precise, structured multi-hop inference. Knowledge graphs provide a compact symbolic substrate for factual grounding, but integrating graph structure with neural models is nontrivial: naively embedding graph facts into prompts leads to inefficiency and fragility, while purely symbolic or search-heavy approaches can be costly in retrievals and lack gradient-based refinement. We introduce NeuroSymActive, a modular framework that combines a differentiable neural-symbolic reasoning layer with an active, value-guided exploration controller for Knowledge Graph Question Answering. The method couples soft-unification style symbolic modules with a neural path evaluator and a Monte-Carlo style exploration policy that prioritizes high-value path expansions. Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
Chinese Translation
大型预训练语言模型和神经推理系统在许多自然语言任务中取得了进展,但在需要精确、结构化的多跳推理的知识密集型查询中仍面临挑战。知识图谱提供了一个紧凑的符号基础,用于事实基础,但将图结构与神经模型结合并非易事:简单地将图事实嵌入提示中会导致效率低下和脆弱性,而纯符号或重搜索的方法在检索上可能成本高昂且缺乏基于梯度的细化。我们提出了NeuroSymActive,一个模块化框架,结合了可微分的神经符号推理层和一个主动的、价值引导的探索控制器,用于知识图谱问答。该方法将软统一风格的符号模块与神经路径评估器和优先扩展高价值路径的蒙特卡洛风格探索策略相结合。在标准KGQA基准上的实证结果表明,NeuroSymActive在提高答案准确性的同时,相比于常见的检索增强基线,减少了昂贵的图查找和模型调用次数。
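The active, value-guided exploration can be caricatured as best-first search over a toy knowledge graph with a retrieval budget (the graph, keyword scorer, and budget below are invented; the paper's path evaluator is neural and its exploration policy is Monte-Carlo style):

```python
import heapq

# Toy knowledge graph: head -> [(relation, tail)]. All names hypothetical.
KG = {
    "einstein": [("born_in", "ulm"), ("field", "physics")],
    "ulm": [("located_in", "germany")],
    "physics": [("subfield_of", "science")],
}

def path_value(path, keywords):
    # Stand-in for a neural path evaluator: score by keyword overlap.
    return sum(1 for _, rel, tail in path if rel in keywords or tail in keywords)

def explore(start, target, keywords, budget=4):
    """Value-guided best-first search; `budget` caps expensive KG lookups."""
    lookups = 0
    heap = [(0, start, [])]  # (negated value, node, path so far)
    while heap and lookups < budget:
        _, node, path = heapq.heappop(heap)
        if node == target:
            return path, lookups
        lookups += 1  # one graph retrieval per expansion
        for rel, tail in KG.get(node, []):
            new_path = path + [(node, rel, tail)]
            heapq.heappush(heap, (-path_value(new_path, keywords), tail, new_path))
    return None, lookups
```

Because high-value paths are expanded first, the two-hop answer here is found in two lookups; the low-value "physics" branch is never retrieved, mirroring the paper's goal of reducing expensive graph lookups.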
cs.CL / 9 / 2602.15373

Far Out: Evaluating Language Models on Slang in Australian and Indian English

远离常规:评估语言模型在澳大利亚英语和印度英语中的俚语表现
Dilsiz, Deniz Kaya, Srirag, Dipankar, Joshi, Aditya
Abstract
Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: \textsc{web}, containing 377 web-sourced usage examples from Urban Dictionary, and \textsc{gen}, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$) and target word selection (TWS). Our results reveal three key findings: (1) higher average model performance on TWS versus TWP and TWP$^*$, with average accuracy rising from 0.03 to 0.49; (2) stronger average model performance on \textsc{web} versus \textsc{gen}, with average similarity scores higher by 0.03 and 0.05 on the TWP and TWP$^*$ tasks respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, average accuracy rising from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically well-resourced language such as English.
Chinese Translation
语言模型在处理非标准语言变体的文本时表现出系统性的性能差距,但它们对特定变体俚语的理解能力在多种语言中仍未得到充分探索。我们对七种最先进的语言模型在印度英语(en-IN)和澳大利亚英语(en-AU)中的俚语意识进行了全面评估。我们构建了两个互补的数据集:web,包含来自Urban Dictionary的377个网络来源的用例,以及gen,包含在多种场景下合成生成的1,492个俚语使用示例。我们在三个任务上评估语言模型:目标词预测(TWP)、引导目标词预测(TWP$^*$)和目标词选择(TWS)。我们的结果揭示了三个关键发现:(1)模型在TWS任务上的平均表现高于TWP和TWP$^*$,平均准确率从0.03提高到0.49;(2)模型在web数据集上的平均表现强于gen数据集,在TWP和TWP$^*$任务中的平均相似度得分分别提高了0.03和0.05;(3)在所有模型和数据集的平均表现中,en-IN任务优于en-AU任务,TWS显示出最大的差异,平均准确率从0.44提高到0.54。这些发现强调了生成能力和判别能力在特定变体语言之间的基本不对称性,尤其是在俚语表达的背景下,尽管英语是一种技术资源丰富的语言。
cs.CL / 10 / 2602.15377

Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework

无编排的客户服务自动化:一种隐私保护和流程图引导的框架
Hong, Mengze, Zhang, Chen Jason, Guo, Zichang, Gu, Hanlin, Jiang, Di, Qing, Li
Abstract
Customer service automation has seen growing demand within digital transformation. Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability. This paper introduces an orchestration-free framework using Task-Oriented Flowcharts (TOFs) to enable end-to-end automation without manual intervention. We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues. We emphasize local deployment of small language models and propose decentralized distillation with flowcharts to mitigate data scarcity and privacy issues in model training. Extensive experiments validate the effectiveness in various service tasks, with superior quantitative and application performance compared to strong baselines and market products. By releasing a web-based system demonstration with case studies, we aim to promote streamlined creation of future service automation.
Chinese Translation
客户服务自动化在数字化转型中需求日益增长。现有方法要么依赖于具有广泛代理编排的模块化系统设计,要么采用过于简化的指令模式,提供有限的指导和较差的通用性。本文提出了一种无编排的框架,利用任务导向流程图(Task-Oriented Flowcharts, TOFs)实现端到端的自动化,无需人工干预。我们首先定义了TOFs的组件和评估指标,然后形式化了一种成本高效的流程图构建算法,以从服务对话中抽象程序知识。我们强调小型语言模型的本地部署,并提出利用流程图进行去中心化蒸馏,以缓解模型训练中的数据稀缺和隐私问题。大量实验验证了在各种服务任务中的有效性,与强基线和市场产品相比,表现出更优的定量和应用性能。通过发布一个基于网络的系统演示和案例研究,我们旨在促进未来服务自动化的简化创建。
cs.CL / 11 / 2602.15378

Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language

让大型语言模型说图卢语:针对极低资源语言的结构化提示
Devadiga, Prathamesh, Chopra, Paras
Abstract
Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over 2 million speakers but minimal digital presence. Rather than fine-tuning an LLM, we examine whether structured prompts alone can elicit basic conversational ability under controlled prompting. We systematically tackle various challenges posed by absence of training data for Tulu by combining explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis reveals that negative constraints provide consistent improvements (12--18 percentage points), while grammar documentation effects vary by model architecture (8--22 points).
Chinese Translation
大型语言模型能否在几乎没有训练数据的语言中进行对话?我们通过对图卢语的案例研究来探讨这个问题。图卢语是一种德拉威语,拥有超过200万的使用者,但在数字领域几乎没有存在。我们并不对大型语言模型(LLM)进行微调,而是考察仅通过结构化提示是否能够在受控提示下引发基本的对话能力。我们系统性地解决了图卢语缺乏训练数据所带来的各种挑战,结合了明确的语法文档、抑制相关语言高概率词汇的负约束、罗马化标准化以及通过自我对弈生成的质量控制合成数据。我们在三个大型语言模型(Gemini 2.0 Flash、GPT-4o、Llama 3.1 70B)上评估了手动策划的保留集,并由母语者验证,我们的方法将词汇污染从80%降低到5%,同时实现了85%的语法准确率。跨模型分析表明,负约束提供了一致的改进(12-18个百分点),而语法文档的效果因模型架构而异(8-22个百分点)。
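The paper applies its negative constraints through prompts; mechanically, the effect they aim for can be sketched as a logit penalty on contaminating tokens from related languages (the vocabulary, logits, and penalty below are invented for illustration):

```python
from math import exp

def softmax(logits):
    m = max(logits.values())
    exps = {t: exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def apply_negative_constraints(logits, banned, penalty=10.0):
    # Push down high-probability tokens from related languages so the
    # model falls back to target-variety vocabulary.
    return {t: v - (penalty if t in banned else 0.0) for t, v in logits.items()}

# Hypothetical next-token logits: a related-language form outranks the Tulu one.
logits = {"baruttade": 2.0, "barpundu": 1.5, "the": 0.1}
banned = {"baruttade"}  # hypothetical contamination from a related language

before = softmax(logits)
after = softmax(apply_negative_constraints(logits, banned))
top_before = max(before, key=before.get)
top_after = max(after, key=after.get)
```

Before the constraint the contaminating form wins; after it, the target-variety form does, which is the vocabulary-contamination reduction the paper measures (80% down to 5%).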
cs.CL / 12 / 2602.15382

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

视觉虫洞:异构多智能体系统中的潜在空间通信
Liu, Xiaoze, Zhang, Ruowang, Yu, Weichen, Xiong, Siheng, He, Liu, Wu, Feijie, Jung, Hoin, Fredrikson, Matt, Wang, Xiaoqian, Gao, Jing
Abstract
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas
Chinese Translation
由大型语言模型驱动的多智能体系统(MAS)解锁了先进的协同推理能力,但仍然受限于离散文本通信的低效性,这带来了显著的运行时开销和信息量化损失。虽然潜在状态转移提供了一种高带宽的替代方案,但现有方法要么假设发送者-接收者架构是同质的,要么依赖于特定对的学习翻译器,这限制了在具有不相交流形的多样模型家族中的可扩展性和模块化。在本研究中,我们提出了视觉虫洞(Vision Wormhole),一个新颖的框架,重新利用视觉语言模型(VLMs)的视觉接口,以实现模型无关、无文本的通信。通过引入通用视觉编码器(Universal Visual Codec),我们将异构推理轨迹映射到共享的连续潜在空间,并直接注入到接收者的视觉通路中,有效地将视觉编码器视为智能体间心灵感应的通用端口。我们的框架采用中心-辐射拓扑,将成对对齐的复杂度从 O(N^2) 降低到 O(N),并利用无标签的教师-学生蒸馏目标,将高速视觉通道与文本通路的稳健推理模式对齐。针对异构模型家族(例如 Qwen-VL、Gemma)进行的广泛实验表明,视觉虫洞在控制比较中减少了端到端的实际时间,同时保持了与标准基于文本的 MAS 相当的推理保真度。代码可在 https://github.com/xz-liu/heterogeneous-latent-mas 获取。
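The claimed O(N^2)-to-O(N) reduction is simple counting, sketched here (adapter counts are the standard bookkeeping for pairwise vs. hub topologies, not figures from the paper):

```python
def pairwise_translators(n):
    # One learned translator per ordered sender-receiver pair: O(N^2).
    return n * (n - 1)

def hub_and_spoke_adapters(n):
    # Hub topology: each model needs one encoder into the shared latent
    # space and one decoder out of it: O(N).
    return 2 * n

# For 10 heterogeneous agents the saving is already large.
pairwise = pairwise_translators(10)   # 90 direction-specific translators
hub = hub_and_spoke_adapters(10)      # 20 adapters
```

Adding an eleventh model costs 2 new adapters under the hub, versus 20 new pairwise translators.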
cs.CL / 13 / 2602.15436

Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs

通过参与测量社会融合:使用大型语言模型对流亡卡累利亚人访谈档案中的组织和休闲活动进行分类
Laato, Joonatan, Schroderus, Veera, Kanerva, Jenna, Kauppi, Jenni, Lummaa, Virpi, Ginter, Filip
Abstract
Digitized historical archives make it possible to study everyday social life on a large scale, but the information extracted directly from text often does not directly allow one to answer the research questions posed by historians or sociologists in a quantitative manner. We address this problem in a large collection of Finnish World War II Karelian evacuee family interviews. Prior work extracted more than 350K mentions of leisure time activities and organizational memberships from these interviews, yielding 71K unique activity and organization names -- far too many to analyze directly. We develop a categorization framework that captures key aspects of participation (the kind of activity/organization, how social it typically is, how regularly it happens, and how physically demanding it is). We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale. Using a simple voting approach across multiple model runs, we find that an open-weight LLM can closely match expert judgments. Finally, we apply the method to label the 350K entities, producing a structured resource for downstream studies of social integration and related outcomes.
Chinese Translation
数字化历史档案使得大规模研究日常社会生活成为可能,但从文本中直接提取的信息往往无法以定量方式直接回答历史学家或社会学家提出的研究问题。我们在大量芬兰第二次世界大战卡累利亚撤离家庭访谈中解决了这一问题。之前的研究从这些访谈中提取了超过35万次休闲活动和组织成员资格的提及,产生了71,000个独特的活动和组织名称——数量过多,无法直接分析。我们开发了一个分类框架,捕捉参与的关键方面(活动/组织的类型、通常的社交程度、发生的频率以及身体要求的强度)。我们注释了一个标准集,以便进行可靠的评估,然后测试大型语言模型是否能够在规模上应用相同的框架。通过在多个模型运行中使用简单的投票方法,我们发现开放权重的LLM可以与专家判断相匹配。最后,我们将该方法应用于标记35万个实体,生成了一个结构化资源,以供后续研究社会融合及相关结果使用。
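The "simple voting approach across multiple model runs" can be sketched directly (the activities, labels, and tie-breaking rule below are invented; the paper's schema has several categorization dimensions):

```python
from collections import Counter

def vote(labels):
    # Majority vote across model runs; ties broken alphabetically so the
    # result is deterministic.
    counts = Counter(labels)
    top = max(counts.values())
    return min(l for l, c in counts.items() if c == top)

# Hypothetical per-activity labels from three LLM runs.
runs = {
    "choir practice": ["social", "social", "cultural"],
    "field work": ["physical", "physical", "physical"],
}
labels = {activity: vote(r) for activity, r in runs.items()}
```

Aggregating several runs this way smooths out individual-run noise before the schema is applied to all 350K entities.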
cs.CL / 14 / 2602.15449

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

TAROT:基于测试驱动和能力自适应的课程强化微调用于大型语言模型的代码生成
Park, Chansung, Jiang, Juyong, Wang, Fan, Paul, Sayak, Shen, Jiasi, Tang, Jing, Li, Jianguo
Abstract
Large Language Models (LLMs) are reshaping the coding paradigm, a shift known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
Chinese Translation
大型语言模型(LLMs)正在改变编码范式,被称为氛围编码,但合成算法复杂且稳健的代码仍然是一个关键挑战。激励LLMs的深度推理能力对于克服这一障碍至关重要。强化微调(RFT)已成为应对这一需求的有前景的策略。然而,大多数现有方法忽视了测试用例固有的异质性难度和粒度,导致奖励信号的分布不平衡,从而在训练过程中产生偏向的梯度更新。为了解决这个问题,我们提出了基于测试驱动和能力自适应的课程强化微调(TAROT)。TAROT为每个问题系统地构建了一个四层测试套件(基本、中级、复杂、边缘),为课程设计和评估提供了一个受控的难度景观。关键是,TAROT将课程进展与原始奖励分数解耦,使得能够基于能力进行评估,并从课程策略组合中进行原则性选择,而不是偶然的测试用例难度组合。这一设计促进了稳定的优化和更高效的能力获取。大量实验结果表明,RFT在代码生成中的最佳课程与模型的固有能力密切相关,能力较弱的模型在简单到困难的进展中获得更大的收益,而更有能力的模型在困难优先的课程下表现更佳。TAROT提供了一种可重复的方法,能够自适应地根据模型的能力调整课程设计,从而持续提高生成代码的功能正确性和稳健性。所有代码和数据均已发布,以促进可重复性并推动社区研究,网址为 https://github.com/deep-diver/TAROT。
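The capability-adaptive curriculum selection over the four test tiers can be sketched as follows (the 0.5 capability threshold and the two-policy portfolio are invented simplifications; the paper selects from a portfolio of curriculum policies):

```python
TIERS = ["basic", "intermediate", "complex", "edge"]

def order_curriculum(capability, problems):
    """Pick a curriculum policy from a capability score in [0, 1]:
    weaker models get easy-to-hard, stronger models get hard-first."""
    easy_to_hard = sorted(problems, key=lambda p: TIERS.index(p["tier"]))
    return easy_to_hard if capability < 0.5 else easy_to_hard[::-1]

problems = [
    {"id": 1, "tier": "edge"}, {"id": 2, "tier": "basic"},
    {"id": 3, "tier": "complex"}, {"id": 4, "tier": "intermediate"},
]
weak_order = [p["tier"] for p in order_curriculum(0.2, problems)]
strong_order = [p["tier"] for p in order_curriculum(0.9, problems)]
```

A weak model thus sees `basic → intermediate → complex → edge`, a strong one the reverse, matching the paper's empirical finding about which progression helps which models.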
cs.CL / 15 / 2602.15456

In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations

我们信任代理,但代理信任谁?潜在来源偏好引导大语言模型生成
Khan, Mohammad Aflah, Amani, Mahsa, Das, Soumi, Ghosh, Bishwamittra, Wu, Qinyuan, Gummadi, Krishna P., Gupta, Manish, Ravichander, Abhilasha
Abstract
Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms. These agents filter, prioritize, and synthesize information retrieved from the platforms' back-end databases or via web search. In these scenarios, LLM agents govern the information users receive, by drawing users' attention to particular instances of retrieved information at the expense of others. While much prior work has focused on biases in the information LLMs themselves generate, less attention has been paid to the factors that influence what information LLMs select and present to users. We hypothesize that when information is attributed to specific sources (e.g., particular publishers, journals, or platforms), current LLMs exhibit systematic latent source preferences- that is, they prioritize information from some sources over others. Through controlled experiments on twelve LLMs from six model providers, spanning both synthetic and real-world tasks, we find that several models consistently exhibit strong and predictable source preferences. These preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist despite explicit prompting to avoid them. They also help explain phenomena such as the observed left-leaning skew in news recommendations in prior work. Our findings advocate for deeper investigation into the origins of these preferences, as well as for mechanisms that provide users with transparency and control over the biases guiding LLM-powered agents.
Chinese Translation
基于大语言模型(LLMs)的代理越来越多地被部署为在线平台信息的接口。这些代理过滤、优先排序并综合从平台后端数据库或通过网络搜索获取的信息。在这些场景中,LLM代理主导用户接收到的信息,通过将用户的注意力引导至特定的信息实例,而忽略其他信息。尽管之前的许多研究集中于LLM自身生成的信息偏见,但对影响LLM选择和呈现给用户的信息的因素关注较少。我们假设,当信息被归因于特定来源(例如,特定出版商、期刊或平台)时,当前的LLM表现出系统性的潜在来源偏好——即它们优先考虑来自某些来源的信息。通过对六个模型提供商的十二个LLM进行控制实验,涵盖合成和现实世界任务,我们发现多个模型一致表现出强烈且可预测的来源偏好。这些偏好对上下文框架敏感,能够超越内容本身的影响,并且在明确提示避免这些偏好时仍然存在。它们还帮助解释了先前研究中观察到的新闻推荐的左倾偏差现象。我们的发现倡导对这些偏好的起源进行更深入的研究,并呼吁建立机制,以便为用户提供透明度和对引导LLM驱动代理的偏见的控制。
cs.CL / 16 / 2602.15504

Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit

朝向语言中的期望检测:关于Reddit上治疗期望的案例研究
Velutharambath, Aswathy, Wührl, Amelie
Abstract
Patients' expectations towards their treatment have a substantial effect on the treatments' success. While primarily studied in clinical settings, online patient platforms like medical subreddits may hold complementary insights: treatment expectations that patients feel unnecessary or uncomfortable to share elsewhere. Despite this, no studies examine what type of expectations users discuss online and how they express them. Presumably this is because expectations have not been studied in natural language processing (NLP) before. Therefore, we introduce the task of Expectation Detection, arguing that expectations are relevant for many applications, including opinion mining and product design. Subsequently, we present a case study for the medical domain, where expectations are particularly crucial to extract. We contribute RedHOTExpect, a corpus of Reddit posts (4.5K posts) to study expectations in this context. We use a large language model (LLM) to silver-label the data and validate its quality manually (label accuracy ~78%). Based on this, we analyze which linguistic patterns characterize expectations and explore what patients expect and why. We find that optimism and proactive framing are more pronounced in posts about physical or treatment-related illnesses compared to mental-health contexts, and that in our dataset, patients mostly discuss benefits rather than negative outcomes. The RedHOTExpect corpus can be obtained from https://www.ims.uni-stuttgart.de/data/RedHOTExpect
Chinese Translation
患者对其治疗的期望对治疗的成功有着显著影响。虽然这一主题主要在临床环境中进行研究,但像医疗子版块这样的在线患者平台可能提供补充性的见解:患者在其他地方感到不必要或不舒服分享的治疗期望。尽管如此,目前尚无研究探讨用户在线讨论的期望类型及其表达方式。这可能是因为期望在自然语言处理(NLP)领域尚未被研究。因此,我们引入期望检测这一任务,认为期望对许多应用(包括意见挖掘和产品设计)具有重要意义。随后,我们呈现一个医疗领域的案例研究,在该领域中,提取期望尤为关键。我们贡献了RedHOTExpect,一个包含4.5K条Reddit帖子的数据集,以研究该背景下的期望。我们使用大型语言模型(LLM)对数据进行初步标注,并手动验证其质量(标注准确率约为78%)。基于此,我们分析了哪些语言模式特征化期望,并探讨患者的期望及其原因。我们发现,与心理健康背景相比,关于身体或治疗相关疾病的帖子中,乐观和积极的框架更加明显,并且在我们的数据集中,患者主要讨论益处而非负面结果。RedHOTExpect语料库可从https://www.ims.uni-stuttgart.de/data/RedHOTExpect获取。
cs.CL / 17 / 2602.15506

LuxMT Technical Report

LuxMT 技术报告
Rehlinger, Nils
Abstract
We introduce LuxMT, a machine translation system based on Gemma 3 27B and fine-tuned for translation from Luxembourgish (LB) into French (FR) and English (EN). To assess translation performance, we construct a novel benchmark covering LB-FR, LB-EN, and LB-DE using human-translated data from Luci, a tourist magazine about Luxembourg. Training data stems from LuxAlign, a parallel corpus of multilingual Luxembourgish news articles, and LB parliamentary transcripts augmented with Google Translate. We filter the data using LuxEmbedder, LB sentence embeddings, to remove low-equivalence segment-pairs. Overall, LuxMT's results suggest strong improvements over the Gemma 3 baseline, even for translating LB to German (DE), despite the training data not containing any DE. We also explore LuxEmbedder's potential to be used as a quality estimation metric and find strong correlations with other reference-based metrics. However, we call for further research to fully assess the metric's utility and advise using it with caution.
Chinese Translation
我们介绍了 LuxMT,一个基于 Gemma 3 27B 的机器翻译系统,经过微调以实现卢森堡语(LB)到法语(FR)和英语(EN)的翻译。为了评估翻译性能,我们构建了一个新颖的基准,涵盖 LB-FR、LB-EN 和 LB-DE,使用来自 Luci 的人类翻译数据,Luci 是一本关于卢森堡的旅游杂志。训练数据来自 LuxAlign,一个多语言卢森堡新闻文章的平行语料库,以及与 Google Translate 增强的 LB 议会记录。我们使用 LuxEmbedder(LB 句子嵌入)过滤数据,以去除低等价的段落对。总体而言,LuxMT 的结果表明,相较于 Gemma 3 基线,翻译性能有显著提升,即使在将 LB 翻译为德语(DE)时也表现良好,尽管训练数据中并未包含任何 DE。我们还探讨了 LuxEmbedder 作为质量评估指标的潜力,并发现其与其他基于参考的指标之间存在强相关性。然而,我们呼吁进一步研究以全面评估该指标的实用性,并建议谨慎使用。
cs.CL / 18 / 2602.15509

Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination

精细精炼:迭代细粒度精炼以减轻对话幻觉
Chen, Xiangyan, Gan, Yujian, Purver, Matthew
Abstract
The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems. Such hallucinations produce factually incorrect responses that may mislead users and undermine system trust. Existing refinement methods for dialogue systems typically operate at the response level, overlooking the fact that a single response may contain multiple verifiable or unverifiable facts. To address this gap, we propose Fine-Refine, a fine-grained refinement framework that decomposes responses into atomic units, verifies each unit using external knowledge, assesses fluency via perplexity, and iteratively corrects granular errors. We evaluate factuality across the HybriDialogue and OpendialKG datasets in terms of factual accuracy (fact score) and coverage (Not Enough Information Proportion), and experiments show that Fine-Refine substantially improves factuality, achieving up to a 7.63-point gain in dialogue fact score, with a small trade-off in dialogue quality.
Chinese Translation
当前大型语言模型(LLMs)中幻觉的倾向对对话系统产生了负面影响。这种幻觉会产生事实不准确的回应,可能误导用户并削弱系统的信任。现有的对话系统精炼方法通常在回应层面进行操作,忽视了单一回应可能包含多个可验证或不可验证的事实。为了解决这一问题,我们提出了Fine-Refine,一个细粒度精炼框架,它将回应分解为原子单元,利用外部知识验证每个单元,通过困惑度评估流畅性,并迭代修正细微错误。我们在HybriDialogue和OpendialKG数据集上评估了事实性,关注事实准确性(事实分数)和覆盖率(信息不足比例),实验表明Fine-Refine显著提高了事实性,在对话事实分数上实现了高达7.63点的提升,同时对话质量的折衷较小。
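The decompose-verify-correct loop can be sketched on a toy factual claim (the knowledge base, the clause-level splitter, and the string-replacement correction below are invented stand-ins; Fine-Refine uses external knowledge retrieval and a perplexity-based fluency check):

```python
# Toy knowledge base standing in for external retrieval.
KB = {"paris": "france", "berlin": "germany"}

def decompose(response):
    # Atomic units: toy splitter assuming lowercase "X is in Y" clauses.
    units = []
    for clause in response.rstrip(".").split(" and "):
        city, country = clause.split(" is in ")
        units.append((city.strip(), country.strip()))
    return units

def verify(unit):
    city, country = unit
    if city not in KB:
        return "not_enough_info"
    return "supported" if KB[city] == country else "refuted"

def refine(response, max_iters=3):
    for _ in range(max_iters):
        bad = [u for u in decompose(response) if verify(u) == "refuted"]
        if not bad:
            return response
        for city, wrong in bad:  # granular correction, one unit at a time
            response = response.replace(f"{city} is in {wrong}",
                                        f"{city} is in {KB[city]}")
    return response
```

Only the refuted atomic unit is rewritten; verifiable units and units with no supporting evidence are left untouched, which is what makes the refinement fine-grained.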
cs.CL / 19 / 2602.15514

DependencyAI: Detecting AI Generated Text through Dependency Parsing

DependencyAI:通过依存解析检测AI生成文本
Ahmed, Sara, Hammond, Tracy
Abstract
As large language models (LLMs) become increasingly prevalent, reliable methods for detecting AI-generated text are critical for mitigating potential risks. We introduce DependencyAI, a simple and interpretable approach for detecting AI-generated text using only the labels of linguistic dependency relations. Our method achieves competitive performance across monolingual, multi-generator, and multilingual settings. To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text. We also observe a systematic overprediction of certain models on unseen domains, suggesting that generator-specific writing styles may affect cross-domain generalization. Overall, our results demonstrate that dependency relations alone provide a robust signal for AI-generated text detection, establishing DependencyAI as a strong linguistically grounded, interpretable, and non-neural network baseline.
Chinese Translation
随着大型语言模型(LLMs)的日益普及,可靠的AI生成文本检测方法对于减轻潜在风险至关重要。我们提出了DependencyAI,这是一种简单且可解释的方法,仅使用语言依存关系的标签来检测AI生成的文本。我们的方法在单语、多生成器和多语言环境中均表现出竞争力。为了提高可解释性,我们分析了特征重要性,以揭示区分AI生成文本与人类撰写文本的句法结构。我们还观察到某些模型在未见领域上的系统性过度预测,这表明生成器特定的写作风格可能会影响跨领域的泛化能力。总体而言,我们的结果表明,仅依赖依存关系就能为AI生成文本检测提供强有力的信号,确立了DependencyAI作为一个强大的、以语言学为基础的、可解释的非神经网络基线。
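Using only dependency-relation labels as features can be sketched with relative label frequencies and a nearest-centroid classifier (the parses, label set, and classifier are invented; the paper analyzes feature importance over a presumably stronger non-neural model):

```python
LABELS = ["nsubj", "dobj", "amod", "advmod", "det"]

def featurize(dep_labels):
    # Relative frequency of each dependency relation label in a document.
    total = len(dep_labels) or 1
    return [dep_labels.count(l) / total for l in LABELS]

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(LABELS))]

def classify(vec, centroids):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: dist(vec, centroids[c]))

# Hypothetical parses: AI text here leans on determiners and modifiers.
human_docs = [["nsubj", "dobj", "advmod"], ["nsubj", "dobj", "nsubj"]]
ai_docs = [["det", "amod", "det", "nsubj"], ["amod", "det", "amod"]]

centroids = {"human": centroid([featurize(d) for d in human_docs]),
             "ai": centroid([featurize(d) for d in ai_docs])}
pred = classify(featurize(["det", "amod", "nsubj", "det"]), centroids)
```

Because the features are just label frequencies, the decision is directly inspectable, which is the interpretability argument the paper makes.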
cs.CL / 20 / 2602.15521

ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

ExpertWeaver:利用GLU激活模式解锁稠密大规模语言模型中的固有专家混合机制
Zhao, Ziyu, Zhu, Tong, Zhang, Zhi, Fan, Tiantian, Yang, Jinluan, Kuang, Kun, Wei, Zhongyu, Wu, Fei, Cheng, Yu
Abstract
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: \textbf{dynamic structural pruning} that converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and \textbf{downcycling} approaches that use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
Chinese Translation
专家混合模型(Mixture-of-Experts, MoE)通过稀疏的专家激活有效地扩展模型容量,同时保持计算效率。然而,从头训练高质量的MoE代价高昂。一种有前景的替代方案是将预训练的稠密模型转换为稀疏的MoE。现有的稠密到MoE的方法分为两类:动态结构剪枝,将稠密模型转换为具有适度稀疏性的MoE架构,以平衡性能和推理效率;以及下循环方法,利用预训练的稠密模型初始化高度稀疏的MoE架构。然而,现有方法破坏了稠密模型内部的激活模式,导致专家构建的次优。在本研究中,我们认为门控线性单元(Gated Linear Unit, GLU)机制为稠密到MoE的转换提供了自然的蓝图。我们展示了GLU的细粒度神经元激活模式揭示了一种粗粒度结构,揭示了由持续激活的通用神经元和动态激活的专业神经元组成的固有MoE架构。利用这一发现,我们引入了ExpertWeaver,一个无训练的框架,根据神经元的激活模式对其进行分区,并构建具有层自适应配置的共享专家和专业路由专家。我们的实验表明,ExpertWeaver在作为无训练的动态结构剪枝技术和作为优越的MoE初始化下循环策略方面,显著优于现有方法。
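The universal-vs-specialized neuron split can be sketched from per-neuron activation frequencies (the threshold, frequencies, and round-robin grouping below are invented; ExpertWeaver derives its partition from actual GLU gate statistics and uses layer-adaptive configurations):

```python
def partition_neurons(act_freq, universal_thresh=0.9, n_experts=2):
    """Split GLU neurons by how often they activate across inputs:
    consistently-on neurons form a shared expert; the rest are dealt
    round-robin into routed experts (a stand-in for real clustering)."""
    shared = [i for i, f in enumerate(act_freq) if f >= universal_thresh]
    specialized = [i for i, f in enumerate(act_freq) if f < universal_thresh]
    routed = [specialized[e::n_experts] for e in range(n_experts)]
    return shared, routed

# Hypothetical per-neuron activation frequencies from GLU gate statistics.
freq = [0.95, 0.2, 0.98, 0.4, 0.1, 0.92, 0.3, 0.05]
shared, routed = partition_neurons(freq)
```

Every neuron lands in exactly one expert, so the conversion is training-free: existing dense weights are merely regrouped, not retrained.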
cs.CL / 21 / 2602.15537

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

ZeroSyl:一种简单的零资源音节分词方法用于口语语言建模
Visser, Nicol, Malan, Simon, Slabbert, Danel, Kamper, Herman
Abstract
Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
Chinese Translation
纯语音语言模型旨在直接从原始音频中学习语言,而无需文本资源。一个主要挑战是,自监督语音编码器生成的离散标记导致序列过长,这促使了近期对类似音节单元的研究。然而,像Sylber和SyllableLM的方法依赖于复杂的多阶段训练流程。我们提出了ZeroSyl,这是一种简单的无训练方法,能够直接从冻结的WavLM模型中提取音节边界和嵌入。通过使用WavLM中间层特征的L2范数,ZeroSyl实现了具有竞争力的音节分割性能。生成的片段经过均值池化,使用K-means进行离散化,并用于训练语言模型。ZeroSyl在词汇、句法和叙事基准测试中优于之前的音节分词器。扩展实验表明,尽管更细粒度的单元对词汇任务有利,但我们发现的音节单元在句法建模中表现出更好的扩展性。
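A guess at the mechanism, in sketch form: place syllable boundaries at local minima of the per-frame feature L2 norm, then mean-pool each segment (the frame features are invented, and the paper does not state that plain local minima are the exact boundary rule; K-means discretization is omitted here):

```python
from math import sqrt

def l2_norms(frames):
    return [sqrt(sum(x * x for x in f)) for f in frames]

def boundaries(norms):
    # Local minima of the per-frame norm as candidate syllable boundaries.
    return [i for i in range(1, len(norms) - 1)
            if norms[i] < norms[i - 1] and norms[i] < norms[i + 1]]

def mean_pool(frames, cuts):
    segs, edges = [], [0] + cuts + [len(frames)]
    for a, b in zip(edges, edges[1:]):
        seg = frames[a:b]
        segs.append([sum(f[d] for f in seg) / len(seg) for d in range(len(seg[0]))])
    return segs

# Hypothetical frame features: two high-energy humps with a dip between.
frames = [[2.0, 0.0], [3.0, 0.0], [0.5, 0.0], [0.0, 3.0], [0.0, 2.0]]
cuts = boundaries(l2_norms(frames))
syllables = mean_pool(frames, cuts)
```

The five frames collapse to two syllable-level embeddings, illustrating how syllabic units shorten the token sequence a speech language model must handle.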
cs.CL / 22 / 2602.15540

Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite

Perspectives - 话语分析工具套件中的互动文档聚类
Fischer, Tim, Biemann, Chris
Abstract
This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives's interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.
Chinese Translation
本文介绍了 Perspectives,这是一个话语分析工具套件的互动扩展,旨在帮助数字人文学科(Digital Humanities, DH)学者探索和组织大量非结构化文档集合。Perspectives 实现了一个灵活的、以方面为中心的文档聚类流程,并具有人机协作的精细化能力。我们展示了如何通过文档重写提示和基于指令的嵌入定义分析视角来初步引导这一过程,并通过聚类精细化工具和嵌入模型微调机制进一步与用户意图对齐。该演示突出了一个典型的工作流程,展示了 DH 研究人员如何利用 Perspectives 的互动文档地图来发现主题、情感或其他相关类别,从而获得洞察并为后续深入分析准备数据。
cs.CL / 23 / 2602.15547

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

jina-embeddings-v5-text:任务导向的嵌入蒸馏
Akram, Mohammad Kalim, Sturua, Saba, Havriushenko, Nastia, Herreros, Quentin, Günther, Michael, Werk, Maximilian, Xiao, Han
Abstract
Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
Chinese Translation
文本嵌入模型广泛应用于语义相似性任务,包括信息检索、聚类和分类。通用模型通常使用对比损失函数通过单阶段或多阶段过程进行训练。我们提出了一种新颖的训练方案,将模型蒸馏技术与任务特定的对比损失相结合,以生成紧凑且高性能的嵌入模型。我们的研究结果表明,这种方法在训练小型模型时比单纯的对比或基于蒸馏的训练范式更为有效。所得到的模型,jina-embeddings-v5-text-small 和 jina-embeddings-v5-text-nano,在相似规模的模型中,其基准得分超过或匹配了当前的最先进水平。此外,jina-embeddings-v5-text 模型还支持多种语言的长文本(最多 32k 个标记),并生成在截断和二进制量化下依然稳健的嵌入。模型权重已公开发布,希望能激励嵌入模型开发的进一步进展。
cs.CL / 24 / 2602.15564

Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL

超越静态管道:学习动态工作流用于文本到SQL
Wang, Yihan, Liu, Peiyu, Chen, Runyu, Xu, Wei
Abstract
Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries. The codes are available at https://github.com/Satissss/SquRL
Chinese Translation
文本到SQL最近取得了显著进展,但在实际场景中的有效应用仍然困难。这一差距源于对单一静态工作流的依赖,根本上限制了在分布外和长尾场景中的可扩展性。我们尝试使系统能够在推理时自适应地构建工作流,而不是要求用户通过广泛的实验选择合适的方法。通过理论和实证分析,我们证明了最优动态策略始终优于最佳静态工作流,其性能提升主要源于候选工作流之间的异质性。基于此,我们提出了SquRL,一个增强LLM(大语言模型)在自适应工作流构建中推理能力的强化学习框架。我们设计了一种基于规则的奖励函数,并引入了两种有效的训练机制:动态演员掩蔽以鼓励更广泛的探索,以及伪奖励以提高训练效率。在广泛使用的文本到SQL基准测试中的实验表明,动态工作流构建始终优于最佳静态工作流方法,尤其在复杂和分布外查询上表现出显著的提升。代码可在 https://github.com/Satissss/SquRL 获取。
cs.CL / 25 / 2602.15578

Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations

基于临床启发的症状引导抑郁检测:来自情感感知语音表示的研究
Nerella, Chaithra, Yarra, Chiranjeevi
Abstract
Depression manifests through a diverse set of symptoms such as sleep disturbance, loss of interest, and concentration difficulties. However, most existing works treat depression prediction either as a binary label or an overall severity score without explicitly modeling symptom-specific information. This limits their ability to provide symptom-level analysis relevant to clinical screening. To address this, we propose a symptom-specific and clinically inspired framework for depression severity estimation from speech. Our approach uses a symptom-guided cross-attention mechanism that aligns PHQ-8 questionnaire items with emotion-aware speech representations to identify which segments of a participant's speech are more important to each symptom. To account for differences in how symptoms are expressed over time, we introduce a learnable symptom-specific parameter that adaptively controls the sharpness of attention distributions. Our results on EDAIC, a standard clinical-style dataset, demonstrate improved performance outperforming prior works. Further, analyzing the attention distributions showed that higher attention is assigned to utterances containing cues related to multiple depressive symptoms, highlighting the interpretability of our approach. These findings outline the importance of symptom-guided and emotion-aware modeling for speech-based depression screening.
Chinese Translation
抑郁症通过多种症状表现出来,如睡眠障碍、兴趣丧失和注意力困难。然而,大多数现有研究将抑郁预测视为二元标签或整体严重性评分,而未明确建模症状特定信息。这限制了它们在提供与临床筛查相关的症状级分析方面的能力。为了解决这个问题,我们提出了一种基于症状特定和临床启发的抑郁严重性估计框架,该框架利用语音数据。我们的方法使用症状引导的交叉注意机制,将PHQ-8问卷项目与情感感知语音表示对齐,以识别参与者语音中与每个症状更相关的片段。为了考虑症状在时间上表达的差异,我们引入了一个可学习的症状特定参数,该参数自适应地控制注意力分布的锐度。我们在标准临床风格数据集EDAIC上的结果表明,性能得到了改善,超越了之前的研究。此外,对注意力分布的分析显示,对包含与多种抑郁症状相关线索的发言分配了更高的注意力,突显了我们方法的可解释性。这些发现强调了症状引导和情感感知建模在基于语音的抑郁筛查中的重要性。
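The symptom-guided attention with a learnable sharpness parameter could be sketched as follows; the tensor shapes, the per-symptom temperature parameterization, and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def symptom_attention(Q, K, V, log_tau):
    """Cross-attention from symptom queries Q (S x d) to speech-segment
    keys/values K (T x d), V (T x dv). Each symptom s gets its own
    learnable temperature exp(log_tau[s]) that sharpens or flattens its
    attention distribution; this parameterization is an assumption."""
    d = Q.shape[1]
    scores = (Q @ K.T) / np.sqrt(d)              # (S, T) alignment scores
    scores = scores / np.exp(log_tau)[:, None]   # per-symptom sharpness
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)         # row-wise softmax
    return w @ V, w                              # pooled evidence, weights
```

A smaller temperature concentrates a symptom's attention on fewer utterances, which is the mechanism the paper uses to model symptoms that surface in brief, localized cues versus those spread across the session.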
cs.CL / 26 / 2602.15620

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

STAPO:通过抑制稀有伪令牌来稳定大型语言模型的强化学习
Liu, Shiqi, He, Zeyu, Zhan, Guojian, Tao, Letian, Zheng, Zhilong, Wu, Jiang, Wang, Yinuo, Guan, Yang, Sheng, Kehua, Zhang, Bo, Li, Keqiang, Duan, Jingliang, Li, Shengbo Eben
Abstract
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
Chinese Translation
强化学习(RL)显著提升了大型语言模型的推理能力,但现有的RL微调方法在维持稳定性方面过于依赖启发式技术,如熵正则化和重加权。在实践中,它们往往会经历后期性能崩溃,导致推理质量下降和训练不稳定。我们推导出,RL中逐令牌策略梯度的大小与令牌概率和局部策略熵呈负相关。基于这一结果,我们证明训练不稳定性是由一小部分令牌驱动的,约占0.01%,我们称之为伪令牌(spurious tokens)。当这些令牌出现在正确的响应中时,它们对推理结果的贡献微乎其微,但却继承了完整的序列级奖励,导致异常放大的梯度更新。受此观察的启发,我们提出了伪令牌感知策略优化(Spurious-Token-Aware Policy Optimization, STAPO)用于大规模模型的精炼,该方法选择性地屏蔽这些更新,并对有效令牌的损失进行重新归一化。在使用Qwen 1.7B、8B和14B基础模型的六个数学推理基准测试中,STAPO始终展现出优越的熵稳定性,并在GRPO、20-Entropy和JustRL的基础上实现了平均7.13%的性能提升。
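The masking-and-renormalization step can be illustrated with a minimal sketch; the probability threshold and the REINFORCE-style per-token loss are assumptions for illustration, since the paper derives its own rule for detecting spurious tokens:

```python
import numpy as np

def masked_policy_loss(logprobs, advantages, prob_threshold=1e-4):
    """Sketch of the masking idea: tokens whose policy probability falls
    below a (hypothetical) threshold are treated as spurious and dropped
    from the policy-gradient loss, which is then renormalized over the
    remaining valid tokens. STAPO's actual detection rule may differ."""
    probs = np.exp(logprobs)
    valid = probs >= prob_threshold          # mask out rare spurious tokens
    token_loss = -logprobs * advantages      # REINFORCE-style per-token loss
    return (token_loss * valid).sum() / max(valid.sum(), 1)
```

The point of the renormalization is that a single very-low-probability token would otherwise contribute an outsized term to the sequence loss despite inheriting the same sequence-level reward as every other token.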
cs.CL / 27 / 2602.15675

LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models

LLM到语音:用于训练方言文本到语音模型的合成数据管道
Khamis, Ahmed Khaled, Ali, Hesham
Abstract
Despite the advances in neural text to speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely understood Arabic dialect -- severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLM) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
Chinese Translation
尽管神经文本到语音(TTS)技术取得了进展,但许多阿拉伯方言仍然被边缘化,大多数资源集中在现代标准阿拉伯语(MSA)和海湾方言上,导致埃及阿拉伯语——最广泛理解的阿拉伯方言——严重缺乏资源。我们通过引入NileTTS来填补这一空白:该数据集包含来自两位讲者的38小时转录语音,涵盖医疗、销售和一般对话等多个领域。我们使用一种新颖的合成管道构建该数据集:大型语言模型(LLM)生成埃及阿拉伯语内容,然后使用音频合成工具将其转换为自然语音,接着进行自动转录和说话者分离,并经过人工质量验证。我们在该数据集上微调了XTTS v2,一个最先进的多语言TTS模型,并与在其他阿拉伯方言上训练的基线模型进行评估。我们的贡献包括:(1)首个公开可用的埃及阿拉伯语TTS数据集,(2)可重复的方言TTS合成数据生成管道,以及(3)一个开源的微调模型。所有资源的发布旨在推动埃及阿拉伯语语音合成研究。
cs.CL / 28 / 2602.15678

Revisiting Northrop Frye's Four Myths Theory with Large Language Models

重新审视诺思罗普·弗莱的四种神话理论与大型语言模型
de Lima, Edirlei Soares, Casanova, Marco A., Furtado, Antonio L.
Abstract
Northrop Frye's theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather than character functions. In this paper, we present a new character function framework that complements pattern-based analysis by examining how archetypal roles manifest differently across Frye's genres. Drawing on Jungian archetype theory, we derive four universal character functions (protagonist, mentor, antagonist, companion) by mapping them to Jung's psychic structure components. These functions are then specialized into sixteen genre-specific roles based on prototypical works. To validate this framework, we conducted a multi-model study using six state-of-the-art Large Language Models (LLMs) to evaluate character-role correspondences across 40 narrative works. The validation employed both positive samples (160 valid correspondences) and negative samples (30 invalid correspondences) to evaluate whether models both recognize valid correspondences and reject invalid ones. LLMs achieved substantial performance (mean balanced accuracy of 82.5%) with strong inter-model agreement (Fleiss' $\kappa$ = 0.600), demonstrating that the proposed correspondences capture systematic structural patterns. Performance varied by genre (ranging from 72.7% to 89.9%) and role (52.5% to 99.2%), with qualitative analysis revealing that variations reflect genuine narrative properties, including functional distribution in romance and deliberate archetypal subversion in satire. This character-based approach demonstrates the potential of LLM-supported methods for computational narratology and provides a foundation for future development of narrative generation methods and interactive storytelling applications.
Chinese Translation
诺思罗普·弗莱的四种基本叙事类型理论(喜剧、浪漫、悲剧、讽刺)对文学批评产生了深远的影响,但对其框架的计算方法主要集中在叙事模式而非角色功能上。本文提出了一种新的角色功能框架,通过考察原型角色在弗莱的不同类型中如何表现,来补充基于模式的分析。基于荣格的原型理论,我们将四种普遍角色功能(主角、导师、反派、伴侣)映射到荣格的心理结构组件。随后,这些功能根据典型作品被细分为十六种特定类型的角色。为了验证该框架,我们使用六种最先进的大型语言模型(LLMs)进行了多模型研究,以评估40部叙事作品中的人物-角色对应关系。验证过程中使用了正样本(160个有效对应关系)和负样本(30个无效对应关系),以评估模型是否能够识别有效对应关系并拒绝无效的对应关系。LLMs在性能上取得了显著的结果(平均平衡准确率为82.5%),并且模型间一致性强(Fleiss' $\kappa$ = 0.600),证明所提出的对应关系捕捉到了系统性的结构模式。性能因类型(范围从72.7%到89.9%)和角色(52.5%到99.2%)而异,定性分析显示这些变化反映了真实的叙事特性,包括浪漫中的功能分布和讽刺中的故意原型颠覆。这种基于角色的方法展示了LLM支持的计算叙事学方法的潜力,并为未来叙事生成方法和互动故事应用的发展奠定了基础。
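Balanced accuracy is the natural score here because the positive samples (160 valid correspondences) far outnumber the negatives (30 invalid ones); a minimal sketch, with illustrative confusion counts rather than the paper's results:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity: sensitivity measures how many
    valid correspondences are accepted, specificity how many invalid ones
    are rejected, so a model that accepts everything scores only 0.5."""
    sensitivity = tp / (tp + fn)   # valid correspondences accepted
    specificity = tn / (tn + fp)   # invalid correspondences rejected
    return 0.5 * (sensitivity + specificity)
```

With the 160/30 split, plain accuracy would reward a degenerate accept-everything model (160/190 ≈ 0.84), whereas balanced accuracy pins it at 0.5.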
cs.CL / 29 / 2602.15689

A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models

基于内容的网络安全拒绝决策框架在大型语言模型中的应用
Segal, Meirav, Linder, Noa, Antverg, Omer, Gekker, Gil, Fichman, Tomer, Bodenheimer, Omri, Maor, Edan, Nevo, Omer
Abstract
Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and are brittle under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
Chinese Translation
大型语言模型及基于LLM的代理在网络安全任务中越来越多地被使用,而这些任务本质上具有双重用途。现有的拒绝策略,包括学术政策框架和商业部署系统,通常依赖于广泛的主题禁令或以攻击为中心的分类法。因此,它们可能导致不一致的决策,过度限制合法防御者,并在模糊化或请求分段的情况下表现脆弱。我们认为,有效的拒绝决策需要明确建模攻击风险与防御收益之间的权衡,而不仅仅依赖于意图或攻击分类。在本文中,我们提出了一种基于内容的框架,用于设计和审计网络拒绝政策,使攻击与防御的权衡变得明确。该框架从五个维度对请求进行特征化:攻击行为贡献、攻击风险、技术复杂性、防御收益和合法用户的预期频率,基于请求的技术实质而非表面意图。我们证明,这种基于内容的方法解决了当前前沿模型行为中的不一致性,并允许组织构建可调节的、风险意识的拒绝政策。
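A minimal sketch of how the five dimensions might be aggregated into a tunable refusal policy; the paper specifies the dimensions themselves, while this particular weighted tradeoff (and the idea of thresholding the resulting score) is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class RequestProfile:
    """The framework's five content-based dimensions, scored in [0, 1]."""
    offensive_action_contribution: float
    offensive_risk: float
    technical_complexity: float
    defensive_benefit: float
    legitimate_frequency: float

def refusal_score(p: RequestProfile, w_offense=1.0, w_defense=1.0):
    """One hypothetical aggregation: offensive dimensions push toward
    refusal, while defensive benefit and expected legitimate-use
    frequency push against it. Tuning the weights yields different
    organizational policies; refuse when the score exceeds a threshold."""
    offense = p.offensive_action_contribution + p.offensive_risk
    defense = p.defensive_benefit + p.legitimate_frequency
    return w_offense * offense - w_defense * defense
```

The point of making the tradeoff explicit is that a request with high defensive benefit and high legitimate frequency can score below the refusal threshold even when it touches offensive topics, which topic-based bans cannot express.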
cs.CL / 30 / 2602.15716

Rethinking Metrics for Lexical Semantic Change Detection

重新思考词汇语义变化检测的度量
Goworek, Roksana, Dubossarsky, Haim
Abstract
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT). We introduce Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), new measures that quantify semantic change via local correspondence between word usages across time periods. Across multiple languages, encoder models, and representation spaces, we show that AMD often provides more robust performance, particularly under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders. We suggest that LSCD may benefit from considering alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualised embedding-based analysis.
Chinese Translation
词汇语义变化检测(LSCD)越来越依赖于上下文化的语言模型嵌入,然而大多数方法仍然使用一小组语义变化度量来量化变化,主要是平均成对距离(Average Pairwise Distance, APD)和基于词原型的余弦距离(cosine distance over word prototypes, PRT)。我们引入了平均最小距离(Average Minimum Distance, AMD)和对称平均最小距离(Symmetric Average Minimum Distance, SAMD),这两种新度量通过跨时间段的词用法之间的局部对应关系来量化语义变化。在多种语言、编码器模型和表示空间中,我们展示了AMD通常提供更强的性能,特别是在降维和使用非专业编码器时,而SAMD在使用专业编码器时表现优异。我们建议LSCD可以考虑APD和PRT之外的替代语义变化度量,AMD为基于上下文化嵌入的分析提供了一个稳健的选择。
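Assuming AMD matches each usage embedding in one period to its nearest usage in the other (a Chamfer-style local correspondence) and SAMD symmetrizes the two directions, the metrics could be sketched as follows next to the standard APD baseline; the exact definitions in the paper may differ:

```python
import numpy as np

def _cosine_dist(A, B):
    # pairwise cosine distances between usage embeddings (rows)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T

def apd(X, Y):
    # Average Pairwise Distance: mean over all cross-period pairs
    return _cosine_dist(X, Y).mean()

def amd(X, Y):
    # Average Minimum Distance (assumed definition): each usage in the
    # first period is matched to its nearest usage in the second period
    return _cosine_dist(X, Y).min(axis=1).mean()

def samd(X, Y):
    # Symmetric variant: average the directed AMD in both directions
    return 0.5 * (amd(X, Y) + amd(Y, X))
```

Because AMD keeps only nearest-neighbor distances, unchanged senses that persist across both periods contribute near-zero terms, which is one plausible reason for its robustness under dimensionality reduction.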
cs.CL / 31 / 2602.15730

Causal Effect Estimation with Latent Textual Treatments

潜在文本处理的因果效应估计
Feldman, Omri, Venugopal, Amar, Spiess, Jann, Feder, Amir
Abstract
Understanding the causal effects of text on downstream outcomes is a central task in many applications. Estimating such effects requires researchers to run controlled experiments that systematically vary textual features. While large language models (LLMs) hold promise for generating text, producing and evaluating controlled variation requires more careful attention. In this paper, we present an end-to-end pipeline for the generation and causal estimation of latent textual interventions. Our work first performs hypothesis generation and steering via sparse autoencoders (SAEs), followed by robust causal estimation. Our pipeline addresses both computational and statistical challenges in text-as-treatment experiments. We demonstrate that naive estimation of causal effects suffers from significant bias as text inherently conflates treatment and covariate information. We describe the estimation bias induced in this setting and propose a solution based on covariate residualization. Our empirical results show that our pipeline effectively induces variation in target features and mitigates estimation error, providing a robust foundation for causal effect estimation in text-as-treatment settings.
Chinese Translation
理解文本对下游结果的因果效应是许多应用中的核心任务。估计这种效应要求研究人员进行控制实验,以系统地变化文本特征。尽管大型语言模型(LLMs)在生成文本方面具有潜力,但产生和评估控制变异需要更加细致的关注。在本文中,我们提出了一种用于潜在文本干预生成和因果估计的端到端流程。我们的工作首先通过稀疏自编码器(SAEs)进行假设生成和引导,随后进行稳健的因果估计。我们的流程解决了文本作为处理实验中的计算和统计挑战。我们证明,因果效应的简单估计存在显著偏差,因为文本本质上混淆了处理和协变量信息。我们描述了在这种情况下引入的估计偏差,并提出了一种基于协变量残差化的解决方案。我们的实证结果表明,我们的流程有效地引入了目标特征的变异,并减轻了估计误差,为文本作为处理设置中的因果效应估计提供了稳健的基础。
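The covariate-residualization fix can be illustrated with a simple linear sketch; the paper's estimator operates on text-derived treatments, so this synthetic scalar version only conveys the general idea that removing covariate-explained variation from the treatment de-biases the effect estimate:

```python
import numpy as np

def residualize(treatment, covariates):
    """OLS-residualize the treatment on covariates, keeping only the
    variation not explained by them (a simple linear sketch)."""
    X = np.column_stack([np.ones(len(treatment)), covariates])
    beta, *_ = np.linalg.lstsq(X, treatment, rcond=None)
    return treatment - X @ beta

def effect_estimate(treatment_resid, outcome):
    # slope of the outcome on the residualized treatment; no intercept
    # is needed since the residuals are mean-zero by construction
    return (treatment_resid @ outcome) / (treatment_resid @ treatment_resid)
```

On synthetic data where the treatment conflates covariate information, the naive slope is pulled toward the covariate's effect, while the residualized estimate recovers the true treatment effect.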
cs.CL / 32 / 2602.15753

Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

对资源匮乏语言的资源匮乏研究:使用大型语言模型(LLM)注释器进行历史亚美尼亚语、格鲁吉亚语、希腊语和叙利亚语的词形还原和词性标注
Vidal-Gorène, Chahan, Kindt, Bastien, Cafiero, Florian
Abstract
Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
Chinese Translation
低资源语言在词形还原和词性(POS)标注等自然语言处理任务中面临持续挑战。本文研究了最近的大型语言模型(LLMs),包括GPT-4变体和开放权重的Mistral模型,在少量样本(few-shot)和零样本(zero-shot)设置下处理这四种历史上和语言上多样的低资源语言的能力:古希腊语、古典亚美尼亚语、古格鲁吉亚语和叙利亚语。我们使用一个新颖的基准,包括对齐的训练和域外测试语料库,评估基础模型在词形还原和词性标注方面的表现,并将其与任务特定的RNN基线PIE进行比较。我们的结果表明,即使在没有微调的情况下,LLMs在大多数语言的词性标注和词形还原任务中也能在少量样本设置下实现具有竞争力或更优的表现。对于具有复杂形态和非拉丁文字的语言,仍然存在显著挑战,但我们证明了LLMs在缺乏数据的情况下是启动语言注释任务的可信且相关的选择,能够有效辅助注释工作。
cs.CL / 33 / 2602.15757

Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos

超越二元分类:在社交媒体视频中检测细粒度性别歧视
De Grazia, Laura, Villegas, Danae Sánchez, Elliott, Desmond, Farrús, Mireia, Taulé, Mariona
Abstract
Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
Chinese Translation
在线性别歧视以多种形式出现,这使得其检测变得具有挑战性。尽管自动化工具可以增强对性别歧视内容的识别,但它们通常仅限于二元分类。因此,由于缺乏细粒度的、上下文敏感的标签,更微妙的性别歧视表现可能会被忽视。为了解决这一问题,我们做出了以下贡献:(1) 我们提出了FineMuSe,这是一个新的西班牙语多模态性别歧视检测数据集,包含二元和细粒度注释;(2) 我们引入了一个全面的层次分类法,涵盖性别歧视、非性别歧视以及讽刺和幽默的修辞手法;(3) 我们评估了多种大型语言模型(LLMs)在二元和细粒度性别歧视检测中的表现。我们的研究结果表明,多模态LLMs在识别细微的性别歧视形式方面与人类注释者的表现相当;然而,当这些性别歧视通过视觉线索传达时,它们在捕捉共现的性别歧视类型方面存在困难。
cs.CL / 34 / 2602.15758

ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models

ChartEditBench:评估多模态语言模型中的基础多轮图表编辑
Kapadnis, Manav Nitin, Baghel, Lawanya, Naik, Atharva, Rosé, Carolyn
Abstract
While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
Chinese Translation
尽管多模态大型语言模型(MLLMs)在单轮图表生成方面表现出色,但它们在支持真实世界探索性数据分析方面的能力仍然未得到充分探索。在实践中,用户通过多轮交互迭代地完善可视化,这需要维持共同基础、跟踪先前的编辑并适应不断变化的偏好。我们引入了ChartEditBench,这是一个通过代码进行增量、视觉基础的图表编辑基准,包含5000个难度控制的修改链和一个经过严格人工验证的子集。与之前的一次性基准不同,ChartEditBench评估持续的、上下文感知的编辑。我们进一步提出了一个强大的评估框架,通过整合基于执行的保真度检查、像素级视觉相似性和逻辑代码验证,来减轻LLM作为评判者指标的局限性。与最先进的MLLMs的实验表明,由于错误累积和共享上下文的崩溃,多轮设置中存在显著的性能下降,在风格编辑方面表现良好,但在数据中心转换时频繁出现执行失败。ChartEditBench为基础的、意图感知的多模态编程建立了一个具有挑战性的测试平台。
cs.CL / 35 / 2602.15769

ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution

ViTaB-A:评估多模态大型语言模型在视觉表格归属中的表现
Alqurnawi, Yahia, Biswas, Preetom, Rao, Anmol, Anvekar, Tejas, Baral, Chitta, Gupta, Vivek
Abstract
Multimodal Large Language Models (mLLMs) are often used to answer questions in structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation, which is the ability of the models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower, near random for JSON inputs, across all models. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.
Chinese Translation
多模态大型语言模型(mLLMs)通常用于回答结构化数据中的问题,例如Markdown、JSON格式的表格和图像。尽管这些模型往往能够给出正确的答案,但用户也需要了解这些答案的来源。在本研究中,我们探讨了结构化数据的归属/引用能力,即模型指向支持答案的特定行和列的能力。我们评估了几种mLLMs在不同表格格式和提示策略下的表现。我们的结果显示,问题回答与证据归属之间存在明显差距。尽管问题回答的准确性保持在中等水平,但归属准确性则明显较低,对于JSON输入几乎接近随机,所有模型均是如此。我们还发现,模型在引用行时比引用列更可靠,并且在处理文本格式时的表现较图像格式更差。最后,我们观察到不同模型家族之间存在显著差异。总体而言,我们的研究结果表明,当前的mLLMs在提供细粒度、可信赖的结构化数据归属方面不够可靠,这限制了它们在需要透明性和可追溯性的应用中的使用。
cs.CL / 36 / 2602.15778

*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

*-PLUIE:基于大语言模型的可个性化度量以改善评估
Lemesle, Quentin, Jourdan, Léane, Munson, Daisy, Alain, Pierre, Chevelu, Jonathan, Delhay, Arnaud, Lolive, Damien
Abstract
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
Chinese Translation
自动生成文本的质量评估通常依赖于大语言模型作为评判者(LLM-judge)的方法。尽管这些方法有效,但它们计算成本高且需要后处理。为了解决这些局限性,我们在ParaPLUIE的基础上进行改进,ParaPLUIE是一种基于困惑度的LLM-judge度量,能够在不生成文本的情况下估计对“是/否”答案的信心。我们引入了*-PLUIE,这是ParaPLUIE的任务特定提示变体,并评估其与人类判断的一致性。我们的实验表明,个性化的*-PLUIE在保持低计算成本的同时,与人类评分之间的相关性更强。
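A judge that scores confidence over the two answer tokens without generating text might look like the following sketch: a softmax over the log-probabilities the LLM assigns to "Yes" and "No". This is an illustrative normalization, not ParaPLUIE's exact formula:

```python
import math

def yes_no_confidence(logprob_yes, logprob_no):
    """Confidence that the judge's answer is 'Yes', computed from the
    log-probabilities an LLM assigns to the two answer tokens. Because
    only two forward-pass scores are compared, no text needs to be
    generated or post-processed (the cost advantage the paper targets)."""
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)
```

Subtracting the maximum before exponentiating keeps the computation numerically stable even for very negative log-probabilities.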
cs.CL / 37 / 2602.15814

Avey-B

Avey-B
Acharya, Devang, Hammoud, Mohammad
Abstract
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
Chinese Translation
紧凑的预训练双向编码器在计算和内存预算紧张的工业自然语言处理(NLP)中仍然是核心支柱。它们的有效性源于自注意力机制能够以序列级并行性提供高质量的双向上下文化,这一点在BERT风格的架构中得到了广泛推广。最近,Avey被引入作为一种自回归、无注意力的替代方案,自然支持仅编码器形式的改编。在本文中,我们为仅编码器范式重新构建了Avey,并提出了对其架构的几项创新,包括解耦的静态和动态参数化、以稳定性为导向的归一化以及神经压缩。结果表明,这种重新构建的架构与四种广泛使用的基于Transformer的编码器相比表现良好,在标准的标记分类和信息检索基准测试中持续超越它们,同时能更高效地扩展到长上下文。