Daily Research Digest

arXiv Papers

2026-03-06
301 papers · 4 categories · 301 translated
Robotics (55 papers)
cs.RO / 1 / 2603.04463

GAIDE: Graph-based Attention Masking for Spatial- and Embodiment-aware Motion Planning

Soleymanzadeh, Davood, Liang, Xiao, Zheng, Minghui
Abstract
Sampling-based motion planning algorithms are widely used for motion planning of robotic manipulators, but they often struggle with sample inefficiency in high-dimensional configuration spaces due to their reliance on uniform or hand-crafted informed sampling primitives. Neural informed samplers address this limitation by learning the sampling distribution from prior planning experience to guide the motion planner towards the planning goal. However, existing approaches often struggle to encode the spatial structure inherent in motion planning problems. To address this limitation, we introduce Graph-based Attention Masking for Spatial- and Embodiment-aware Motion Planning (GAIDE), a neural informed sampler that leverages both the spatial structure of the planning problem and the robotic manipulator's embodiment to guide the planning algorithm. GAIDE represents these structures as a graph and integrates it into a transformer-based neural sampler through attention masking. We evaluate GAIDE against baseline state-of-the-art sampling-based planners using uniform sampling, hand-crafted informed sampling, and neural informed sampling primitives. Evaluation results demonstrate that GAIDE improves planning efficiency and success rate.
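The abstract does not spell out GAIDE's architecture, but the general mechanism it names, turning a graph into an attention mask so each token attends only along graph edges, can be sketched in plain Python. The adjacency matrix and attention scores below are invented for illustration; a real transformer would apply the same additive mask before its softmax.

```python
import math

def masked_attention_weights(scores, adjacency):
    """Softmax over attention scores, with non-adjacent pairs
    suppressed by an additive -inf mask (standard transformer masking)."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if adjacency[i][j] else float("-inf")
                  for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) if s != float("-inf") else 0.0 for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy 3-node graph: node 0 attends to {0, 1}, node 1 to all, node 2 to {1, 2}.
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]
scores = [[0.5, 1.0, 2.0],
          [0.2, 0.2, 0.2],
          [3.0, 1.0, 1.0]]
W = masked_attention_weights(scores, adjacency)
```

Note how the mask zeroes attention across missing edges regardless of how large the raw score is (node 0's score of 2.0 toward node 2 contributes nothing).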
cs.RO / 2 / 2603.04466

Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation

Kumar, Vaishak
Abstract
Can a multimodal language model learn to manipulate physical objects by reasoning about its own failures, without gradient updates, demonstrations, or reward engineering? We argue the answer is yes, under conditions we characterise precisely. We present Act-Observe-Rewrite (AOR), a framework in which an LLM agent improves a robot manipulation policy by synthesising entirely new executable Python controller code between trials, guided by visual observations and structured episode outcomes. Unlike prior work that grounds LLMs in pre-defined skill libraries or uses code generation for one-shot plan synthesis, AOR makes the full low-level motor control implementation the unit of LLM reasoning, enabling the agent to change not just what the robot does, but how it does it. The central claim is that interpretable code as the policy representation creates a qualitatively different kind of in-context learning from opaque neural policies: the agent can diagnose systematic failures and rewrite their causes. We validate this across three robosuite manipulation tasks and report promising results, with the agent achieving high success rates without demonstrations, reward engineering, or gradient updates.
cs.RO / 3 / 2603.04470

Efficient Autonomous Navigation of a Quadruped Robot in Underground Mines on Edge Hardware

Gao, Yixiang, Awuah-Offei, Kwame
Abstract
Embodied navigation in underground mines faces significant challenges, including narrow passages, uneven terrain, near-total darkness, GPS-denied conditions, and limited communication infrastructure. While recent learning-based approaches rely on GPU-accelerated inference and extensive training data, we present a fully autonomous navigation stack for a Boston Dynamics Spot quadruped robot that runs entirely on a low-power Intel NUC edge computer with no GPU and no network connectivity requirements. The system integrates LiDAR-inertial odometry, scan-matching localization against a prior map, terrain segmentation, and visibility-graph global planning with a velocity-regulated local path follower, achieving real-time perception-to-action at consistent control rates. After a single mapping pass of the environment, the system handles arbitrary goal locations within the known map without any environment-specific training or learned components. We validate the system through repeated field trials using four target locations of varying traversal difficulty in an experimental underground mine, accumulating over 700 m of fully autonomous traverse with a 100% success rate across all 20 trials (5 repetitions × 4 targets) and an overall Success weighted by Path Length (SPL) of 0.73 ± 0.09.
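The SPL figure reported above is the standard embodied-navigation metric of Anderson et al. (2018): SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i flags success, l_i is the shortest-path length, and p_i is the path actually traversed. A minimal sketch with invented trial numbers, not the paper's data:

```python
def spl(episodes):
    """Success weighted by Path Length:
    SPL = (1/N) * sum(S_i * l_i / max(p_i, l_i)), with S_i the success
    flag, l_i the shortest-path length, p_i the length actually traversed."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical trials: (success, shortest-path length [m], traversed length [m]).
trials = [(True, 100.0, 120.0), (True, 80.0, 80.0),
          (True, 150.0, 200.0), (False, 90.0, 40.0)]
score = spl(trials)
```

A perfectly efficient successful run contributes 1, a detour is discounted, and a failure contributes 0, which is why an SPL of 0.73 can coexist with a 100% success rate.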
cs.RO / 4 / 2603.04531

PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation

Chen, Rosy, Mukadam, Mustafa, Kaess, Michael, Wu, Tingfan, Hogan, Francois R, Malik, Jitendra, Sharma, Akash
Abstract
Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitive. Alternatively, with reinforcement learning we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. Our experiments demonstrate that PTLD can significantly improve proprioceptive manipulation policies trained in simulation by incorporating tactile sensing. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.
cs.RO / 5 / 2603.04547

Many-RRT*: Robust Joint-Space Trajectory Planning for Serial Manipulators

Belmont, Theodore M., Christie, Benjamin A., Netchaev, Anton
Abstract
The rapid advancement of high degree-of-freedom (DoF) serial manipulators necessitates the use of swift, sampling-based motion planners for high-dimensional spaces. While sampling-based planners like the Rapidly-Exploring Random Tree (RRT) are widely used, planning in the manipulator's joint space presents significant challenges due to non-invertible forward kinematics. A single task-space end-effector pose can correspond to multiple configuration-space states, creating a multi-armed bandit problem for the planner. In complex environments, simply choosing the wrong joint space goal can result in suboptimal trajectories or even failure to find a viable plan. To address this planning problem, we propose Many-RRT*: an extension of RRT*-Connect that plans to multiple goals in parallel. By generating multiple IK solutions and growing independent trees from these goal configurations simultaneously alongside a single start tree, Many-RRT* ensures that computational effort is not wasted on suboptimal IK solutions. This approach maintains robust convergence and asymptotic optimality. Experimental evaluations across robot morphologies and diverse obstacle environments demonstrate that Many-RRT* provides higher quality trajectories (44.5% lower cost in the same runtime) with a significantly higher success rate (100% vs. the next best of 1.6%) than previous RRT iterations without compromising on runtime performance.
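The multiplicity of IK solutions that Many-RRT* grows goal trees from appears already on a planar 2-link arm, where one reachable end-effector position admits elbow-down and elbow-up joint configurations. The sketch below only illustrates that multiplicity, not the planner itself; link lengths and target are arbitrary.

```python
import math

def two_link_ik(x, y, l1, l2):
    """All joint-space solutions (q1, q2) of a planar 2-link arm with
    link lengths l1, l2 reaching end-effector position (x, y).
    Returns the elbow-down and elbow-up branches when reachable."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(c2) > 1:
        return []  # target outside the annular workspace
    sols = []
    for s2 in (math.sqrt(1 - c2 * c2), -math.sqrt(1 - c2 * c2)):
        q2 = math.atan2(s2, c2)
        q1 = math.atan2(y, x) - math.atan2(l2 * s2, l1 + l2 * c2)
        sols.append((q1, q2))
    return sols

def fk(q1, q2, l1, l2):
    """Forward kinematics of the same arm."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))

solutions = two_link_ik(0.9, 0.6, 1.0, 0.8)
```

Both returned configurations map back to the same task-space point under forward kinematics; for a 7-DoF arm the solution set is a continuum, which is what makes committing to a single joint-space goal risky.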
cs.RO / 6 / 2603.04560

From Local Corrections to Generalized Skills: Improving Neuro-Symbolic Policies with MEMO

Christie, Benjamin A., Dai, Yinlong, Bararjanianbahnamiri, Mohammad, Stepputtis, Simon, Losey, Dylan P.
Abstract
Recent works use a neuro-symbolic framework for general manipulation policies. The advantage of this framework is that -- by applying off-the-shelf vision and language models -- the robot can break complex tasks down into semantic subtasks. However, the fundamental bottleneck is that the robot needs skills to ground these subtasks into embodied motions. Skills can take many forms (e.g., trajectory snippets, motion primitives, coded functions), but regardless of their form skills act as a constraint. The high-level policy can only ground its language reasoning through the available skills; if the robot cannot generate the right skill for the current task, its policy will fail. We propose to address this limitation -- and dynamically expand the robot's skills -- by leveraging user feedback. When a robot fails, humans can intuitively explain what went wrong (e.g., "no, go higher"). While a simple approach is to recall this exact text the next time the robot faces a similar situation, we hypothesize that by collecting, clustering, and re-phrasing natural language corrections across multiple users and tasks, we can synthesize more general text guidance and coded skill templates. Applying this hypothesis we develop Memory Enhanced Manipulation (MEMO). MEMO builds and maintains a retrieval-augmented skillbook gathered from human feedback and task successes. At run time, MEMO retrieves relevant text and code from this skillbook, enabling the robot's policy to generate new skills while reasoning over multi-task human feedback. Our experiments demonstrate that using MEMO to aggregate local feedback into general skill templates enables generalization to novel tasks where existing baselines fall short. See supplemental material here: https://collab.me.vt.edu/memo
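The "recall the exact text" baseline mentioned above can be pictured as nearest-neighbor retrieval over stored corrections. The token-overlap similarity and the example corrections below are illustrative stand-ins, not MEMO's actual retrieval mechanism:

```python
def jaccard(a, b):
    """Token-overlap (Jaccard) similarity between two text snippets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def retrieve(skillbook, situation, k=1):
    """Return the k stored corrections most similar to the current
    failure description: the naive recall baseline that MEMO goes
    beyond by clustering and re-phrasing corrections into templates."""
    return sorted(skillbook, key=lambda entry: jaccard(entry, situation),
                  reverse=True)[:k]

# Hypothetical skillbook of past user corrections.
skillbook = ["no, go higher above the mug before grasping",
             "slow down near the edge of the table",
             "rotate the gripper before inserting the plug"]
hit = retrieve(skillbook, "gripper too low above the mug")[0]
```

Exact-text recall only helps when the new failure closely resembles an old one, which is precisely the limitation the clustering-and-rephrasing step is meant to remove.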
cs.RO / 7 / 2603.04571

Distributed State Estimation for Vision-Based Cooperative Slung Load Transportation in GPS-Denied Environments

Pence, Jack R., Fezell, Jackson, Langelaan, Jack W., Geng, Junyi
Abstract
Transporting heavy or oversized slung loads using rotorcraft has traditionally relied on single-aircraft systems, which limits both payload capacity and control authority. Cooperative multilift using teams of rotorcraft offers a scalable and efficient alternative, especially for infrequent but challenging "long-tail" payloads, without the need to build ever-larger rotorcraft. Most prior multilift research assumes GPS availability, uses centralized estimation architectures, or relies on controlled laboratory motion-capture setups. As a result, these methods lack robustness to sensor loss and are not viable in GPS-denied or operationally constrained environments. This paper addresses this limitation by presenting a distributed and decentralized payload state estimation framework for vision-based multilift operations. Using onboard monocular cameras, each UAV detects a fiducial marker on the payload and estimates its relative pose. These measurements are fused via a Distributed and Decentralized Extended Information Filter (DDEIF), enabling robust and scalable estimation that is resilient to individual sensor dropouts. This payload state estimate is then used for closed-loop trajectory tracking control. Monte Carlo simulation results in Gazebo show the effectiveness of the proposed approach, including the effect of communication loss during flight.
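The defining convenience of information (inverse-covariance) filters like the DDEIF is that independent measurements fuse by simple addition of information terms, so a sensor dropout just removes one summand. That property can be shown on a single scalar payload coordinate; the measurements and variances below are invented, and the full EIF (with dynamics and cross-covariances) is omitted:

```python
def fuse_information(measurements):
    """Information-form fusion of independent scalar measurements
    (z_i, var_i): information adds, Y = sum(1/var_i) and
    y = sum(z_i/var_i), giving fused mean y/Y and variance 1/Y."""
    Y = sum(1.0 / var for _, var in measurements)
    y = sum(z / var for z, var in measurements)
    return y / Y, 1.0 / Y

# Three UAVs observe the same payload coordinate with different noise levels.
all_uavs = [(2.00, 0.04), (2.10, 0.09), (1.95, 0.01)]
mean_all, var_all = fuse_information(all_uavs)
# The third UAV's camera loses the fiducial marker: its term simply
# drops out and the fusion degrades gracefully rather than failing.
mean_two, var_two = fuse_information(all_uavs[:2])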
cs.RO / 8 / 2603.04579

Risk-Aware Reinforcement Learning for Mobile Manipulation

Groom, Michael, Wilson, James, Hawes, Nick, Kunze, Lars
Abstract
For robots to successfully transition from lab settings to everyday environments, they must begin to reason about the risks associated with their actions and make informed, risk-aware decisions. This is particularly true for robots performing mobile manipulation tasks, which involve both interacting with and navigating within dynamic, unstructured spaces. However, existing whole-body controllers for mobile manipulators typically lack explicit mechanisms for risk-sensitive decision-making under uncertainty. To our knowledge, we are the first to (i) learn risk-aware visuomotor policies for mobile manipulation conditioned on egocentric depth observations with runtime-adjustable risk sensitivity, and (ii) show risk-aware behaviours can be transferred through Imitation Learning (IL) to a visuomotor policy conditioned on egocentric depth observations. Our method achieves this by first training a privileged teacher policy using Distributional Reinforcement Learning (DRL), with a risk-neutral distributional critic. Distortion risk metrics are then applied to the critic's predicted return distribution to calculate risk-adjusted advantage estimates used in policy updates to achieve a range of risk-aware behaviours. We then distil teacher policies with IL to obtain risk-aware student policies conditioned on egocentric depth observations. We perform extensive evaluations demonstrating that our trained visuomotor policies exhibit risk-aware behaviour (specifically achieving better worst-case performance) while performing reactive whole-body motions in unmapped environments, leveraging live depth observations for perception.
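A common instance of the distortion risk metrics mentioned above is CVaR, which scores a predicted return distribution by the mean of its worst α-fraction; sweeping α is one way to get runtime-adjustable risk sensitivity. An illustrative sketch with made-up quantile samples, not the paper's critic:

```python
def cvar(returns, alpha):
    """Conditional Value-at-Risk of a sampled return distribution:
    the mean of the worst alpha-fraction of outcomes. alpha = 1.0
    recovers the risk-neutral mean; small alpha is risk-averse."""
    ordered = sorted(returns)
    k = max(1, int(round(alpha * len(ordered))))
    worst = ordered[:k]
    return sum(worst) / len(worst)

# Hypothetical quantile samples of a critic's predicted return distribution.
returns = [-5.0, 1.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.0, 10.0]
risk_neutral = cvar(returns, 1.0)   # plain mean of all samples
risk_averse = cvar(returns, 0.2)    # mean of the worst 20%
```

Two policies with the same mean return can have very different CVaR, so optimizing a distorted value pushes the agent toward actions with better worst-case outcomes, matching the behaviour reported in the abstract.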
cs.RO / 9 / 2603.04585

ELLIPSE: Evidential Learning for Robust Waypoints and Uncertainties

Dong, Zihao, Chung, Chanyoung, Kim, Dong-Ki, Maulimov, Mukhtar, Meng, Xiangyun, Khambhaita, Harmish, Agha-mohammadi, Ali-akbar, Shaban, Amirreza
Abstract
Robust waypoint prediction is crucial for mobile robots operating in open-world, safety-critical settings. While Imitation Learning (IL) methods have demonstrated great success in practice, they are susceptible to distribution shifts: the policy can become dangerously overconfident in unfamiliar states. In this paper, we present ELLIPSE, a method building on multivariate deep evidential regression to output waypoints and multivariate Student-t predictive distributions in a single forward pass. To reduce covariate-shift-induced overconfidence under viewpoint and pose perturbations near expert trajectories, we introduce a lightweight domain augmentation procedure that synthesizes plausible viewpoint/pose variations without collecting additional demonstrations. To improve uncertainty reliability under environment/domain shift (e.g., unseen staircases), we apply a post-hoc isotonic recalibration on probability integral transform (PIT) values so that prediction sets remain plausible during deployment. We ground the discussion and experiments in staircase waypoint prediction, where obtaining robust waypoints and uncertainty is pivotal. Extensive real-world evaluations show that ELLIPSE improves both task success rate and uncertainty coverage compared to baselines.
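The recalibration step can be pictured as follows: compute PIT values u_i = F_i(y_i) on held-out data (uniform if the model is calibrated), then fit a monotone map aligning nominal and empirical coverage. Here a simple empirical-CDF map stands in for the isotonic fit (both are monotone step functions), and the PIT values are invented to mimic an overconfident model:

```python
def empirical_cdf_recalibrator(pit_values):
    """Fit a monotone recalibration map from held-out PIT values:
    recal(p) = empirical fraction of PIT values <= p. A step-function
    stand-in for an isotonic regression fit."""
    ordered = sorted(pit_values)
    n = len(ordered)
    def recal(p):
        return sum(1 for u in ordered if u <= p) / n
    return recal

# Overconfident model: PIT values pile up near 0 and 1 instead of uniform.
pit = [0.02, 0.05, 0.08, 0.10, 0.50, 0.90, 0.93, 0.95, 0.97, 0.99]
recal = empirical_cdf_recalibrator(pit)
# The nominal 10% level actually covered 40% of held-out outcomes:
adjusted = recal(0.10)
```

At deployment, prediction sets are formed at the recalibrated levels instead of the nominal ones, so stated coverage matches observed coverage on data resembling the held-out set.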
cs.RO / 10 / 2603.04639

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Dai, Yinpei, Fu, Hongze, Lee, Jayjun, Liu, Yuejiang, Zhang, Haoran, Yang, Jianing, Finn, Chelsea, Fazeli, Nima, Chai, Joyce
Abstract
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
cs.RO / 11 / 2603.04642

Autonomous Aerial Non-Destructive Testing: Ultrasound Inspection with a Commercial Quadrotor in an Unstructured Environment

Veenstra, Ruben, Bazzana, Barbara, Smits, Sander, Franchi, Antonio
Abstract
This work presents an integrated control and software architecture that enables arguably the first fully autonomous, contact-based non-destructive testing (NDT) using a commercial multirotor originally restricted to remotely-piloted operations. To allow autonomous operation with an off-the-shelf platform, we developed a real-time framework that interfaces directly with its onboard sensor suite. The architecture features a multi-rate control scheme: low-level control is executed at 200 Hz, force estimation at 100 Hz, while an admittance filter and trajectory planner operate at 50 Hz, ultimately supplying acceleration and yaw rate commands to the internal flight controller. We validate the system through physical experiments on a Flyability Elios 3 quadrotor equipped with an ultrasound payload. Relying exclusively on onboard sensing, the vehicle successfully performs autonomous NDT measurements within an unstructured, industrial-like environment. This work demonstrates the viability of retrofitting off-the-shelf platforms for autonomous physical interaction, paving the way for safe, contact-based inspection of hazardous and confined infrastructure.
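The multi-rate scheme described above (200 Hz control, 100 Hz force estimation, 50 Hz admittance/planning) is commonly realized as a single loop at the base rate with integer-divider counters. A schematic sketch of that pattern, not the actual Elios 3 software:

```python
def run_multirate(base_hz, seconds, tasks):
    """One loop at the base rate; a task with divider d fires every
    d-th tick, i.e. at base_hz / d Hz. Returns call counts per task."""
    calls = {name: 0 for name, _, _ in tasks}
    for tick in range(int(base_hz * seconds)):
        for name, divider, fn in tasks:
            if tick % divider == 0:
                fn()
                calls[name] += 1
    return calls

# 200 Hz base loop: low-level control every tick, force estimation
# every 2nd tick (100 Hz), admittance filter + planner every 4th (50 Hz).
tasks = [("control", 1, lambda: None),
         ("force_estimation", 2, lambda: None),
         ("admittance_planner", 4, lambda: None)]
calls = run_multirate(200, 1.0, tasks)
```

Choosing rates as integer divisors of the base rate keeps all tasks phase-aligned, so the slower layers always consume the freshest output of the faster ones.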
cs.RO / 12 / 2603.04659

GIANT - Global Path Integration and Attentive Graph Networks for Multi-Agent Trajectory Planning

Sejersen, Jonas le Fevre, Suzumura, Toyotaro, Kayacan, Erdal
Abstract
This paper presents a novel approach to multi-robot collision avoidance that integrates global path planning with local navigation strategies, utilizing attentive graph neural networks to manage dynamic interactions among agents. We introduce a local navigation model that leverages pre-planned global paths, allowing robots to adhere to optimal routes while dynamically adjusting to environmental changes. The model's robustness is enhanced through the introduction of noise during training, resulting in superior performance in complex, dynamic environments. Our approach is evaluated against established baselines, including NH-ORCA, DRL-NAV, and GA3C-CADRL, across various structurally diverse simulated scenarios. The results demonstrate that our model achieves consistently higher success rates, lower collision rates, and more efficient navigation, particularly in challenging scenarios where baseline models struggle. This work offers an advancement in multi-robot navigation, with implications for robust performance in dynamic environments of varying complexity, such as those encountered in logistics, where adaptability is essential for accommodating unforeseen obstacles and unpredictable changes.
cs.RO / 13 / 2603.04668

Python Bindings for a Large C++ Robotics Library: The Case of OMPL

Guo, Weihang, Tyrovouzis, Theodoros, Kavraki, Lydia E.
Abstract
Python bindings are a critical bridge between high-performance C++ libraries and the flexibility of Python, enabling rapid prototyping, reproducible experiments, and integration with simulation and learning frameworks in robotics research. Yet, generating bindings for large codebases is a tedious process that creates a heavy burden for a small group of maintainers. In this work, we investigate the use of Large Language Models (LLMs) to assist in generating nanobind wrappers, with human experts kept in the loop. Our workflow mirrors the structure of the C++ codebase, scaffolds empty wrapper files, and employs LLMs to fill in binding definitions. Experts then review and refine the generated code to ensure correctness, compatibility, and performance. Through a case study on a large C++ motion planning library, we document common failure modes, including mismanaging shared pointers, overloads, and trampolines, and show how in-context examples and careful prompt design improve reliability. Experiments demonstrate that the resulting bindings achieve runtime performance comparable to legacy solutions. Beyond this case study, our results provide general lessons for applying LLMs to binding generation in large-scale C++ projects.
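The scaffolding step ("mirrors the structure of the C++ codebase, scaffolds empty wrapper files") can be sketched as a short script that walks a header tree and emits one stub wrapper per header. The `py_<name>.cpp` naming and the stub contents are made-up conventions for illustration, not what the authors or OMPL actually use:

```python
import tempfile
from pathlib import Path

def scaffold_bindings(cpp_root, out_root, stub="// TODO: nanobind definitions\n"):
    """Mirror a C++ header tree into a parallel tree of stub wrapper
    files (one py_<name>.cpp per header), ready to be filled in by an
    LLM and reviewed by a human expert."""
    cpp_root, out_root = Path(cpp_root), Path(out_root)
    created = []
    for header in sorted(cpp_root.rglob("*.h")):
        rel = header.relative_to(cpp_root)
        wrapper = out_root / rel.parent / f"py_{rel.stem}.cpp"
        wrapper.parent.mkdir(parents=True, exist_ok=True)
        wrapper.write_text(f"// bindings for {rel}\n{stub}")
        created.append(wrapper)
    return created

# Demo on a throwaway tree mimicking a small slice of a planning library.
with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "src"
    (src / "geometric").mkdir(parents=True)
    (src / "base.h").write_text("// PlannerBase")
    (src / "geometric" / "rrt.h").write_text("// RRT")
    made = scaffold_bindings(src, Path(d) / "bindings")
    names = sorted(p.name for p in made)
```

Keeping the wrapper tree isomorphic to the source tree is what lets reviewers locate each binding next to the header it wraps.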
cs.RO / 14 / 2603.04695

Selecting Spots by Explicitly Predicting Intention from Motion History Improves Performance in Autonomous Parking

Chung, Long Kiu, Isele, David, Tariq, Faizan M., Bae, Sangjae, Kousik, Shreyas, D'sa, Jovin
Abstract
In many applications of social navigation, existing works have shown that predicting and reasoning about human intentions can help robotic agents make safer and more socially acceptable decisions. In this work, we study this problem for autonomous valet parking (AVP), where an autonomous vehicle ego agent must drop off its passengers, explore the parking lot, find a parking spot, negotiate for the spot with other vehicles, and park in the spot without human supervision. Specifically, we propose an AVP pipeline that selects parking spots by explicitly predicting where other agents are going to park from their motion history using learned models and probabilistic belief maps. To test this pipeline, we build a simulation environment with reactive agents and realistic modeling assumptions on the ego agent, such as occlusion-aware observations, and imperfect trajectory prediction. Simulation experiments show that our proposed method outperforms existing works that infer intentions from future predicted motion or embed them implicitly in end-to-end models, yielding better results in prediction accuracy, social acceptance, and task completion. Our key insight is that, in parking, where driving regulations are more lax, explicit intention prediction is crucial for reasoning about diverse and ambiguous long-term goals, which cannot be reliably inferred from short-term motion prediction alone, but can be effectively learned from motion history.
cs.RO / 15 / 2603.04705

LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments

He, Ivy Xiao, Tellex, Stefanie, Liu, Jason Xinyu
Abstract
To assist humans in open-world environments, robots must interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at multimodal grounding, but they lack a principled mechanism for modeling uncertainty in long-horizon tasks. In contrast, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty but are often limited in supported modalities and rely on restrictive environment assumptions. We introduce LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location. In simulation, multimodal fusion significantly outperforms unimodal baselines, achieving an average success rate of 89% across challenging environments and object categories. Finally, we demonstrate the full system on a quadruped mobile manipulator, where real-world experiments qualitatively validate robust multimodal perception and uncertainty reduction under ambiguous instructions.
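The core of fusing language and gesture into a belief over object locations is a discrete Bayes update, with each modality contributing its own observation likelihood. The locations and likelihood values below are invented, and the POMDP planning layer on top of the belief is omitted:

```python
def bayes_update(prior, likelihood):
    """Multiply a discrete belief by an observation likelihood and renormalize."""
    posterior = {loc: prior[loc] * likelihood.get(loc, 1.0) for loc in prior}
    z = sum(posterior.values())
    return {loc: p / z for loc, p in posterior.items()}

# Uniform prior over candidate locations for the target object.
belief = {"kitchen": 0.25, "desk": 0.25, "shelf": 0.25, "sofa": 0.25}
# "It's near where I work" (language), then a pointing gesture toward
# the desk/shelf wall; each modality supplies a likelihood per location.
language = {"desk": 0.6, "shelf": 0.3, "kitchen": 0.05, "sofa": 0.05}
gesture = {"desk": 0.5, "shelf": 0.4, "kitchen": 0.05, "sofa": 0.05}
belief = bayes_update(belief, language)
belief = bayes_update(belief, gesture)
best = max(belief, key=belief.get)
```

Neither modality alone pins down the target, but their product concentrates the belief, which is the uncertainty reduction under ambiguous instructions that the abstract reports qualitatively.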
cs.RO / 16 / 2603.04714

Design, Mapping, and Contact Anticipation with 3D-printed Whole-Body Tactile and Proximity Sensors

Kohlbrenner, Carson, Soukhovei, Anna, Escobedo, Caleb, Nechyporenko, Nataliya, Roncone, Alessandro
Abstract
Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space -- the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.
cs.RO / 17 / 2603.04757

Gait Generation Balancing Joint Load and Mobility for Legged Modular Robots with Easily Detachable Joints

平衡关节负载与灵活性的步态生成方法:针对具有易拆卸关节的腿式模块化机器人
Chihara, Kennosuke, Kiyokawa, Takuya, Harada, Kensuke
Abstract
While modular robots offer versatility, excessive joint torque during locomotion poses a significant risk of mechanical failure, especially for detachable joints. To address this, we propose an optimization framework using the NSGA-III algorithm. Unlike conventional approaches that prioritize mobility alone, our method derives Pareto optimal solutions to minimize joint load while maintaining necessary locomotion speed and stability. Simulations and physical experiments demonstrate that our approach successfully generates gait motions for diverse environments, such as slopes and steps, ensuring structural integrity without compromising overall mobility.
Chinese Translation
尽管模块化机器人具有多样性,但在运动过程中,过大的关节扭矩会显著增加机械故障的风险,尤其是对于可拆卸关节。为了解决这一问题,我们提出了一种基于 NSGA-III 算法的优化框架。与传统方法仅关注灵活性不同,我们的方法通过推导帕累托最优解,旨在最小化关节负载,同时保持必要的运动速度和稳定性。仿真和实物实验表明,我们的方法成功地为多种环境(如坡道和台阶)生成步态运动,确保结构完整性而不妨碍整体灵活性。
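The Pareto-optimality notion in the abstract reduces to non-dominated filtering over the competing objectives (joint load vs. locomotion speed). The sketch below shows only that filter; the candidate gaits and their objective values are invented, and NSGA-III's reference-point niching and evolutionary loop are omitted.

```python
def dominates(a, b):
    """a dominates b when a is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the candidates not dominated by any other candidate."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical gait candidates scored as (joint load, -speed): minimize both.
gaits = [(1.0, -0.5), (0.6, -0.3), (0.4, -0.1), (0.8, -0.5), (0.5, -0.05)]
front = pareto_front(gaits)
```

Here (1.0, -0.5) falls out because (0.8, -0.5) reaches the same speed at lower joint load, which is exactly the trade the paper optimizes.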
cs.RO / 18 / 2603.04760

Designing and Validating a Self-Aligning Tool Changer for Modular Reconfigurable Manipulation Robots

为模块化可重构操作机器人设计与验证自对准工具更换器
Maskur, Mahfudz, Kiyokawa, Takuya, Harada, Kensuke
Abstract
Modular reconfigurable robots require reliable mechanisms for automated module exchange, but conventional rigid active couplings often fail due to inevitable positioning and orientational errors. To address this, we propose a misalignment-tolerant tool-changing system. The hardware features a motor-driven coupling utilizing passive self-alignment geometries, specifically chamfered receptacles and triangular lead-in guides, to robustly compensate for angular and lateral misalignments without complex force sensors. To make this autonomous exchange practically feasible, the mechanism is complemented by a compact rotating tool exchange station for efficient module storage. Real-world autonomous tool-picking experiments validate that the self-aligning features successfully absorb execution errors, enabling highly reliable robotic tool reconfiguration.
Chinese Translation
模块化可重构机器人需要可靠的机制来实现自动模块交换,但传统的刚性主动耦合常因不可避免的定位和方向误差而失效。为了解决这一问题,我们提出了一种耐错位的工具更换系统。该硬件采用电机驱动的耦合,利用被动自对准几何结构,特别是倒角插座和三角导入导轨,能够稳健地补偿角度和横向错位,而无需复杂的力传感器。为了使这一自主交换在实际中可行,该机制配备了一个紧凑的旋转工具交换站,以实现高效的模块存储。实际的自主工具拾取实验验证了自对准特性成功吸收执行误差,从而实现高度可靠的机器人工具重构。
cs.RO / 19 / 2603.04761

Adaptive Policy Switching of Two-Wheeled Differential Robots for Traversing over Diverse Terrains

两轮差动机器人在多样地形上行驶的自适应策略切换
Izawa, Haruki, Takai, Takeshi, Kitano, Shingo, Miyaguchi, Mikita, Kawashima, Hiroaki
Abstract
Exploring lunar lava tubes requires robots to traverse without human intervention. Because pre-trained policies cannot fully cover all possible terrain conditions, our goal is to enable adaptive policy switching, where the robot selects an appropriate terrain-specialized model based on its current terrain features. This study investigates whether terrain types can be estimated effectively using posture-related observations collected during navigation. We fine-tuned a pre-trained policy using Proximal Policy Optimization (PPO), and then collected the robot's 3D orientation data as it moved across flat and rough terrain in a simulated lava-tube environment. Our analysis revealed that the standard deviation of the robot's pitch data shows a clear difference between these two terrain types. Using Gaussian mixture models (GMM), we evaluated terrain classification across various window sizes. An accuracy of more than 98% was achieved when using a 70-step window. The result suggests that short-term orientation data are sufficient for reliable terrain estimation, providing a foundation for adaptive policy switching.
Chinese Translation
探索月球熔岩管需要机器人在没有人类干预的情况下进行穿越。由于预训练的策略无法完全覆盖所有可能的地形条件,我们的目标是实现自适应策略切换,使机器人能够根据当前地形特征选择合适的地形专用模型。本研究探讨了是否可以有效地利用在导航过程中收集的与姿态相关的观察数据来估计地形类型。我们使用近端策略优化(Proximal Policy Optimization, PPO)对预训练策略进行了微调,然后在模拟的熔岩管环境中收集了机器人在平坦和粗糙地形上移动时的三维方向数据。我们的分析显示,机器人的俯仰数据的标准差在这两种地形类型之间存在明显差异。通过使用高斯混合模型(Gaussian Mixture Models, GMM),我们评估了不同窗口大小下的地形分类。当使用70步窗口时,准确率超过98%。结果表明,短期的方向数据足以进行可靠的地形估计,为自适应策略切换提供了基础。
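The terrain-estimation recipe in the abstract (slide a window over pitch readings, take the standard deviation, cluster the statistics) can be sketched as follows. The synthetic pitch traces are illustrative, and a simple one-dimensional two-cluster split stands in for the paper's Gaussian mixture model.

```python
import random
import statistics

random.seed(0)
flat  = [random.gauss(0.0, 0.01) for _ in range(300)]   # low pitch variance
rough = [random.gauss(0.0, 0.10) for _ in range(300)]   # high pitch variance

def window_stds(pitch, win=70):
    """Standard deviation of pitch over non-overlapping 70-step windows."""
    return [statistics.stdev(pitch[i:i + win])
            for i in range(0, len(pitch) - win + 1, win)]

def two_means_1d(xs, iters=20):
    """Tiny 1-D two-cluster split (a stand-in for a 2-component GMM)."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        a = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        b = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = sum(a) / len(a), sum(b) / len(b)
    return lo, hi

feats = window_stds(flat) + window_stds(rough)
c_flat, c_rough = two_means_1d(feats)
threshold = (c_flat + c_rough) / 2
accuracy = (sum(s < threshold for s in window_stds(flat)) +
            sum(s >= threshold for s in window_stds(rough))) / len(feats)
```

Because the two terrains differ so strongly in pitch variance, even this crude split separates the windows cleanly, matching the paper's observation that short-term orientation statistics suffice.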
cs.RO / 20 / 2603.04762

LLM-Guided Decentralized Exploration with Self-Organizing Robot Teams

基于大型语言模型引导的自组织机器人团队去中心化探索
Kawashima, Hiroaki, Ikejima, Shun, Takai, Takeshi, Miyaguchi, Mikita, Kunii, Yasuharu
Abstract
When individual robots have limited sensing capabilities or insufficient fault tolerance, it becomes necessary for multiple robots to form teams during exploration, thereby increasing the collective observation range and reliability. Traditionally, swarm formation has often been managed by a central controller; however, from the perspectives of robustness and flexibility, it is preferable for the swarm to operate autonomously even in the absence of centralized control. In addition, the determination of exploration targets for each team is crucial for efficient exploration in such multi-team exploration scenarios. This study therefore proposes an exploration method that combines (1) an algorithm for self-organization, enabling the autonomous and dynamic formation of multiple teams, and (2) an algorithm that allows each team to autonomously determine its next exploration target (destination). In particular, for (2), this study explores a novel strategy based on large language models (LLMs), while classical frontier-based methods and deep reinforcement learning approaches have been widely studied. The effectiveness of the proposed method was validated through simulations involving tens to hundreds of robots.
Chinese Translation
当单个机器人具有有限的感知能力或不足的容错能力时,多个机器人在探索过程中组成团队是必要的,从而增加集体观察范围和可靠性。传统上,群体的形成通常由中央控制器管理;然而,从鲁棒性和灵活性的角度来看,即使在缺乏集中控制的情况下,群体自主操作更为理想。此外,在这种多团队探索场景中,为每个团队确定探索目标对于高效探索至关重要。因此,本研究提出了一种探索方法,结合了(1)自组织算法,使多个团队能够自主和动态地形成,以及(2)允许每个团队自主确定其下一个探索目标(目的地)的算法。特别地,对于(2),本研究探索了一种基于大型语言模型(LLMs)的新策略,而经典的基于边界的方法和深度强化学习方法已被广泛研究。通过涉及数十到数百个机器人的仿真实验验证了所提方法的有效性。
cs.RO / 21 / 2603.04787

Data-Driven Control of a Magnetically Actuated Fish-Like Robot

基于数据驱动的磁致动鱼形机器人控制
Koyama, Akiyuki, Kawashima, Hiroaki
Abstract
Magnetically actuated fish-like robots offer promising solutions for underwater exploration due to their miniaturization and agility; however, precise control remains a significant challenge because of nonlinear fluid dynamics, flexible fin hysteresis, and the variable-duration control steps inherent to the actuation mechanism. This paper proposes a comprehensive data-driven control framework to address these complexities without relying on analytical modeling. Our methodology comprises three core components: 1) developing a forward dynamics model (FDM) using a neural network trained on real-world experimental data to capture state transitions under varying time steps; 2) integrating this FDM into a gradient-based model predictive control (G-MPC) architecture to optimize control inputs for path following; and 3) applying imitation learning to approximate the G-MPC policy, thereby reducing the computational cost for real-time implementation. We validate the approach through simulations utilizing the identified dynamics model. The results demonstrate that the G-MPC framework achieves accurate path convergence with minimal root mean square error (RMSE), and the imitation learning controller (ILC) effectively replicates this performance. This study highlights the potential of data-driven control strategies for the precise navigation of miniature, fish-like soft robots.
Chinese Translation
磁致动鱼形机器人因其小型化和灵活性为水下探索提供了有前景的解决方案;然而,由于非线性流体动力学、柔性鳍的滞后效应以及致动机制固有的可变持续时间控制步骤,精确控制仍然是一个重大挑战。本文提出了一种全面的数据驱动控制框架,以应对这些复杂性,而无需依赖于分析模型。我们的方法论包括三个核心组成部分:1)使用基于真实实验数据训练的神经网络开发前向动力学模型(FDM),以捕捉在不同时间步长下的状态转变;2)将该FDM集成到基于梯度的模型预测控制(G-MPC)架构中,以优化路径跟踪的控制输入;3)应用模仿学习来近似G-MPC策略,从而降低实时实施的计算成本。我们通过利用识别的动力学模型进行的仿真验证了该方法。结果表明,G-MPC框架实现了以最小均方根误差(RMSE)达到准确的路径收敛,而模仿学习控制器(ILC)有效地复制了这一性能。本研究突显了数据驱动控制策略在微型鱼形软机器人精确导航中的潜力。
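The gradient-based MPC component can be sketched on a toy problem: roll a differentiable forward model over a horizon, then descend the gradient of a tracking-plus-effort cost with respect to the control sequence. A hand-derivable single integrator replaces the paper's learned neural forward dynamics model, and the horizon, target, and effort weight are invented.

```python
def rollout(x0, us):
    """Roll the forward model over the horizon (single-integrator stand-in)."""
    xs, x = [], x0
    for u in us:
        x += u
        xs.append(x)
    return xs

def cost(us, target, lam=0.1):
    terminal = (rollout(0.0, us)[-1] - target) ** 2
    effort = lam * sum(u * u for u in us)     # penalize control effort
    return terminal + effort

def grad(us, target, lam=0.1):
    """Analytic gradient: d x_N / d u_j = 1 for this toy model."""
    x_final = rollout(0.0, us)[-1]
    return [2 * (x_final - target) + 2 * lam * u for u in us]

us = [0.0] * 5
for _ in range(200):                          # plain gradient descent
    g = grad(us, target=1.0)
    us = [u - 0.1 * gi for u, gi in zip(us, g)]
final = rollout(0.0, us)[-1]
```

The effort penalty makes the optimum stop slightly short of the target (final = 1/1.02 here); with a learned network as the forward model, the same loop would use automatic differentiation instead of the hand-derived gradient.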
cs.RO / 22 / 2603.04819

On the Strengths and Weaknesses of Data for Open-set Embodied Assistance

开放集具身辅助数据的优势与劣势
Tambwekar, Pradyumna, Silva, Andrew, Gopinath, Deepak, DeCastro, Jonathan, Cui, Xiongyi, Rosman, Guy
Abstract
Embodied foundation models are increasingly performant in real-world domains such as robotics or autonomous driving. These models are often deployed in interactive or assistive settings, where it is important that these assistive models generalize to new users and new tasks. Diverse interactive data generation offers a promising avenue for providing data-efficient generalization capabilities for interactive embodied foundation models. In this paper, we investigate the generalization capabilities of a multimodal foundation model fine-tuned on diverse interactive assistance data in a synthetic domain. We explore generalization along two axes: a) assistance with unseen categories of user behavior and b) providing guidance in new configurations not encountered during training. We study a broad capability called Open-Set Corrective Assistance, in which the model needs to inspect lengthy user behavior and provide assistance through either corrective actions or language-based feedback. This task remains unsolved in prior work, which typically assumes closed corrective categories or relies on external planners, making it a challenging testbed for evaluating the limits of assistive data. To support this task, we generate synthetic assistive datasets in Overcooked and fine-tune a LLaMA-based model to evaluate generalization to novel tasks and user behaviors. Our approach provides key insights into the nature of assistive datasets required to enable open-set assistive intelligence. In particular, we show that performant models benefit from datasets that cover different aspects of assistance, including multimodal grounding, defect inference, and exposure to diverse scenarios.
Chinese Translation
具身基础模型在机器人技术和自动驾驶等现实领域的表现日益出色。这些模型通常在交互或辅助环境中部署,在这些环境中,确保辅助模型能够对新用户和新任务进行泛化至关重要。多样化的交互数据生成为提供数据高效的泛化能力开辟了有希望的途径。本文研究了在合成领域中,针对多样化交互辅助数据微调的多模态基础模型的泛化能力。我们从两个方面探讨泛化能力:a) 对未见用户行为类别的辅助,以及 b) 在训练过程中未遇到的新配置中提供指导。我们研究了一种称为开放集纠正辅助的广泛能力,在此任务中,模型需要检查用户的长时间行为,并通过纠正行动或基于语言的反馈提供辅助。该任务在以往的研究中尚未解决:以往工作通常假设封闭的纠正类别或依赖外部规划器,因此它是评估辅助数据极限的一个具有挑战性的测试平台。为了支持这一任务,我们在《Overcooked》游戏中生成合成辅助数据集,并微调基于LLaMA的模型,以评估对新任务和用户行为的泛化能力。我们的方法为实现开放集辅助智能所需的辅助数据集的性质提供了关键见解。特别是,我们展示了高效模型受益于涵盖辅助不同方面的数据集,包括多模态对齐、缺陷推断以及对多样化场景的接触。
cs.RO / 23 / 2603.04845

Task-Relevant and Irrelevant Region-Aware Augmentation for Generalizable Vision-Based Imitation Learning in Agricultural Manipulation

农业操作中面向可泛化视觉模仿学习的任务相关与任务无关区域感知增强
Hattori, Shun, Sasaki, Hikaru, Hachimine, Takumi, Mizutani, Yusuke, Matsubara, Takamitsu
Abstract
Vision-based imitation learning has shown promise for robotic manipulation; however, its generalization remains limited in practical agricultural tasks. This limitation stems from scarce demonstration data and substantial visual domain gaps caused by i) crop-specific appearance diversity and ii) background variations. To address this limitation, we propose Dual-Region Augmentation for Imitation Learning (DRAIL), a region-aware augmentation framework designed for generalizable vision-based imitation learning in agricultural manipulation. DRAIL explicitly separates visual observations into task-relevant and task-irrelevant regions. The task-relevant region is augmented in a domain-knowledge-driven manner to preserve essential visual characteristics, while the task-irrelevant region is aggressively randomized to suppress spurious background correlations. By jointly handling both sources of visual variation, DRAIL promotes learning policies that rely on task-essential features rather than incidental visual cues. We evaluate DRAIL on diffusion policy-based visuomotor controllers through robot experiments on artificial vegetable harvesting and real lettuce defective leaf picking preparation tasks. The results show consistent improvements in success rates under unseen visual conditions compared to baseline methods. Further attention analysis and representation generalization metrics indicate that the learned policies rely more on task-essential visual features, resulting in enhanced robustness and generalization.
Chinese Translation
基于视觉的模仿学习在机器人操作中展现了良好的前景;然而,其在实际农业任务中的推广能力仍然有限。这一局限性源于稀缺的演示数据以及由i) 作物特定外观多样性和ii) 背景变化引起的显著视觉领域差距。为了解决这一问题,我们提出了面向模仿学习的双区域增强框架(Dual-Region Augmentation for Imitation Learning, DRAIL),该框架旨在实现农业操作中可推广的基于视觉的模仿学习。DRAIL明确将视觉观察分为任务相关区域和任务无关区域。任务相关区域以基于领域知识的方式进行增强,以保留重要的视觉特征,而任务无关区域则被积极随机化,以抑制虚假的背景相关性。通过共同处理这两种视觉变化源,DRAIL促进了学习策略的形成,使其依赖于任务必需的特征,而非偶然的视觉线索。我们通过在人工蔬菜采摘和真实生菜缺陷叶片采摘准备任务上的机器人实验,评估了DRAIL在基于扩散策略的视觉运动控制器上的表现。结果显示,与基线方法相比,在未见视觉条件下成功率有了一致的提升。进一步的注意力分析和表示推广指标表明,学习到的策略更依赖于任务必需的视觉特征,从而增强了鲁棒性和推广能力。
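The dual-region idea can be sketched on a toy grayscale image: pixels the mask marks task-irrelevant are replaced with aggressive noise, while task-relevant pixels receive only mild jitter. The image, mask, and jitter magnitudes here are invented, and the paper's domain-knowledge-driven foreground augmentations are far richer than this.

```python
import random

random.seed(1)

def dual_region_augment(img, mask, fg_jitter=5):
    """Randomize task-irrelevant pixels (mask == 0) aggressively; apply only
    small jitter to task-relevant pixels (mask == 1)."""
    out = []
    for img_row, mask_row in zip(img, mask):
        row = []
        for px, m in zip(img_row, mask_row):
            if m:   # task-relevant region: preserve appearance
                row.append(max(0, min(255, px + random.randint(-fg_jitter, fg_jitter))))
            else:   # task-irrelevant region: break background correlations
                row.append(random.randint(0, 255))
        out.append(row)
    return out

img  = [[100] * 4 for _ in range(4)]      # uniform toy image
mask = [[0, 0, 1, 1] for _ in range(4)]   # right half is task-relevant
aug = dual_region_augment(img, mask)
```

Training on many such augmented views discourages the policy from keying on background pixels, since those carry no stable signal across samples.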
cs.RO / 24 / 2603.04848

Hyperbolic Multiview Pretraining for Robotic Manipulation

用于机器人操作的双曲多视角预训练
Yang, Jin, Wei, Ping, Chen, Yixin
Abstract
3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits their ability to model structural relations among embeddings. As a result, they struggle to learn structured embeddings that are essential for robust spatial perception in robotic applications. To this end, we propose HyperMVP, a self-supervised framework for Hyperbolic MultiView Pretraining. Hyperbolic space offers geometric properties well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then finetuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.
Chinese Translation
具有3D感知的视觉预训练已被证明在提升下游机器人操作任务的性能方面有效。然而,现有方法受限于欧几里得嵌入空间,其平坦的几何特性限制了它们建模嵌入之间结构关系的能力。因此,它们在学习对机器人应用中稳健的空间感知至关重要的结构化嵌入时面临困难。为此,我们提出了HyperMVP,一个自监督框架,用于双曲多视角预训练(Hyperbolic MultiView Pretraining)。双曲空间提供了适合捕捉结构关系的几何特性。在方法上,我们扩展了掩码自编码器范式,并设计了一个GeoLink编码器来学习多视角的双曲表示。然后,预训练的编码器在操作任务上与视觉运动策略进行微调。此外,我们引入了3D-MOV,一个包含多种类型3D点云的大规模数据集,以支持预训练。我们在COLOSSEUM、RLBench和现实场景中评估了HyperMVP,在多样化的任务和扰动设置中,它始终优于强基线。我们的结果突显了在非欧几里得空间中进行3D感知预训练的潜力,以学习稳健且具有可推广性的机器人操作策略。
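The geometric property the abstract appeals to is concrete: in the Poincaré ball model of hyperbolic space, distances grow without bound as points approach the unit boundary, giving tree-like structure room that flat Euclidean space lacks. Below is the standard Poincaré-ball distance formula, shown purely to illustrate that property; it is not the paper's encoder.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between points strictly inside the unit ball
    (Poincare ball model of hyperbolic space)."""
    nu = sum(x * x for x in u)                      # squared norm of u, < 1
    nv = sum(x * x for x in v)
    nd = sum((x - y) ** 2 for x, y in zip(u, v))    # squared Euclidean gap
    return math.acosh(1 + 2 * nd / ((1 - nu) * (1 - nv)))

# The same Euclidean step (0.09) costs far more near the boundary than at
# the origin, which is the extra "room" hyperbolic embeddings exploit.
near_origin   = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.99, 0.0])
```

A useful sanity check is the closed form for distances from the origin, d(0, x) = 2 artanh(|x|), which the general formula reproduces.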
cs.RO / 25 / 2603.04910

VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

VPWEM:具有工作记忆和情节记忆的非马尔可夫视觉运动策略
Lei, Yuheng, Liang, Zhixuan, Zhang, Hongyuan, Luo, Ping
Abstract
Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
Chinese Translation
从人类示范中进行的模仿学习在机器人控制中取得了显著成功,然而大多数视觉运动策略仍然依赖于单步观察或短期上下文历史,使其在需要长期记忆的非马尔可夫任务中表现不佳。简单地扩大上下文窗口会带来巨大的计算和内存成本,并促使模型过拟合虚假的相关性,导致在分布转移下出现灾难性失败,并违反机器人系统中的实时约束。相比之下,人类能够将重要的过去经验压缩为长期记忆,并利用这些记忆在其一生中解决任务。本文提出了VPWEM,一种配备工作记忆和情节记忆的非马尔可夫视觉运动策略。VPWEM保留了一段最近观察标记的滑动窗口作为短期工作记忆,并引入了一种基于Transformer的上下文记忆压缩器,该压缩器递归地将窗口外的观察转换为固定数量的情节记忆标记。该压缩器在过去摘要标记的缓存上使用自注意力,并在历史观察的缓存上使用交叉注意力,并与策略共同训练。我们在扩散策略上实例化VPWEM,以利用短期和情节范围的信息生成动作,且每步的内存和计算几乎保持不变。实验表明,VPWEM在MIKASA中的内存密集型操作任务上以超过20%的优势超越了包括扩散策略和视觉-语言-动作(VLA)模型在内的最先进基线,并在移动操作基准MoMaRT上实现了平均5%的提升。代码可在 https://github.com/HarryLui98/code_vpwem 获取。
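The memory layout the abstract describes (a sliding working-memory window plus a fixed budget of recursively updated episodic slots) can be sketched with running averages standing in for the Transformer-based compressor. The window size, slot count, and round-robin slot assignment below are illustrative choices, not the paper's.

```python
class MemoryPolicyState:
    """Sliding working-memory window plus a fixed number of episodic slots.
    Evicted tokens are folded into the slots with running averages, a crude
    stand-in for the paper's attention-based compressor, so the context the
    policy conditions on stays constant-size per step."""

    def __init__(self, win=4, slots=2):
        self.win = win
        self.working = []                 # short-term window of raw tokens
        self.episodic = [0.0] * slots     # fixed-size long-term summaries
        self.counts = [0] * slots
        self.evictions = 0

    def observe(self, token):
        self.working.append(token)
        if len(self.working) > self.win:
            old = self.working.pop(0)
            slot = self.evictions % len(self.episodic)  # round-robin slots
            self.counts[slot] += 1
            self.episodic[slot] += (old - self.episodic[slot]) / self.counts[slot]
            self.evictions += 1

    def context(self):
        return self.episodic + self.working  # fixed + sliding: bounded size

m = MemoryPolicyState()
for t in range(10):
    m.observe(float(t))
ctx = m.context()
```

After ten observations the context is still six numbers: two episodic summaries of the evicted tokens plus the four most recent tokens, which is the constant-per-step footprint the abstract emphasizes.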
cs.RO / 26 / 2603.04913

Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies via Viewpoint-Consistent 3D Adversarial Object

超越补丁:通过视角一致的3D对抗物体探索视觉运动策略的脆弱性
Lee, Chanmi, Yoon, Minsung, Kim, Woojae, Lee, Sebin, Yoon, Sung-eui
Abstract
Neural network-based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera-object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.
Chinese Translation
基于神经网络的视觉运动策略使机器人能够执行操作任务,但仍然容易受到感知攻击。例如,传统的2D对抗补丁在固定摄像机设置下有效,因为外观相对一致;然而,由于透视失真,它们的有效性在动态视角下(如腕部安装的摄像机)往往会减弱。为了主动探讨超越2D补丁的潜在脆弱性,本研究提出了一种通过可微渲染进行3D物体的视角一致对抗纹理优化方法。作为优化策略,我们采用了基于变换的期望(Expectation over Transformation, EOT)与粗到细(Coarse-to-Fine, C2F)课程,利用距离依赖的频率特征来诱导在不同摄像机-物体距离下有效的纹理。我们进一步整合了显著性引导的扰动,以重新引导策略的注意力,并设计了一种针对性的损失函数,持续驱动机器人朝向对抗物体。我们的综合实验表明,该方法在各种环境条件下有效,同时确认了其黑箱可迁移性和现实世界的适用性。
cs.RO / 27 / 2603.04914

U-OBCA: Uncertainty-Aware Optimization-Based Collision Avoidance via Wasserstein Distributionally Robust Chance Constraints

U-OBCA:基于Wasserstein分布鲁棒机会约束的不确定性感知优化碰撞避免
Wang, Zehao, Tang, Yuxuan, Zhang, Han, Wang, Jingchuan, Chen, Weidong
Abstract
Uncertainties arising from localization errors, trajectory prediction errors of moving obstacles, and environmental disturbances pose significant challenges to the safe navigation of robots. Existing uncertainty-aware planners often approximate polygon-shaped robots and obstacles using simple geometric primitives such as circles or ellipses. Though computationally convenient, these approximations substantially shrink the feasible space, leading to overly conservative trajectories and even planning failure in narrow environments. In addition, many such methods rely on specific assumptions about noise distributions, which may not hold in practice and thus limit their performance guarantees. To address these limitations, we extend the Optimization-Based Collision Avoidance (OBCA) framework to an uncertainty-aware formulation, termed U-OBCA. The proposed method explicitly accounts for the collision risk between polygon-shaped robots and obstacles by formulating OBCA-based chance constraints, thereby avoiding geometric simplifications and reducing unnecessary conservatism. These probabilistic constraints are further tightened into deterministic nonlinear constraints under mild distributional assumptions, which can be solved efficiently by standard numerical optimization solvers. The proposed approach is validated through theoretical analysis, numerical simulations and real-world experiments. The results demonstrate that U-OBCA significantly mitigates the conservatism in trajectory planning and achieves higher navigation efficiency compared to existing baseline methods, particularly in narrow and cluttered environments.
Chinese Translation
来自定位误差、移动障碍物的轨迹预测误差和环境干扰的不确定性对机器人的安全导航构成了重大挑战。现有的不确定性感知规划方法通常使用简单的几何原语(如圆形或椭圆形)来近似多边形形状的机器人和障碍物。尽管这种近似在计算上方便,但它们显著缩小了可行空间,导致过于保守的轨迹,甚至在狭窄环境中出现规划失败。此外,许多此类方法依赖于对噪声分布的特定假设,这在实践中可能不成立,从而限制了它们的性能保证。为了解决这些局限性,我们将基于优化的碰撞避免(OBCA)框架扩展为一种不确定性感知的形式,称为U-OBCA。所提出的方法通过制定基于OBCA的机会约束,明确考虑多边形形状的机器人和障碍物之间的碰撞风险,从而避免几何简化并减少不必要的保守性。这些概率约束在温和的分布假设下进一步收紧为确定性的非线性约束,可以通过标准数值优化求解器高效求解。通过理论分析、数值仿真和实际实验验证了所提出的方法。结果表明,U-OBCA显著减轻了轨迹规划中的保守性,并在狭窄和杂乱的环境中相比现有基线方法实现了更高的导航效率。
cs.RO / 28 / 2603.04932

Integrated cooperative localization of heterogeneous measurement swarm: A unified data-driven method

异构测量群体的集成协作定位:一种统一的数据驱动方法
Ze, Kunrui, Wang, Wei, Sun, Guibin, Yan, Jiaqi, Liu, Kexin, Lü, Jinhu
Abstract
The cooperative localization (CL) problem in heterogeneous robotic systems with different measurement capabilities is investigated in this work. In practice, heterogeneous sensors lead to directed and sparse measurement topologies, whereas most existing CL approaches rely on multilateral localization with restrictive multi-neighbor geometric requirements. To overcome this limitation, we enable pairwise relative localization (RL) between neighboring robots using only mutual measurement and odometry information. A unified data-driven adaptive RL estimator is first developed to handle heterogeneous and unidirectional measurements. Based on the convergent RL estimates, a distributed pose-coupling CL strategy is then designed, which guarantees CL under a weakly connected directed measurement topology, representing the least restrictive condition among existing results. The proposed method is independent of specific control tasks and is validated through a formation control application and real-world experiments.
Chinese Translation
本文研究了异构机器人系统中具有不同测量能力的协作定位(CL)问题。在实际应用中,异构传感器导致了定向和稀疏的测量拓扑,而现有的大多数CL方法依赖于具有限制性多邻居几何要求的多边定位。为了克服这一限制,我们实现了邻近机器人之间的成对相对定位(RL),仅使用互相测量和里程计信息。首先开发了一种统一的数据驱动自适应RL估计器,以处理异构和单向测量。基于收敛的RL估计,设计了一种分布式姿态耦合CL策略,该策略在弱连接的定向测量拓扑下保证CL,代表了现有结果中最不严格的条件。所提出的方法独立于特定控制任务,并通过编队控制应用和实际实验进行了验证。
cs.RO / 29 / 2603.05015

Observer Design for Augmented Reality-based Teleoperation of Soft Robots

面向基于增强现实的软机器人遥操作的观测器设计
García-Samartín, Jorge Francisco, Pérez, Iago López, Yolcu, Emirhan, del Cerro, Jaime, Barrientos, Antonio
Abstract
Although virtual and augmented reality are gaining traction as teleoperation tools for various types of robots, including manipulators and mobile robots, they are not being used for soft robots. The inherent difficulty of modelling soft robots means that combining accuracy and computational efficiency in a single representation is very challenging. This paper presents an augmented reality interface for teleoperating these devices. The developed system consists of Microsoft HoloLens 2 glasses and a central computer responsible for calculations. Validation is performed on PETER, a highly modular pneumatic manipulator. Using data collected from sensors, the computer estimates the robot's position based on the physics of the virtual reality programme. Errors obtained are on the order of 5% of the robot's length, demonstrating that augmented reality facilitates operator interaction with soft manipulators and can be integrated into the control loop.
Chinese Translation
尽管虚拟现实和增强现实作为遥操作工具在各种类型的机器人(包括操纵器和移动机器人)中逐渐受到关注,但它们尚未被应用于软机器人。软机器人建模的固有困难使得准确且计算高效的表示结合变得非常具有挑战性。本文提出了一种用于遥操作这些设备的增强现实接口。所开发的系统由微软 HoloLens 2 眼镜和负责计算的中央计算机组成。验证是在高度模块化的气动操纵器 PETER 上进行的。计算机利用从传感器收集的数据,根据虚拟现实程序的物理模型估计机器人的位置。获得的误差约为机器人长度的 5%,表明增强现实促进了操作者与软操纵器的交互,并可以集成到控制回路中。
cs.RO / 30 / 2603.05017

Direct Contact-Tolerant Motion Planning With Vision Language Models

基于视觉语言模型的直接接触容忍运动规划
Li, He, Sun, Jian, Li, Chengyang, Li, Guoliang, Ruan, Qiyu, Wang, Shuai, Xu, Chengzhong
Abstract
Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt map, obstacle set), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point perception and navigation, including two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints, solved by a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
Chinese Translation
在杂乱环境中的导航通常要求机器人能够容忍与可移动或可变形物体的接触,以保持效率。现有的接触容忍运动规划(CTMP)方法依赖于间接的空间表示(例如,预构建地图、障碍物集合),导致不准确性和对环境不确定性的适应性不足。为了解决这个问题,我们提出了一种直接接触容忍(DCT)规划器,该规划器将视觉语言模型(VLM)集成到直接点感知和导航中,包括两个关键组件。第一个是VLM点云分割器(VPP),它在图像空间中使用VLM进行接触容忍推理,缓存推理掩码,通过里程计在帧之间传播,并将其投影到当前扫描中以生成接触感知点云。第二个创新是VPP引导导航(VGN),它将CTMP公式化为在直接接触感知点云约束下的感知到控制优化问题,并通过专门的深度神经网络(DNN)进一步求解。我们在Isaac Sim和一个真实的类汽车机器人中实现了DCT,证明DCT在具有可移动障碍物的杂乱环境中实现了稳健和高效的导航,超越了多种指标下的代表性基线。代码可在以下链接获取:https://github.com/ChrisLeeUM/DCT。
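The mask-propagation step (carrying a cached, contact-tolerance-labeled observation forward using odometry) amounts to re-expressing labeled points in the current robot frame. A planar SE(2) version is sketched below; the frame convention, points, and labels are illustrative, while the paper's pipeline operates on image masks and full scans.

```python
import math

def to_current_frame(points, dx, dy, dtheta):
    """Re-express points given in a past robot frame in the current frame,
    where (dx, dy, dtheta) is the robot's motion between the two frames."""
    c, s = math.cos(dtheta), math.sin(dtheta)
    out = []
    for x, y in points:
        px, py = x - dx, y - dy                          # undo translation...
        out.append((c * px + s * py, -s * px + c * py))  # ...then rotation
    return out

# Cached VLM labels attached to points observed one frame ago (hypothetical).
cached = [((2.0, 0.0), "movable"), ((0.0, 1.0), "rigid")]
moved = to_current_frame([p for p, _ in cached], dx=1.0, dy=0.0, dtheta=0.0)
propagated = [(pt, lbl) for pt, (_, lbl) in zip(moved, cached)]
```

Driving forward one meter pulls a point two meters ahead to one meter ahead, and its contact-tolerance label rides along without re-querying the VLM, which is what makes the caching cheap.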
cs.RO / 31 / 2603.05070

VinePT-Map: Pole-Trunk Semantic Mapping for Resilient Autonomous Robotics in Vineyards

VinePT-Map:面向葡萄园中韧性自主机器人的杆与树干语义建图
Audrito, Giorgio, Martini, Mauro, Navone, Alessandro, Galluzzo, Giorgia, Chiaberge, Marcello
Abstract
Reliable long-term deployment of autonomous robots in agricultural environments remains challenging due to perceptual aliasing, seasonal variability, and the dynamic nature of crop canopies. Vineyards, characterized by repetitive row structures and significant visual changes across phenological stages, represent a pivotal field challenge, limiting the robustness of conventional feature-based localization and mapping approaches. This paper introduces VinePT-Map, a semantic mapping framework that leverages vine trunks and support poles as persistent structural landmarks to enable season-agnostic and resilient robot localization. The proposed method formulates the mapping problem as a factor graph, integrating GPS, IMU, and RGB-D observations through robust geometrical constraints that exploit vineyard structure. An efficient perception pipeline based on instance segmentation and tracking, combined with a clustering filter for outlier rejection and pose refinement, enables accurate landmark detection using low-cost sensors and onboard computation. To validate the pipeline, we present a multi-season dataset for trunk and pole segmentation and tracking. Extensive field experiments conducted across diverse seasons demonstrate the robustness and accuracy of the proposed approach, highlighting its suitability for long-term autonomous operation in agricultural environments.
Chinese Translation
在农业环境中长期可靠地部署自主机器人仍然面临挑战,这主要是由于感知混淆、季节性变化以及作物冠层的动态特性。葡萄园以重复的行结构和在物候阶段之间显著的视觉变化为特征,代表了一个关键的领域挑战,限制了传统基于特征的定位和映射方法的鲁棒性。本文介绍了VinePT-Map,一种语义映射框架,利用葡萄藤干和支撑杆作为持久的结构性地标,以实现与季节无关且具有韧性的机器人定位。所提出的方法将映射问题形式化为因子图,通过利用葡萄园结构的稳健几何约束,将GPS、IMU和RGB-D观测数据整合在一起。基于实例分割和跟踪的高效感知管道,结合用于异常值拒绝和姿态优化的聚类滤波器,使得能够使用低成本传感器和车载计算实现准确的地标检测。为了验证该管道,我们提供了一个用于树干与支撑杆分割及跟踪的多季节数据集。在不同季节进行的大规模实地实验展示了所提方法的鲁棒性和准确性,突显了其在农业环境中长期自主操作的适用性。
cs.RO / 32 / 2603.05097

AIM-SLAM: Dense Monocular SLAM via Adaptive and Informative Multi-View Keyframe Prioritization with Foundation Model

AIM-SLAM:通过基础模型实现自适应和信息丰富的多视角关键帧优先级排序的稠密单目SLAM
Jeon, Jinwoo, Seo, Dong-Uk, Lee, Eungchang Mason, Myung, Hyun
Abstract
Recent advances in geometric foundation models have emerged as a promising alternative for addressing the challenge of dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although geometric foundation models enable SLAM to leverage variable input views, previous methods remain confined to two-view pairs or fixed-length inputs without sufficient deliberation of geometric context for view selection. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from the visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information- and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art performance in both pose estimation and dense reconstruction. Our system supports ROS integration, with code available at https://aimslam.github.io/.
Chinese Translation
最近,几何基础模型的进展被视为解决单目视觉同时定位与地图构建(SLAM)中稠密重建挑战的有希望的替代方案。尽管几何基础模型使SLAM能够利用可变输入视图,但之前的方法仍然局限于两视图对或固定长度输入,且在视图选择时对几何上下文的考虑不足。为了解决这个问题,我们提出了AIM-SLAM,一个稠密单目SLAM框架,利用来自视觉几何基础变换器(VGGT)的稠密点图预测进行自适应和信息丰富的多视角关键帧优先级排序。具体而言,我们引入了选择性信息和几何感知的多视角适应(SIGMA)模块,该模块利用体素重叠和信息增益来检索候选关键帧集,并自适应地确定其大小。此外,我们制定了联合多视角Sim(3)优化,强制选定视图之间的一致对齐,显著提高了姿态估计的准确性。AIM-SLAM的有效性在真实世界数据集上得到了验证,在姿态估计和稠密重建方面均达到了最先进的性能。我们的系统支持ROS集成,代码可在https://aimslam.github.io/获取。
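A sketch of the voxel-overlap signal behind keyframe retrieval: bin each pointmap into occupied voxels, compare with a Jaccard ratio, and rank keyframes that see the least-overlapping space as most informative. The voxel size, toy pointmaps, and ranking rule are illustrative; the paper's SIGMA module additionally scores information gain and adapts the candidate-set size.

```python
import math

def voxelize(points, size=0.5):
    """Map a point cloud to its set of occupied voxel indices."""
    return {(math.floor(x / size), math.floor(y / size), math.floor(z / size))
            for x, y, z in points}

def voxel_overlap(a, b, size=0.5):
    """Jaccard overlap of the voxels occupied by two pointmaps."""
    va, vb = voxelize(a, size), voxelize(b, size)
    return len(va & vb) / len(va | vb)

current = [(0.1, 0.1, 0.1), (1.0, 1.0, 1.0)]
kf_near = [(0.2, 0.2, 0.2), (1.1, 1.1, 1.1)]   # sees mostly the same space
kf_far  = [(5.0, 5.0, 5.0), (6.0, 6.0, 6.0)]   # sees new space

# Low overlap = informative: prefer keyframes that add coverage.
keyframes = {"near": kf_near, "far": kf_far}
ranked = sorted(keyframes, key=lambda k: voxel_overlap(current, keyframes[k]))
```

Using set operations over voxel indices keeps the overlap test linear in the number of points, which matters when scoring many candidate keyframes per frame.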
cs.RO / 33 / 2603.05108

GaussTwin: Unified Simulation and Correction with Gaussian Splatting for Robotic Digital Twins

GaussTwin:基于高斯点云的机器人数字双胞胎统一仿真与校正
Cai, Yichen, Jansonnie, Paul, de Farias, Cristiana, Arenz, Oleg, Peters, Jan
Abstract
Digital twins promise to enhance robotic manipulation by maintaining a consistent link between real-world perception and simulation. However, most existing systems struggle with the lack of a unified model, complex dynamic interactions, and the real-to-sim gap, which limits downstream applications such as model predictive control. Thus, we propose GaussTwin, a real-time digital twin that combines position-based dynamics with discrete Cosserat rod formulations for physically grounded simulation, and Gaussian splatting for efficient rendering and visual correction. By anchoring Gaussians to physical primitives and enforcing coherent SE(3) updates driven by photometric error and segmentation masks, GaussTwin achieves stable prediction-correction while preserving physical fidelity. Through experiments in both simulation and on a Franka Research 3 platform, we show that GaussTwin consistently improves tracking accuracy and robustness compared to shape-matching and rigid-only baselines, while also enabling downstream tasks such as push-based planning. These results highlight GaussTwin as a step toward unified, physically meaningful digital twins that can support closed-loop robotic interaction and learning.
Chinese Translation
数字双胞胎有望通过保持现实世界感知与仿真之间的一致联系来增强机器人操作能力。然而,现有的大多数系统面临缺乏统一模型、复杂动态交互以及现实与仿真之间的差距等问题,这限制了模型预测控制等下游应用。因此,我们提出了GaussTwin,这是一种实时数字双胞胎,结合了基于位置的动力学与离散Cosserat杆模型进行物理基础的仿真,并采用高斯点云进行高效渲染和视觉校正。通过将高斯锚定到物理原语,并通过光度误差和分割掩膜强制一致的SE(3)更新,GaussTwin实现了稳定的预测-校正,同时保持物理真实性。通过在仿真和Franka Research 3平台上的实验,我们表明GaussTwin在跟踪精度和鲁棒性方面始终优于形状匹配和仅刚性基线,同时也支持基于推动的规划等下游任务。这些结果突显了GaussTwin作为朝向统一、具有物理意义的数字双胞胎的一步,能够支持闭环机器人交互与学习。
cs.RO / 34 / 2603.05111

SPIRIT: Perceptive Shared Autonomy for Robust Robotic Manipulation under Deep Learning Uncertainty

SPIRIT:在深度学习不确定性下的稳健机器人操作的感知共享自主性
Lee, Jongseok, Balachandran, Ribin, Singh, Harsimran, Feng, Jianxiang, Mishra, Hrishik, De Stefano, Marco, Triebel, Rudolph, Albu-Schaeffer, Alin, Kondak, Konstantin
Abstract
Deep learning (DL) has enabled impressive advances in robotic perception, yet its limited robustness and lack of interpretability hinder reliable deployment in safety-critical applications. We propose a concept termed perceptive shared autonomy, in which uncertainty estimates from DL-based perception are used to regulate the level of autonomy. Specifically, when the robot's perception is confident, semi-autonomous manipulation is enabled to improve performance; when uncertainty increases, control transitions to haptic teleoperation to maintain robustness. In this way, high-performing but uninterpretable DL methods can be integrated safely into robotic systems. A key technical enabler is an uncertainty-aware, DL-based point cloud registration approach built on the so-called Neural Tangent Kernels (NTK). We evaluate perceptive shared autonomy on challenging aerial manipulation tasks through a user study with 15 participants and realizations of mock-up industrial scenarios, demonstrating reliable robotic manipulation despite failures in DL-based perception. The resulting system, named SPIRIT, improves both manipulation performance and system reliability. SPIRIT was selected as a finalist of a major industrial innovation award.
Chinese Translation
深度学习(DL)在机器人感知方面取得了显著进展,但其有限的鲁棒性和缺乏可解释性阻碍了在安全关键应用中的可靠部署。我们提出了一种称为感知共享自主性的概念,其中利用基于深度学习的感知的不确定性估计来调节自主性水平。具体而言,当机器人的感知具有信心时,启用半自主操作以提高性能;当不确定性增加时,控制转变为触觉遥操作以保持鲁棒性。通过这种方式,高性能但不可解释的深度学习方法可以安全地集成到机器人系统中。一个关键的技术支持是基于所谓的神经切线核(Neural Tangent Kernels, NTK)的不确定性感知深度学习点云配准方法。我们通过对15名参与者的用户研究和模拟工业场景的实现,评估了感知共享自主性在具有挑战性的空中操作任务中的表现,尽管在基于深度学习的感知中存在失败,仍然展示了可靠的机器人操作。最终系统命名为SPIRIT,提升了操作性能和系统可靠性。SPIRIT被选为一项主要工业创新奖的决赛入围者。
cs.RO / 35 / 2603.05117

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

SeedPolicy:通过自演化扩散策略实现机器人操作的视野扩展
Gui, Youqiang, Zhou, Yuxuan, Cheng, Shen, Yuan, Xinyang, Fan, Haoqiang, Cheng, Peng, Liu, Shuaicheng
Abstract
Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but suffers performance degradation as observation horizons increase, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that compress long-horizon observations into a fixed-size representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and enables scalable horizon extension with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves 36.8% relative improvement in clean settings and 169% relative improvement in randomized challenging settings over the DP. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves competitive performance with one to two orders of magnitude fewer parameters, demonstrating strong efficiency and scalability. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://github.com/Youqiang-Gui/SeedPolicy.
Chinese Translation
模仿学习(Imitation Learning, IL)使机器人能够从专家演示中获取操作技能。扩散策略(Diffusion Policy, DP)模型能够模拟多模态专家行为,但随着观察视野的增加,其性能会下降,从而限制了长视野操作的能力。我们提出了自演化门控注意力(Self-Evolving Gated Attention, SEGA),这是一个时间模块,通过门控注意力维持一个随时间演变的潜在状态,实现高效的递归更新,将长视野观察压缩为固定大小的表示,同时过滤掉不相关的时间信息。将SEGA集成到DP中形成了自演化扩散策略(SeedPolicy),解决了时间建模瓶颈,并以适度的开销实现了可扩展的视野扩展。在包含50个操作任务的RoboTwin 2.0基准测试中,SeedPolicy的表现优于DP和其他IL基线。在CNN和Transformer骨干网络的平均表现中,SeedPolicy在干净环境下相较于DP实现了36.8%的相对提升,在随机挑战环境下实现了169%的相对提升。与参数量为12亿的视觉-语言-动作模型(如RDT)相比,SeedPolicy在参数量减少一到两个数量级的情况下实现了竞争力的表现,显示出强大的效率和可扩展性。这些结果确立了SeedPolicy作为一种先进的模仿学习方法,适用于长视野的机器人操作。代码可在以下链接获取:https://github.com/Youqiang-Gui/SeedPolicy。
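The abstract's description of SEGA — a fixed-size latent state recurrently updated through gated attention over incoming observations — can be caricatured in a few lines. The weights below are untrained random stand-ins and the scalar attention is a deliberate simplification, so this shows only the constant-memory recurrence, not the actual module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedLatentState:
    """Toy gated recurrent latent state: each observation is attended
    against the current state and a sigmoid gate decides how much of
    the state to overwrite. Any observation horizon is compressed into
    a fixed (d,) vector. Weights are random stand-ins, not trained."""

    def __init__(self, d, rng=None):
        rng = rng or np.random.default_rng(0)
        self.h = np.zeros(d)
        self.Wq = rng.normal(0, d ** -0.5, (d, d))
        self.Wk = rng.normal(0, d ** -0.5, (d, d))
        self.Wv = rng.normal(0, d ** -0.5, (d, d))
        self.Wg = rng.normal(0, d ** -0.5, (d, d))

    def update(self, obs):
        q, k, v = self.Wq @ self.h, self.Wk @ obs, self.Wv @ obs
        attn = softmax(np.array([q @ k, 0.0]))[0]      # scalar attention weight
        gate = 1.0 / (1.0 + np.exp(-(self.Wg @ obs)))  # sigmoid gate, per dim
        self.h = gate * (attn * v) + (1.0 - gate) * self.h
        return self.h

state = GatedLatentState(d=16)
rng = np.random.default_rng(3)
for _ in range(100):            # 100-step horizon, constant-size state
    h = state.update(rng.normal(size=16))
```

The point of the sketch is the cost profile: one update per step regardless of horizon length, in contrast to attending over an ever-growing observation window.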
cs.RO / 36 / 2603.05160

Lifelong Language-Conditioned Robotic Manipulation Learning

终身语言条件下的机器人操控学习
Wang, Xudong, Han, Zebin, Liu, Zhiyu, Li, Gan, Dong, Jiahua, Liu, Baichen, Liu, Lianqing, Han, Zhi
Abstract
Sequential adaptation of traditional language-conditioned manipulation agents to new manipulation skills leads to catastrophic forgetting of old skills, limiting practical deployment in dynamic scenes. In this paper, we propose SkillsCrafter, a novel robotic manipulation framework designed to continually learn multiple skills while reducing catastrophic forgetting of old skills. Specifically, we propose a Manipulation Skills Adaptation module that retains old-skill knowledge while inheriting the knowledge shared between new and old skills to facilitate learning of new skills. Meanwhile, we perform singular value decomposition on the diverse skill instructions to obtain common skill-semantic subspace projection matrices, thereby recording the essential semantic space of skills. To achieve forgetting-free and generalizable manipulation, we propose a Skills Specialization Aggregation that computes inter-skill similarity in the skill-semantic subspaces, aggregating previously learned skill knowledge for any new or unknown skill. Extensive experiments demonstrate the effectiveness and superiority of our proposed SkillsCrafter.
Chinese Translation
传统的语言条件操控代理在对新操控技能进行顺序适应时,会导致对旧技能的灾难性遗忘,从而限制了动态场景的实际部署。本文提出了一种新颖的机器人操控框架SkillsCrafter,旨在持续学习多种技能,同时减少对旧技能的灾难性遗忘。具体而言,我们提出了一种操控技能适应方法,以保留旧技能知识,同时继承新旧技能之间的共享知识,以促进新技能的学习。同时,我们对多样化的技能指令进行奇异值分解,以获得共同的技能语义子空间投影矩阵,从而记录技能的基本语义空间。为了实现无遗忘和泛化操控,我们提出了一种技能专业化聚合方法,以计算技能语义子空间中的技能间相似性,从而实现对任何新技能或未知技能的先前学习技能知识的聚合。大量实验表明了我们提出的SkillsCrafter的有效性和优越性。
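SkillsCrafter's use of singular value decomposition to extract a skill-semantic subspace projection, and of inter-skill similarity computed in those subspaces, can be sketched as follows. This is a toy illustration under assumed shapes, with random stand-in embeddings and hypothetical function names, not the paper's implementation:

```python
import numpy as np

def skill_subspace(instruction_embeddings, k=2):
    """SVD of a skill's instruction-embedding matrix (n_instructions x d);
    the top-k right singular vectors span its semantic subspace. Returns
    the (d, d) projection matrix onto that subspace."""
    _, _, vt = np.linalg.svd(instruction_embeddings, full_matrices=False)
    basis = vt[:k]                 # (k, d), orthonormal rows
    return basis.T @ basis         # projection P = V_k V_k^T

def subspace_similarity(p_a, p_b):
    """Normalized overlap of two k-dim subspaces via their projections:
    trace(P_a P_b) / k is 1.0 for identical subspaces, 0.0 for orthogonal."""
    k = round(np.trace(p_a))       # rank of a projection equals its trace
    return float(np.trace(p_a @ p_b)) / k

# Hypothetical instruction embeddings for one skill (10 instructions, d=8).
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(10, 8))
p_a = skill_subspace(emb_a, k=2)
sim_self = subspace_similarity(p_a, p_a)
```

A similarity of this kind could weight how much of each previously learned skill's knowledge is aggregated when a new or unknown skill arrives, which is the role the abstract assigns to the Skills Specialization Aggregation.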
cs.RO / 37 / 2603.05185

Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation

循环中的评论者:一种用于鲁棒长时间操作的三系统视觉-语言-动作框架
Yi, Pengfei, Ma, Yingjie, Xu, Wenjiang, Hao, Yanan, Gan, Shuai, Li, Wanting, Zhong, Shanlin
Abstract
Balancing high-level semantic reasoning with low-level reactive control remains a core challenge in visual robotic manipulation. While Vision-Language Models (VLMs) excel at cognitive planning, their inference latency precludes real-time execution. Conversely, fast Vision-Language-Action (VLA) models often lack the semantic depth required for complex, long-horizon tasks. To bridge this gap, we introduce Critic in the Loop, an adaptive hierarchical framework driven by dynamic VLM-Expert scheduling. At its core is a bionic Tri-System architecture comprising a VLM brain for global reasoning, a VLA cerebellum for reactive execution, and a lightweight visual Critic. By continuously monitoring the workspace, the Critic dynamically routes control authority. It sustains rapid closed-loop execution via the VLA for routine subtasks, and adaptively triggers the VLM for replanning upon detecting execution anomalies such as task stagnation or failures. Furthermore, our architecture seamlessly integrates human-inspired rules to intuitively break infinite retry loops. This visually-grounded scheduling minimizes expensive VLM queries, while substantially enhancing system robustness and autonomy in out-of-distribution (OOD) scenarios. Comprehensive experiments on challenging, long-horizon manipulation benchmarks reveal that our approach achieves state-of-the-art performance.
Chinese Translation
在视觉机器人操作中,平衡高层语义推理与低层反应控制仍然是一个核心挑战。虽然视觉-语言模型(VLMs)在认知规划方面表现出色,但其推理延迟阻碍了实时执行。相反,快速的视觉-语言-动作(VLA)模型往往缺乏复杂长时间任务所需的语义深度。为了解决这一问题,我们提出了循环中的评论者(Critic in the Loop),这是一个由动态VLM-专家调度驱动的自适应分层框架。其核心是一个仿生三系统架构,包括一个用于全局推理的VLM大脑、一个用于反应执行的VLA小脑,以及一个轻量级视觉评论者。评论者通过持续监控工作空间,动态地分配控制权。它通过VLA维持常规子任务的快速闭环执行,并在检测到执行异常(如任务停滞或失败)时,自适应地触发VLM进行重新规划。此外,我们的架构无缝集成了人类启发的规则,以直观地打破无限重试循环。这种基于视觉的调度最小化了昂贵的VLM查询,同时显著增强了系统在分布外(OOD)场景中的鲁棒性和自主性。在具有挑战性的长时间操作基准上的全面实验表明,我们的方法达到了最先进的性能。
cs.RO / 38 / 2603.05252

Rethinking the Role of Collaborative Robots in Rehabilitation

重新思考协作机器人在康复中的角色
Gupte, Vivek, Rajapakshe, Shalutha, Senft, Emmanuel
Abstract
Current research on collaborative robots (cobots) in physical rehabilitation largely focuses on repeated motion training for people undergoing physical therapy (PuPT), even though these sessions include phases that could benefit from robotic collaboration and assistance. Meanwhile, access to physical therapy remains limited for people with disabilities and chronic illnesses. Cobots could support both PuPT and therapists, and improve access to therapy, yet their broader potential remains underexplored. We propose extending the scope of cobots by imagining their role in assisting therapists and PuPT before, during, and after a therapy session. We discuss how cobot assistance may lift access barriers by promoting ability-based therapy design and helping therapists manage their time and effort. Finally, we highlight challenges to realizing these roles, including advancing user-state understanding, ensuring safety, and integrating cobots into therapists' workflow. This view opens new research questions and opportunities to draw from the HRI community's advances in assistive robotics.
Chinese Translation
当前关于协作机器人(cobots)在物理康复中的研究主要集中在对接受物理治疗(PuPT)的人进行重复运动训练,尽管这些治疗阶段中有许多环节可以受益于机器人的协作和辅助。同时,残疾人士和慢性病患者获得物理治疗的机会仍然有限。协作机器人可以支持PuPT和治疗师,并改善治疗的可及性,但它们更广泛的潜力仍未得到充分探索。我们建议通过设想协作机器人在治疗师和PuPT之间在治疗前、治疗中和治疗后提供支持的角色,来扩展协作机器人的应用范围。我们讨论了协作机器人如何通过促进基于能力的治疗设计以及帮助治疗师管理时间和精力,来消除获取治疗的障碍。最后,我们强调了实现这些角色所面临的挑战,包括提升用户状态理解、确保安全性以及将协作机器人融入治疗师的工作流程。这一观点为研究提供了新的问题和机会,可以借鉴人机交互(HRI)领域在辅助机器人方面的进展。
cs.RO / 39 / 2603.05268

Curve-Induced Dynamical Systems on Riemannian Manifolds and Lie Groups

黎曼流形与李群上的曲线诱导动态系统
Bakker, Saray, Schonger, Martin, Löw, Tobias, Alonso-Mora, Javier, Calinon, Sylvain
Abstract
Deploying robots in household environments requires safe, adaptable, and interpretable behaviors that respect the geometric structure of tasks. Often represented on Lie groups and Riemannian manifolds, this includes poses on SE(3) or symmetric positive definite matrices encoding stiffness or damping matrices. In this context, dynamical system-based approaches offer a natural framework for generating such behavior, providing stability and convergence while remaining responsive to changes in the environment. We introduce Curve-induced Dynamical systems on Smooth Manifolds (CDSM), a real-time framework for constructing dynamical systems directly on Riemannian manifolds and Lie groups. The proposed approach constructs a nominal curve on the manifold, and generates a dynamical system which combines a tangential component that drives motion along the curve and a normal component that attracts the state toward the curve. We provide a stability analysis of the resulting dynamical system and validate the method quantitatively. On an S2 benchmark, CDSM demonstrates improved trajectory accuracy, reduced path deviation, and faster generation and query times compared to state-of-the-art methods. Finally, we demonstrate the practical applicability of the framework on both a robotic manipulator, where poses on SE(3) and damping matrices on SPD(n) are adapted online, and a mobile manipulator.
Chinese Translation
在家庭环境中部署机器人需要安全、适应性强且可解释的行为,这些行为必须尊重任务的几何结构。这通常在李群和黎曼流形上表示,包括在 SE(3) 上的姿态或编码刚度或阻尼矩阵的对称正定矩阵。在这种背景下,基于动态系统的方法提供了一个自然的框架来生成这样的行为,提供稳定性和收敛性,同时对环境变化保持响应。我们引入了光滑流形上的曲线诱导动态系统(Curve-induced Dynamical systems on Smooth Manifolds, CDSM),这是一个实时框架,用于直接在黎曼流形和李群上构建动态系统。所提出的方法在流形上构建一个名义曲线,并生成一个动态系统,该系统结合了一个沿曲线驱动运动的切向分量和一个将状态吸引到曲线上的法向分量。我们提供了所得到的动态系统的稳定性分析,并对该方法进行了定量验证。在 S2 基准测试中,CDSM 显示出相较于最先进的方法,轨迹精度提高、路径偏差减少以及生成和查询时间更快。最后,我们在一个机器人操纵器上展示了该框架的实际应用,其中 SE(3) 上的姿态和 SPD(n) 上的阻尼矩阵在线适应,以及在一个移动操纵器上的应用。
cs.RO / 40 / 2603.05279

From Code to Road: A Vehicle-in-the-Loop and Digital Twin-Based Framework for Central Car Server Testing in Autonomous Driving

从代码到道路:基于车辆环路和数字双胞胎的自主驾驶中央车载服务器测试框架
Wu, Chengdong, Kirchner, Sven, Purschke, Nils, Torschmied, Axel, Kroth, Norbert, Song, Yinglei, Schamschurko, André, Haß, Erik Leo, Chao, Kuo-Yi, Zhang, Yi, Petrovic, Nenad, Knoll, Alois C.
Abstract
Simulation is one of the most essential parts in the development stage of automotive software. However, purely virtual simulations often struggle to accurately capture all real-world factors due to limitations in modeling. To address this challenge, this work presents a test framework for automotive software on the centralized E/E architecture, which is a central car server in our case, based on Vehicle-in-the-Loop (ViL) and digital twin technology. The framework couples a physical test vehicle on a dynamometer test bench with its synchronized virtual counterpart in a simulation environment. Our approach provides a safe, reproducible, realistic, and cost-effective platform for validating autonomous driving algorithms with a centralized architecture. This test method eliminates the need to test individual physical ECUs and their communication protocols separately. In contrast to traditional ViL methods, the proposed framework runs the full autonomous driving software directly on the vehicle hardware after the simulation process, eliminating flashing and intermediate layers while enabling seamless virtual-physical integration and accurately reflecting centralized E/E behavior. In addition, incorporating mixed testing in both simulated and physical environments reduces the need for full hardware integration during the early stages of automotive development. Experimental case studies demonstrate the effectiveness of the framework in different test scenarios. These findings highlight the potential to reduce development and integration efforts for testing autonomous driving pipelines in the future.
Chinese Translation
仿真是汽车软件开发阶段中最重要的部分之一。然而,纯虚拟仿真由于建模的局限性,往往难以准确捕捉所有现实世界因素。为了解决这一挑战,本文提出了一种基于车辆环路(Vehicle-in-the-Loop, ViL)和数字双胞胎技术的集中式电子/电气架构下的汽车软件测试框架,以中央车载服务器为例。该框架将一辆物理测试车辆与其在仿真环境中同步的虚拟对应物结合在一起,运行在一个测功机测试台上。我们的方法提供了一个安全、可重复、真实且具有成本效益的平台,用于验证具有集中式架构的自主驾驶算法。这种测试方法消除了单独测试各个物理电子控制单元(ECU)及其通信协议的需要。与传统的ViL方法相比,所提出的框架在仿真过程后直接在车辆硬件上运行完整的自主驾驶软件,消除了闪存和中间层,同时实现了无缝的虚拟-物理集成,并准确反映集中式电子/电气行为。此外,在模拟和物理环境中结合混合测试,减少了汽车开发早期阶段对完整硬件集成的需求。实验案例研究展示了该框架在不同测试场景中的有效性。这些发现突显了未来减少自主驾驶管道测试的开发和集成工作量的潜力。
cs.RO / 41 / 2603.05291

Iterative On-Policy Refinement of Hierarchical Diffusion Policies for Language-Conditioned Manipulation

基于语言条件的操作的分层扩散策略的迭代同策略优化
Grislain, Clemence, Sigaud, Olivier, Chetouani, Mohamed
Abstract
Hierarchical policies for language-conditioned manipulation decompose tasks into subgoals, where a high-level planner guides a low-level controller. However, these hierarchical agents often fail because the planner generates subgoals without considering the actual limitations of the controller. Existing solutions attempt to bridge this gap via intermediate modules or shared representations, but they remain limited by their reliance on fixed offline datasets. We propose HD-ExpIt, a framework for iterative fine-tuning of hierarchical diffusion policies via environment feedback. HD-ExpIt organizes training into a self-reinforcing cycle: it utilizes diffusion-based planning to autonomously discover successful behaviors, which are then distilled back into the hierarchical policy. This loop enables both components to improve while implicitly grounding the planner in the controller's actual capabilities without requiring explicit proxy models. Empirically, HD-ExpIt significantly improves hierarchical policies trained solely on offline data, achieving state-of-the-art performance on the long-horizon CALVIN benchmark among methods trained from scratch.
Chinese Translation
基于语言条件的操作的分层策略将任务分解为子目标,其中高层规划者指导低层控制器。然而,这些分层代理通常会失败,因为规划者生成的子目标未考虑控制器的实际限制。现有解决方案试图通过中间模块或共享表示来弥补这一差距,但仍然受到对固定离线数据集依赖的限制。我们提出了 HD-ExpIt,一个通过环境反馈进行分层扩散策略迭代微调的框架。HD-ExpIt 将训练组织成一个自我强化的循环:它利用基于扩散的规划自主发现成功行为,然后将这些行为提炼回分层策略中。这个循环使两个组件都能得到改进,同时在无需显式代理模型的情况下,隐式地将规划者锚定于控制器的实际能力。从实证上看,HD-ExpIt 显著改善了仅基于离线数据训练的分层策略,在从零开始训练的方法中,在长时间跨度的 CALVIN 基准测试中达到了最先进的性能。
cs.RO / 42 / 2603.05296

Latent Policy Steering through One-Step Flow Policies

通过一步流策略进行潜在策略引导
Im, Hokyun, Kolobov, Andrey, Fu, Jianlong, Lee, Youngwoon
Abstract
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
Chinese Translation
离线强化学习(RL)使机器人能够从离线数据集中学习,而无需进行风险探索。然而,离线RL的性能往往依赖于(1)收益最大化,这可能会使策略超出数据集支持范围,以及(2)行为约束,这通常需要敏感的超参数调优。潜在引导提供了一种在RL过程中保持在数据集支持范围内的结构性方法,但现有的离线适应通常通过间接蒸馏学习的潜在空间评论员来近似动作值,这可能会丢失信息并阻碍收敛。我们提出了潜在策略引导(Latent Policy Steering, LPS),它通过可微分的一步MeanFlow策略反向传播原始动作空间的Q梯度来更新潜在动作空间的演员,从而实现高保真度的潜在策略改进。通过消除代理潜在评论员,LPS允许原始动作空间的评论员引导端到端的潜在空间优化,而一步MeanFlow策略则作为行为约束的生成先验。这种解耦产生了一种稳健的方法,能够开箱即用,且调优最小。在OGBench和现实世界的机器人任务中,LPS实现了最先进的性能,并始终优于行为克隆和强大的潜在引导基线。
cs.RO / 43 / 2603.05309

Constraint-Free Static Modeling of Continuum Parallel Robot

连续并联机器人的无约束静态建模
Xun, Lingxiao, Diezinger, Matyas, Artinian, Azad, Laurent, Guillaume, Tamadazte, Brahim
Abstract
Continuum parallel robots (CPR) combine rigid actuation mechanisms with multiple elastic rods in a closed-loop topology, making forward statics challenging when rigid–continuum junctions are enforced by explicit kinematic constraints. Such constraint-based formulations typically introduce additional algebraic variables and complicate both numerical solution and downstream control. This paper presents a geometrically exact, configuration-based, and constraint-free static model of CPR that remains valid under geometrically nonlinear, large-deformation and large-rotation conditions. Connectivity constraints are eliminated by kinematic embedding, yielding a reduced unconstrained problem. Each rod of CPR is discretized by nodal poses on SE(3), while the element-wise strain field is reconstructed through a linear strain parameterization. A fourth-order Magnus approximation yields an explicit and geometrically consistent mapping between element end poses and the strain. Rigid attachments at the motor-driven base and the end-effector platforms are handled through kinematic embeddings. Based on total potential energy and virtual work, we derive assembly-ready residuals and explicit Newton tangents, and solve the resulting nonlinear equilibrium equations using a Riemannian Newton iteration on the product manifold. Experiments on a three-servomotor, six-rod prototype validate the model by showing good agreement between simulation and measurements for both unloaded motions and externally loaded cases.
Chinese Translation
连续并联机器人(CPR)将刚性驱动机制与多个弹性杆结合在一个闭环拓扑中,使得在施加刚性-连续连接的显式运动约束时,正向静态分析变得具有挑战性。这种基于约束的公式通常会引入额外的代数变量,从而使数值解法和后续控制变得复杂。本文提出了一种几何精确、基于配置且无约束的CPR静态模型,该模型在几何非线性、大变形和大旋转条件下仍然有效。通过运动嵌入消除了连接约束,从而得到了一个简化的无约束问题。CPR的每根杆通过在SE(3)上的节点姿态进行离散化,而元素级的应变场则通过线性应变参数化进行重建。四阶马格努斯近似提供了元素末端姿态与应变之间的显式且几何一致的映射。通过运动嵌入处理电机驱动基座和末端执行器平台的刚性连接。基于总势能和虚功,我们推导出组装就绪的残差和显式牛顿切线,并使用在乘积流形上的黎曼牛顿迭代法求解得到的非线性平衡方程。对一个三伺服电机、六根杆的原型进行的实验验证了该模型,显示出在无载运动和外部载荷情况下,仿真与测量之间具有良好的一致性。
cs.RO / 44 / 2603.05312

UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data

UltraDexGrasp:基于合成数据学习双手机器人通用灵巧抓取
Yang, Sizhe, Xie, Yiman, Liang, Zhixuan, Tian, Yang, Zeng, Jia, Lin, Dahua, Pang, Jiangmiao
Abstract
Grasping is a fundamental capability for robots to interact with the physical world. Humans, equipped with two hands, autonomously select appropriate grasp strategies based on the shape, size, and weight of objects, enabling robust grasping and subsequent manipulation. In contrast, current robotic grasping remains limited, particularly in multi-strategy settings. Although substantial efforts have targeted parallel-gripper and single-hand grasping, dexterous grasping for bimanual robots remains underexplored, with data being a primary bottleneck. Achieving physically plausible and geometrically conforming grasps that can withstand external wrenches poses significant challenges. To address these issues, we introduce UltraDexGrasp, a framework for universal dexterous grasping with bimanual robots. The proposed data-generation pipeline integrates optimization-based grasp synthesis with planning-based demonstration generation, yielding high-quality and diverse trajectories across multiple grasp strategies. With this framework, we curate UltraDexGrasp-20M, a large-scale, multi-strategy grasp dataset comprising 20 million frames across 1,000 objects. Based on UltraDexGrasp-20M, we further develop a simple yet effective grasp policy that takes point clouds as input, aggregates scene features via unidirectional attention, and predicts control commands. Trained exclusively on synthetic data, the policy achieves robust zero-shot sim-to-real transfer and consistently succeeds on novel objects with varied shapes, sizes, and weights, attaining an average success rate of 81.2% in real-world universal dexterous grasping. To facilitate future research on grasping with bimanual robots, we open-source the data generation pipeline at https://github.com/InternRobotics/UltraDexGrasp.
Chinese Translation
抓取是机器人与物理世界互动的基本能力。人类凭借双手,能够根据物体的形状、大小和重量自主选择合适的抓取策略,从而实现稳健的抓取和后续操作。相比之下,目前机器人抓取仍然有限,特别是在多策略设置中。尽管已有大量研究针对平行夹持器和单手抓取,但双手机器人的灵巧抓取仍然未得到充分探索,数据成为主要瓶颈。实现物理上合理且几何上符合的抓取,能够承受外部扭矩,面临重大挑战。为了解决这些问题,我们提出了UltraDexGrasp,一个用于双手机器人通用灵巧抓取的框架。该数据生成管道将基于优化的抓取合成与基于规划的演示生成相结合,产生高质量和多样化的轨迹,涵盖多种抓取策略。通过该框架,我们整理了UltraDexGrasp-20M,一个大规模多策略抓取数据集,包含1,000个物体的2,000万帧数据。基于UltraDexGrasp-20M,我们进一步开发了一种简单而有效的抓取策略,该策略以点云为输入,通过单向注意力聚合场景特征,并预测控制命令。该策略仅在合成数据上训练,实现了稳健的零样本模拟到现实转移,并在形状、大小和重量各异的新物体上持续成功,实际通用灵巧抓取的平均成功率达到81.2%。为了促进未来双手机器人抓取的研究,我们在 https://github.com/InternRobotics/UltraDexGrasp 上开源了数据生成管道。
cs.RO / 45 / 2603.05333

CT-Enabled Patient-Specific Simulation and Contact-Aware Robotic Planning for Cochlear Implantation

基于CT的患者特异性仿真与接触感知机器人规划用于人工耳蜗植入
Xun, Lingxiao, Zheng, Gang, Kruszewski, Alexandre, Torres, Renato
Abstract
Robotic cochlear-implant (CI) insertion requires precise prediction and regulation of contact forces to minimize intracochlear trauma and prevent failure modes such as locking and buckling. Aligned with the integration of advanced medical imaging and robotics for autonomous, precision interventions, this paper presents a unified CT-to-simulation pipeline for contact-aware insertion planning and validation. We develop a low-dimensional, differentiable Cosserat-rod model of the electrode array coupled with frictional contact and pseudo-dynamics regularization to ensure continuous stick-slip transitions. Patient-specific cochlear anatomy is reconstructed from CT imaging and encoded via an analytic parametrization of the scala-tympani lumen, enabling efficient and differentiable contact queries through closest-point projection. Based on a differentiated equilibrium-constraint formulation, we derive an online direction-update law under an RCM-like constraint that suppresses lateral insertion forces while maintaining axial advancement. Simulations and benchtop experiments validate deformation and force trends, demonstrating reduced locking/buckling risk and improved insertion depth. The study highlights how CT-based imaging enhances modeling, planning, and safety capabilities in robot-assisted inner-ear procedures.
Chinese Translation
机器人人工耳蜗(CI)植入需要精确预测和调节接触力,以最小化耳蜗内创伤并防止如锁定和弯曲等失效模式。与先进医学成像和机器人技术在自主精确干预中的整合相一致,本文提出了一种统一的CT到仿真的管道,用于接触感知的插入规划和验证。我们开发了一种低维、可微分的Cosserat杆模型,结合摩擦接触和伪动力学正则化,以确保连续的粘滑过渡。患者特异性的耳蜗解剖结构通过CT成像重建,并通过scala-tympani腔的解析参数化进行编码,从而实现通过最近点投影进行高效且可微分的接触查询。基于差分平衡约束的公式,我们推导出在类似RCM约束下的在线方向更新法则,该法则抑制侧向插入力,同时保持轴向推进。仿真和台式实验验证了变形和力的趋势,显示出降低锁定/弯曲风险和改善插入深度。本研究强调了基于CT的成像如何增强机器人辅助内耳手术中的建模、规划和安全能力。
cs.RO / 46 / 2603.05355

Omni-Manip: Beyond-FOV Large-Workspace Humanoid Manipulation with Omnidirectional 3D Perception

Omni-Manip:超视场大工作空间类人操控与全向三维感知
Qu, Pei, Li, Zheng, Jia, Yufei, Liu, Ziyun, Zhu, Liang, Li, Haoang, Zhou, Jinni, Ma, Jun
Abstract
The deployment of humanoid robots for dexterous manipulation in unstructured environments remains challenging due to perceptual limitations that constrain the effective workspace. In scenarios where physical constraints prevent the robot from repositioning itself, maintaining omnidirectional awareness becomes far more critical than color or semantic information. While recent advances in visuomotor policy learning have improved manipulation capabilities, conventional RGB-D solutions suffer from narrow fields of view (FOV) and self-occlusion, requiring frequent base movements that introduce motion uncertainty and safety risks. Existing approaches to expanding perception, including active vision systems and third-view cameras, introduce mechanical complexity, calibration dependencies, and latency that hinder reliable real-time performance. In this work, we propose Omni-Manip, an end-to-end LiDAR-driven 3D visuomotor policy that enables robust manipulation in large workspaces. Our method processes panoramic point clouds through a Time-Aware Attention Pooling mechanism, efficiently encoding sparse 3D data while capturing temporal dependencies. This 360° perception allows the robot to interact with objects across wide areas without frequent repositioning. To support policy learning, we develop a whole-body teleoperation system for efficient data collection on full-body coordination. Extensive experiments in simulation and real-world environments show that Omni-Manip achieves robust performance in large-workspace and cluttered scenarios, outperforming baselines that rely on egocentric depth cameras.
Chinese Translation
在非结构化环境中部署类人机器人进行灵巧操控仍然面临挑战,主要由于感知限制限制了有效工作空间。在物理约束阻止机器人重新定位的情况下,保持全向感知变得比颜色或语义信息更为关键。尽管最近在视觉运动策略学习方面的进展提高了操控能力,但传统的RGB-D解决方案受限于狭窄的视场(FOV)和自遮挡,需频繁移动底座,从而引入运动不确定性和安全风险。现有的扩展感知的方法,包括主动视觉系统和第三方视角相机,引入了机械复杂性、校准依赖性和延迟,妨碍了可靠的实时性能。在本研究中,我们提出了Omni-Manip,一种端到端的激光雷达驱动的三维视觉运动策略,能够在大工作空间中实现稳健的操控。我们的方法通过时间感知注意力池化机制处理全景点云,高效编码稀疏的三维数据,同时捕捉时间依赖性。这种360°的感知使得机器人能够在广阔区域内与物体进行交互,而无需频繁重新定位。为了支持策略学习,我们开发了一个全身遥操作系统,以便高效收集全身协调的数据。在仿真和真实环境中的大量实验表明,Omni-Manip在大工作空间和杂乱场景中实现了稳健的性能,超越了依赖自我中心深度相机的基线方法。
cs.RO / 47 / 2603.05377

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

OpenFrontier:基于视觉-语言的通用导航前沿
Padilla, Esteban, Sun, Boyang, Pollefeys, Marc, Blum, Hermann
Abstract
Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.
Chinese Translation
开放世界导航要求机器人在复杂的日常环境中做出决策,同时适应灵活的任务需求。传统的导航方法通常依赖于密集的3D重建和手工设计的目标度量,这限制了它们在不同任务和环境中的泛化能力。最近在视觉-语言导航(VLN)和视觉-语言-动作(VLA)模型方面的进展使得基于自然语言的端到端策略成为可能,但通常需要交互式训练、大规模数据收集或针对特定任务的微调。我们将导航问题表述为稀疏子目标识别和达成问题,并观察到为高层语义先验提供视觉锚定目标能够实现高效的目标条件导航。基于这一见解,我们选择导航前沿作为语义锚点,并提出OpenFrontier,一个无需训练的导航框架,能够无缝集成多种视觉-语言先验模型。OpenFrontier通过轻量级系统设计实现高效导航,无需密集的3D映射、策略训练或模型微调。我们在多个导航基准上评估OpenFrontier,展示了其强大的零样本性能,以及在移动机器人上的有效实际部署。
cs.RO / 48 / 2603.05385

Accelerating Sampling-Based Control via Learned Linear Koopman Dynamics

通过学习的线性库普曼动力学加速基于采样的控制
Hao, Wenjian, Fang, Yuxuan, Lu, Zehui, Mou, Shaoshuai
Abstract
This paper presents an efficient model predictive path integral (MPPI) control framework for systems with complex nonlinear dynamics. To improve the computational efficiency of classic MPPI while preserving control performance, we replace the nonlinear dynamics used for trajectory propagation with a learned linear deep Koopman operator (DKO) model, enabling faster rollout and more efficient trajectory sampling. The DKO dynamics are learned directly from interaction data, eliminating the need for analytical system models. The resulting controller, termed MPPI-DK, is evaluated in simulation on pendulum balancing and surface vehicle navigation tasks, and validated on hardware through reference-tracking experiments on a quadruped robot. Experimental results demonstrate that MPPI-DK achieves control performance close to MPPI with true dynamics while substantially reducing computational cost, enabling efficient real-time control on robotic platforms.
Chinese Translation
本文提出了一种高效的模型预测路径积分(MPPI)控制框架,适用于具有复杂非线性动力学的系统。为了在保持控制性能的同时提高经典MPPI的计算效率,我们用学习到的线性深度库普曼算子(DKO)模型替代用于轨迹传播的非线性动力学,从而实现更快的展开和更高效的轨迹采样。DKO动力学直接从交互数据中学习,消除了对解析系统模型的需求。所得到的控制器称为MPPI-DK,在摆平衡和地面车辆导航任务的仿真中进行了评估,并通过在四足机器人上的参考跟踪实验进行了硬件验证。实验结果表明,MPPI-DK在控制性能上接近于使用真实动力学的MPPI,同时显著降低了计算成本,使得在机器人平台上实现高效的实时控制成为可能。
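The core of MPPI-DK — rolling out a learned linear model z' = Az + Bu in place of the nonlinear dynamics, then averaging sampled controls with path-integral weights — can be sketched as follows. The double-integrator matrices, horizon, and temperature below are toy assumptions (a system that is already linear, standing in for a learned DKO lifting), not the paper's setup:

```python
import numpy as np

def mppi_koopman(z0, A, B, C, goal, n_samples=256, horizon=20,
                 sigma=0.5, lam=1.0, rng=None):
    """One MPPI step: sample control sequences, roll out the linear
    Koopman model z_{t+1} = A z_t + B u_t in a single batch, decode the
    state x = C z, and return the cost-weighted average first control."""
    rng = rng or np.random.default_rng(0)
    m = B.shape[1]
    u = rng.normal(0.0, sigma, size=(n_samples, horizon, m))
    costs = np.zeros(n_samples)
    z = np.tile(z0, (n_samples, 1))             # batched latent state
    for t in range(horizon):
        z = z @ A.T + u[:, t] @ B.T             # linear rollout: cheap
        x = z @ C.T                             # decode to observed state
        costs += np.sum((x - goal) ** 2, axis=1)
    w = np.exp(-(costs - costs.min()) / lam)    # path-integral weights
    w /= w.sum()
    return w @ u[:, 0]                          # weighted first control

# Toy double integrator in (already lifted) linear form.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
C = np.eye(2)
u_star = mppi_koopman(np.array([1.0, 0.0]), A, B, C, goal=np.zeros(2))
```

Because the rollout is a chain of matrix products, all samples propagate in one batched operation per time step — the source of the speedup over propagating a nonlinear simulator sample by sample.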
cs.RO / 49 / 2603.05397

Loop Closure via Maximal Cliques in 3D LiDAR-Based SLAM

基于3D LiDAR的SLAM中通过最大团实现回环闭合
Laserna, Javier, Gupta, Saurabh, Mozos, Oscar Martinez, Stachniss, Cyrill, Segundo, Pablo San
Abstract
Reliable loop closure detection remains a critical challenge in 3D LiDAR-based SLAM, especially under sensor noise, environmental ambiguity, and viewpoint variation conditions. RANSAC is often used in the context of loop closures for geometric model fitting in the presence of outliers. However, this approach may fail, leading to map inconsistency. We introduce a novel deterministic algorithm, CliReg, for loop closure validation that replaces RANSAC verification with a maximal clique search over a compatibility graph of feature correspondences. This formulation avoids random sampling and increases robustness in the presence of noise and outliers. We integrated our approach into a real-time pipeline employing binary 3D descriptors and matching based on a Hamming-distance embedding binary search tree. We evaluated it on multiple real-world datasets featuring diverse LiDAR sensors. The results demonstrate that our proposed technique consistently achieves a lower pose error and more reliable loop closures than RANSAC, especially in sparse or ambiguous conditions. Additional experiments on 2D projection-based maps confirm its generality across spatial domains, making our approach a robust and efficient alternative for loop closure detection.
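The core idea can be sketched independently of the CliReg implementation: correspondences that survive a rigid motion must preserve pairwise distances, so mutually consistent matches form a clique in a compatibility graph. A minimal sketch with a Bron-Kerbosch maximum-clique search (the tolerance and test geometry are illustrative):

```python
import numpy as np
from itertools import combinations

def compatibility_graph(src, dst, tol=0.05):
    """Edge (i, j) iff correspondences i and j preserve pairwise distance,
    the consistency a rigid transform must satisfy."""
    n = len(src)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if abs(np.linalg.norm(src[i] - src[j]) - np.linalg.norm(dst[i] - dst[j])) < tol:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def max_clique(adj):
    """Bron-Kerbosch with pivoting; returns one maximum clique (set of node indices)."""
    best = set()
    def bk(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            bk(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    bk(set(), set(adj), set())
    return best
```

Unlike RANSAC, the result is deterministic: the largest mutually consistent correspondence set is recovered exactly, with no random sampling.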
cs.RO / 50 / 2603.05404

ROScopter: A Multirotor Autopilot based on ROSflight 2.0

Moore, Jacob, Reid, Ian, Tokumaru, Phil, McLain, Tim
Abstract
ROScopter is a lean multirotor autopilot built for researchers. ROScopter seeks to accelerate simulation and hardware testing of research code with an architecture that is both easy to understand and simple to modify. ROScopter is designed to interface with ROSflight 2.0 and runs entirely on an onboard flight computer, leveraging the features of ROS 2 to improve modularity. This work describes the architecture of ROScopter and how it can be used to test application code in both simulated and hardware environments. Hardware results of the default ROScopter behavior are presented, showing that ROScopter achieves similar performance to another state-of-the-art autopilot for basic waypoint-following maneuvers, but with a significantly reduced and more modular code-base.
cs.RO / 51 / 2603.05410

PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking

Qin, Weikai, Wu, Sichen, Chen, Ci, Liu, Mengfan, Feng, Linxi, Cui, Xinru, Han, Haoqi, Wang, Hesheng
Abstract
In the domain of humanoid robot control, the fusion of Vision-Language-Action (VLA) with whole-body control is essential for semantically guided execution of real-world tasks. However, existing methods encounter challenges in terms of low VLA inference efficiency or an absence of effective semantic guidance for whole-body control, resulting in instability in dynamic limb-coordinated tasks. To bridge this gap, we present a semantic-motion intent guided, physics-aware multi-brain VLA framework for humanoid whole-body control. A series of experiments was conducted to evaluate the performance of the proposed framework. The experimental results demonstrated that the framework enabled reliable vision-language-guided full-body coordination for humanoid robots.
cs.RO / 52 / 2603.05448

Residual RL-MPC for Robust Microrobotic Cell Pushing Under Time-Varying Flow

Yang, Yanda, Das, Sambeeta
Abstract
Contact-rich micromanipulation in microfluidic flow is challenging because small disturbances can break pushing contact and induce large lateral drift. We study planar cell pushing with a magnetic rolling microrobot that tracks a waypoint-sampled reference curve under time-varying Poiseuille flow. We propose a hybrid controller that augments a nominal MPC with a learned residual policy trained by SAC. The policy outputs a bounded 2D velocity correction that is contact-gated, so residual actions are applied only during robot-cell contact, preserving reliable approach behavior and stabilizing learning. All methods share the same actuation interface and speed envelope for fair comparisons. Experiments show improved robustness and tracking accuracy over pure MPC and PID under nonstationary flow, with generalization from a clover training curve to unseen circle and square trajectories. A residual-bound sweep identifies an intermediate correction limit as the best trade-off, which we use in all benchmarks.
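A minimal sketch of the contact-gated, bounded residual correction described above (function and parameter names are hypothetical):

```python
import numpy as np

def contact_gated_command(u_mpc, residual, in_contact, bound=0.2):
    """Final velocity command: the nominal MPC action plus a bounded learned
    correction that is applied only while the microrobot is in contact with the
    cell; away from contact the controller falls back to pure MPC."""
    correction = np.clip(residual, -bound, bound) if in_contact else np.zeros_like(residual)
    return u_mpc + correction
```

The gate keeps the approach phase identical to the nominal controller, so the residual policy only ever has to learn corrections for the contact-rich pushing phase.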
cs.RO / 53 / 2603.05487

Observing and Controlling Features in Vision-Language-Action Models

Buurmeijer, Hugo, Alonso, Carmen Amo, Swann, Aiden, Pavone, Marco
Abstract
Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space, and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output towards a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. We demonstrate on different VLA architectures ($\pi_{0.5}$ and OpenVLA) through simulation experiments that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.
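For a feature that is linearly encoded, both concepts admit a compact sketch: observation is a linear readout, and a minimum-norm additive edit places that readout at any target value. The closed form below is a stand-in for the paper's optimal-control-grounded intervention (all names are illustrative):

```python
import numpy as np

def observe(h, w, b=0.0):
    """Feature-observability: read a linearly encoded feature from a hidden
    representation h with a linear classifier (w, b)."""
    return h @ w + b

def steer(h, w, target, b=0.0):
    """Feature-controllability: the minimum-norm additive edit to h that places
    the linear readout exactly at `target`. The edit is parallel to w, so it is
    the lightest intervention that achieves the desired feature value."""
    return h + (target - (h @ w + b)) / (w @ w) * w

rng = np.random.default_rng(0)
h = rng.normal(size=16)   # a hidden representation
w = rng.normal(size=16)   # a probe direction fit beforehand
h_new = steer(h, w, target=2.0)
```

Because the edit lives entirely along the probe direction, components of the representation orthogonal to the feature are untouched, which is what lets such interventions steer behavior while preserving closed-loop capabilities.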
cs.RO / 54 / 2603.05497

Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions

Yang, Lizhi, Bena, Ryan M., Wilkinson, Meg, Bahati, Gilbert, Brenes, Andy Navarro, Cosner, Ryan K., Ames, Aaron D.
Abstract
Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to navigate semantically rich, dynamic environments with context-dependent safety margins while maintaining rigorous safety guarantees.
cs.RO / 55 / 2603.05504

RoboPocket: Improve Robot Policies Instantly with Your Phone

Fang, Junjie, Chen, Wendi, Xue, Han, Zhou, Fangyuan, Le, Tian, Wang, Yi, Zhang, Yuting, Lv, Jun, Wen, Chuan, Lu, Cewu
Abstract
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with only a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.
Computer Vision (计算机视觉)
101
cs.CV / 1 / 2603.04405

Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology

Arora, Ekansh
Abstract
Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP's failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.
cs.CV / 2 / 2603.04509

Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living

Hashemifard, Kooshan, Climent-Pérez, Pau, Florez-Revuelta, Francisco
Abstract
Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.
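The fusion step described above can be sketched as single-head cross-attention in which clip features query object-context features. Learned projection matrices are omitted and both streams are assumed to share one embedding size (a structural sketch, not the trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_feats, object_feats):
    """Single-head cross-attention: 3D-CNN clip features act as queries over
    object-detection context features (keys and values), returning features
    enriched with scene context. Learned Q/K/V projections are omitted."""
    d = video_feats.shape[-1]
    scores = video_feats @ object_feats.T / np.sqrt(d)   # (n_clips, n_objects)
    attn = softmax(scores, axis=-1)                      # each clip's attention over objects
    return attn @ object_feats, attn

clip_f = np.random.default_rng(0).normal(size=(4, 32))   # 4 clip tokens
obj_f = np.random.default_rng(1).normal(size=(6, 32))    # 6 detected-object tokens
fused, attn = cross_attention(clip_f, obj_f)
```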
cs.CV / 3 / 2603.04538

InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities

Yang, Chengshuai, Yuan, Xin
Abstract
State-of-the-art EfficientSCI loses 20.58 dB when its assumed forward operator deviates from physical reality in just eight parameters, yet no existing benchmark quantifies operator mismatch, the default condition in deployed compressive imaging systems. We introduce InverseNet, the first cross-modality benchmark for operator mismatch, spanning CASSI, CACTI, and single-pixel cameras. Evaluating 12 methods under a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration) across 27 simulated scenes and 9 real hardware captures, we find: (1) deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines; (2) performance and robustness are inversely correlated across modalities (Spearman r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover 0% of mismatch losses regardless of calibration quality, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration recovers 85-100% of the oracle bound without ground truth. Real hardware experiments confirm that simulation trends transfer to physical data. Code will be released upon acceptance.
cs.CV / 4 / 2603.04562

Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data

Thomas, Ancymol, Sreevalsan-Nair, Jaya
Abstract
Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion
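Decision-level fusion such as FM4 reduces to a weighted combination of per-branch class posteriors. A minimal sketch (the branch weights and class probabilities are illustrative):

```python
import numpy as np

def decision_level_fusion(prob_maps, weights):
    """Weighted decision-level fusion: combine per-modality class-probability
    vectors with scalar branch weights, then renormalize so the result is a
    valid posterior over classes."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    combined = sum(wi * p for wi, p in zip(w, prob_maps))
    return combined / combined.sum()

p_sar = np.array([0.6, 0.3, 0.1])   # SAR-branch posterior over 3 LCZ classes
p_msi = np.array([0.2, 0.7, 0.1])   # MSI-branch posterior
fused = decision_level_fusion([p_sar, p_msi], weights=[0.4, 0.6])
```

Here the branch weighted more heavily (MSI) flips the fused prediction to its preferred class, which is the behavior the weighting is meant to control.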
cs.CV / 5 / 2603.04565

Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion

Xu, Xuan, Prasanna, Prateek
Abstract
Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.
cs.CV / 6 / 2603.04568

Mask-aware inference with State-Space Models

Mas, Ignasi, Morros, Ramon, Hidalgo, Javier-Ruiz, Huerta, Ivan
Abstract
Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.
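The partial-convolution mechanism referenced above can be sketched directly: the window response is computed over valid pixels only and re-scaled by the fraction of the window that was valid (a naive loop implementation for clarity):

```python
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Partial convolution in the style of Liu et al.: invalid pixels (mask == 0)
    are excluded from each window sum, and the response is re-normalized by
    kh*kw / (number of valid pixels). This is the mask-aware mechanism that
    PVM ports from CNNs to the Mamba backbone."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    new_mask = np.zeros((oh, ow))
    xm = x * mask                                  # zero out invalid pixels once
    for i in range(oh):
        for j in range(ow):
            valid = mask[i:i + kh, j:j + kw].sum()
            if valid > 0:
                out[i, j] = (xm[i:i + kh, j:j + kw] * kernel).sum() * (kh * kw / valid)
                new_mask[i, j] = 1.0               # window saw at least one valid pixel
    return out, new_mask
```

With an all-ones mask this reduces to ordinary cross-correlation, and masked pixels neither contribute to nor bias the output; the updated mask propagates validity to the next layer.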
cs.CV / 7 / 2603.04598

PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

Mahadev, Rohan, Yuan, Joyce, Poirson, Patrick, Xue, David, Wu, Hao-Yu, Kislyuk, Dmitry
Abstract
Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query) (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks. The best method, while achieving an mAP@10 of 28.5%, still retrieves irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
cs.CV / 8 / 2603.04614

SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D

Wang, Zirui, Liu, Ruiping, Chen, Yufan, Zheng, Junwei, Fan, Weijia, Peng, Kunyu, Wen, Di, Wei, Jiale, Zhang, Jiaming, Stiefelhagen, Rainer
Abstract
3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models. Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.
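A ColPali-style retrieval score with the weighted patch-level selection described above might look as follows (the weighting scheme here is a simple stand-in for the paper's mechanism):

```python
import numpy as np

def weighted_patch_similarity(query_emb, doc_emb, patch_weights):
    """ColPali-style late interaction with per-patch weights: each query patch
    takes its best cosine match among document patches, and down-weighted
    patches (e.g. blurry or semantically uninformative regions) contribute
    less to the final retrieval score."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T                       # cosine similarities, (n_query, n_doc)
    w = np.asarray(patch_weights, dtype=float)
    w = w / w.sum()
    return float((w * sims.max(axis=1)).sum())
```

Setting a patch's weight near zero effectively removes it from the score, which is how uninformative regions are prevented from dragging retrieval toward the wrong scene graph.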
cs.CV / 9 / 2603.04638

Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI

Khole, Prathamesh Pradeep, Brenes, Mario M., Petiwala, Zahra Kais, Mirafzali, Ehsan, Gupta, Utkarsh, Li, Jing-Rebecca, Ianus, Andrada, Marinescu, Razvan
Abstract
Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.
cs.CV / 10 / 2603.04673

sFRC for assessing hallucinations in medical image restoration

Kc, Prabhat, Zeng, Rongping, Soni, Nirmal, Badano, Aldo
Abstract
Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.
Chinese Translation
深度学习(DL)方法目前正在被探索用于从稀疏视图、有限数据和欠采样的医学成像获取中恢复图像。尽管基于可接受性/主观标准(如较少噪声、平滑特征)的 DL 输出在视觉上可能显得令人满意,但它们也可能存在幻觉问题。缺乏易用的技术和稳健的指标来识别 DL 输出中的幻觉,使得这一问题更加严重。在本研究中,我们提出对小块进行傅里叶环相关(FRC)分析,并同时(s)扫描 DL 输出及其参考对应物,以检测幻觉(称为 sFRC)。我们描述了 sFRC 背后的理论基础,并提供其数学公式。sFRC 的关键参数可以使用主题专家标注的预定义幻觉特征或基于成像理论的幻觉图进行设置。我们使用 sFRC 检测三种欠采样医学成像问题中的幻觉:CT 超分辨率、CT 稀疏视图和 MRI 子采样恢复。在测试阶段,我们展示了 sFRC 在 CT 问题中检测幻觉特征的有效性,以及 sFRC 在 MR 问题中与基于成像理论的输出在幻觉特征图上的一致性。最后,我们量化了 DL 方法在分布内与分布外数据上的幻觉率,并在增加的子采样率下表征 DL 方法的稳健性。除了基于 DL 的方法,sFRC 在检测传统正则化恢复方法和最先进的展开方法中的幻觉有效性也得到了展示。
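The core computation behind sFRC is standard Fourier Ring Correlation applied patch-wise. The sketch below illustrates that idea in NumPy: correlate each patch of a restored image with the matching reference patch over rings in Fourier space, and flag patches whose mid-frequency correlation collapses. The patch size, frequency band, and threshold here are illustrative placeholders, not the paper's calibrated settings.

```python
import numpy as np

def frc(a, b):
    """Fourier Ring Correlation between two equally sized 2D patches.

    Returns one correlation value per integer spatial-frequency ring.
    Sketch of the idea behind sFRC; the paper's exact normalisation,
    windowing and threshold rules may differ.
    """
    A = np.fft.fftshift(np.fft.fft2(a))
    B = np.fft.fftshift(np.fft.fft2(b))
    h, w = a.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    nr = r.max() + 1
    num = np.bincount(r.ravel(), (A * np.conj(B)).real.ravel(), nr)
    den = np.sqrt(np.bincount(r.ravel(), np.abs(A).ravel() ** 2, nr)
                  * np.bincount(r.ravel(), np.abs(B).ravel() ** 2, nr))
    return num / np.maximum(den, 1e-12)

def sfrc_map(output, reference, patch=16, thresh=0.5):
    """Scan patch-wise FRC across the image; a patch is flagged as
    suspect (a possible hallucination site) when its mean mid-band
    correlation with the reference falls below thresh."""
    h, w = output.shape
    flags = np.zeros((h // patch, w // patch), dtype=bool)
    for i in range(h // patch):
        for j in range(w // patch):
            sl = np.s_[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            c = frc(output[sl], reference[sl])
            mid = c[2: patch // 2]   # ignore DC and near-Nyquist rings
            flags[i, j] = mid.mean() < thresh
    return flags
```

An output identical to its reference yields FRC of 1 on every ring and no flags, while patches of unrelated content decorrelate at mid frequencies and get flagged.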
cs.CV / 11 / 2603.04676

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

解码多图像理解任务中推理视觉语言模型的脉动
Li, Chenjun
Abstract
Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).
Chinese Translation
多图像推理仍然是视觉语言模型(VLMs)面临的重要挑战。我们研究了一个此前被忽视的现象:在思维链(CoT)生成过程中,推理VLMs的文本到图像(T2I)注意力表现出弥散的“脉动”:偶发且不聚焦的注意力模式,无法集中在与任务相关的图像上。我们进一步揭示了注意力分配在图像间存在系统性的位置偏差。基于这些观察,我们提出了PulseFocus,这是一种免训练的推理时方法,将CoT推理结构化为交错的计划/聚焦块,并采用软注意力门控。通过强制模型显式规划要检查的图像,然后在解码时将注意力门控至所引用的图像,PulseFocus使注意力更加集中,并在多图像基准测试(如BLINK基准测试+3.7%和MuirBench+1.07%)上取得了一致的改进。
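The gating step can be illustrated on a toy attention vector: suppress the attention mass on image tokens outside the planned image, then renormalise. This is a minimal sketch of soft attention gating under assumed inputs; the paper's block-structured plan/focus CoT and its actual gate schedule are not reproduced here.

```python
import numpy as np

def gate_attention(attn, token_image_id, focus_image, gate=0.8):
    """Soft-gate decode-time text-to-image attention toward the image
    the model's plan says to examine.

    attn:           (T,) attention weights over image tokens, sums to 1
    token_image_id: (T,) which input image each token belongs to
    focus_image:    index of the planned image
    gate:           fraction by which off-focus attention is suppressed
    """
    w = attn * np.where(token_image_id == focus_image, 1.0, 1.0 - gate)
    return w / w.sum()   # renormalise to a valid distribution
```

With `gate=0`, the distribution is unchanged; as `gate` approaches 1, attention concentrates entirely on the planned image's tokens.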
cs.CV / 12 / 2603.04720

A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification

高光谱图像分类的神经网络压缩方法基准研究
Shi, Sai
Abstract
Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.
Chinese Translation
深度神经网络在图像分类任务中表现出色,因其能够从高维数据中学习复杂模式。然而,其巨大的计算和内存需求常常限制了在资源受限的平台(如遥感设备和边缘系统)上的部署。因此,提出了网络压缩技术,以在保持预测性能的同时减少模型大小和计算成本。在本研究中,我们对用于遥感应用的神经网络压缩方法进行了系统评估,特别是高光谱土地覆盖分类。具体而言,我们考察了三种广泛使用的卷积神经网络压缩策略:剪枝、量化和知识蒸馏。我们在两个基准高光谱数据集上进行了实验,考虑了分类准确性、内存消耗和推理效率。我们的结果表明,压缩模型可以显著减少模型大小和计算成本,同时保持竞争力的分类性能。这些发现为压缩比、效率和准确性之间的权衡提供了深入见解,并突显了压缩技术在促进遥感应用中高效深度学习部署的潜力。
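Two of the three compression strategies the study benchmarks can be sketched directly: unstructured magnitude pruning and symmetric uniform post-training quantization. The fraction pruned and bit-width below are illustrative defaults, not the paper's experimental settings.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Unstructured magnitude pruning: zero out the smallest-magnitude
    fraction of the weights, keeping the array shape intact."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

def uniform_quantize(w, bits=8):
    """Symmetric uniform quantisation: map weights onto a signed
    integer grid; return the codes and the scale for dequantisation."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q, scale
```

Dequantising with `q * scale` bounds the per-weight error by half the scale, which is the trade-off between bit-width (memory) and accuracy the study measures.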
cs.CV / 13 / 2603.04727

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

多模态大语言模型准备好用于监控了吗?对真实场景下零样本异常检测的现实检验
Yao, Shanle, Pazho, Armin Danesh, Rashvand, Narges, Tabkhi, Hamed
Abstract
Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
Chinese Translation
多模态大语言模型(MLLMs)在视频理解方面展示了令人印象深刻的通用能力,但它们在真实世界视频异常检测(VAD)中的可靠性在很大程度上仍未得到探索。与依赖重建或姿态线索的传统流程不同,MLLMs 使得一种范式转变成为可能:将异常检测视为一种语言引导的推理任务。在本研究中,我们通过将 VAD 重新表述为在弱时间监督下的二分类任务,系统地评估了在上海科技(ShanghaiTech)和 CHAD 基准上的最先进 MLLMs。我们研究了提示的特异性和时间窗口长度(1秒至3秒)如何影响性能,重点关注精确度与召回率的权衡。我们的发现揭示了在零样本设置中显著的保守偏见;尽管模型表现出高置信度,但它们不成比例地偏向于“正常”类别,导致高精度但召回率崩溃,从而限制了实际效用。我们证明了类别特定的指令可以显著改变这一决策边界,将 ShanghaiTech 上的峰值 F1 分数从 0.09 提高到 0.64,但召回率仍然是一个关键瓶颈。这些结果突显了 MLLMs 在噪声环境中的显著性能差距,并为未来在开放世界监控中进行以召回为导向的提示和模型校准的研究奠定了基础,这需要复杂的视频理解和推理能力。
cs.CV / 14 / 2603.04733

FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation

FOZO:用于测试时适应的仅前向零阶提示优化
Wang, Xingyu, Wang, Tao
Abstract
Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
Chinese Translation
测试时适应(TTA)对于使深度学习模型能够处理现实世界的数据分布变化至关重要。然而,目前的方法面临显著的局限性:基于反向传播的方法由于其高计算和内存需求,以及在适应过程中修改模型权重的倾向,不适合低端部署设备;而传统的无反向传播技术则表现出有限的适应能力。在本研究中,我们提出了仅前向零阶优化(FOZO),这是一种新颖且实用的无反向传播TTA范式。FOZO利用一种内存高效的零阶提示优化,其优化目标同时涉及中间特征统计和预测熵。为了确保在分布外数据流上的高效和稳定适应,我们在零阶梯度估计过程中引入了动态衰减的扰动尺度,并在TTA数据流假设下从理论上证明了其收敛性。在ImageNet-C、ImageNet-R和ImageNet-Sketch上的大量持续适应实验表明,FOZO性能优越,在ImageNet-C(5K,级别5)上达到了59.52%的Top-1准确率,超越了主要的基于梯度的方法以及最新的仅前向方法FOA(58.13%)。此外,FOZO在量化(INT8)模型上表现出强大的泛化能力。这些发现表明,FOZO是资源有限场景下TTA部署的一个极具竞争力的解决方案。
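The forward-only update at the heart of such methods can be sketched with a two-point (antithetic) zeroth-order gradient estimator whose perturbation scale decays over steps, loosely mirroring FOZO's dynamically decaying schedule. The learning rate, initial scale, and decay below are illustrative; the paper's exact estimator and objective are not reproduced.

```python
import numpy as np

def zo_step(loss_fn, prompt, step, lr=0.05, mu0=0.1, decay=0.99, rng=None):
    """One forward-only update of a prompt vector.

    The gradient is estimated from two forward passes along a random
    probe direction; no backpropagation is needed, so only activations
    of the forward pass are ever computed.
    """
    if rng is None:
        rng = np.random.default_rng()
    mu = mu0 * decay ** step                 # decaying perturbation scale
    u = rng.normal(size=prompt.shape)        # random probe direction
    g = (loss_fn(prompt + mu * u) - loss_fn(prompt - mu * u)) / (2 * mu) * u
    return prompt - lr * g
```

On a simple quadratic objective, repeated calls drive the loss toward its minimum using forward evaluations only, which is what makes such updates attractive on memory-constrained devices.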
cs.CV / 15 / 2603.04745

Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset

面向真实世界红外图像超分辨率:统一自回归框架与基准数据集
Zou, Yang, Ma, Jun, Jiao, Zhidong, Li, Xingyuan, Jiang, Zhiying, Liu, Jinyuan
Abstract
Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking. The dataset and code are available at: https://github.com/JZD151/Real-IISR.
Chinese Translation
在真实世界条件下,红外图像超分辨率(IISR)是一项具有实际意义但鲜有研究的任务。先驱性工作通常在模拟数据集上进行训练和评估,或忽视红外成像与可见光成像之间的内在差异。然而,在实际应用中,真实的红外图像受到光学退化与传感退化的耦合影响,二者共同劣化了结构清晰度和热保真度。为了解决这些挑战,我们提出了Real-IISR,这是一个针对真实世界IISR的统一自回归框架,通过热-结构引导的视觉自回归,以逐尺度的方式渐进重建细致的热结构和清晰的背景。具体而言,热-结构引导模块编码热先验,以减轻热辐射与结构边缘之间的不匹配。由于非均匀退化通常会引起量化偏差,Real-IISR采用条件自适应码本,根据退化感知的热先验动态调节离散表示。此外,热序一致性损失强制温度与像素强度之间保持单调关系,确保相对亮度顺序而非绝对值,以在空间错位和热漂移下保持物理一致性。我们构建了FLIR-IISR,这是一个真实世界的IISR数据集,包含通过自动聚焦变化和运动引起的模糊获取的配对LR-HR红外图像。大量实验表明Real-IISR性能良好,为真实世界IISR及其基准测试提供了统一的基础。数据集和代码可在以下网址获取:https://github.com/JZD151/Real-IISR。
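A loss enforcing relative order rather than absolute values is naturally expressed as a pairwise ranking penalty: whenever pixel i is hotter than pixel j, its predicted intensity should not be lower. The hinge over sampled pixel pairs below is one plausible reading of the Thermal Order Consistency Loss described in the abstract, not the paper's exact formulation; the pair count and margin are illustrative.

```python
import numpy as np

def thermal_order_loss(pred_intensity, temperature,
                       margin=0.0, n_pairs=512, rng=None):
    """Pairwise monotonicity penalty between predicted pixel intensity
    and ground-truth temperature, on randomly sampled pixel pairs."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = pred_intensity.ravel()
    t = temperature.ravel()
    i = rng.integers(0, p.size, n_pairs)
    j = rng.integers(0, p.size, n_pairs)
    sign = np.sign(t[i] - t[j])      # +1 where pixel i is hotter
    # hinge: penalise intensity order that contradicts temperature order
    return np.maximum(margin - sign * (p[i] - p[j]), 0.0).mean()
```

Any monotone transform of the temperature map incurs zero loss, so the penalty is invariant to absolute brightness, which is the stated robustness to spatial misalignment and thermal drift.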
cs.CV / 16 / 2603.04763

Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary

评估GPT-5作为多模态临床推理者:一篇全景式评述
Florea, Alexandru, Wang, Shansong, Hu, Mingzhe, Li, Qiang, Eidex, Zach, del Balzo, Luke, Safari, Mojtaba, Yang, Xiaofeng
Abstract
The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5's 52-64%. These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician's cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.
Chinese Translation
从任务特定的人工智能向通用基础模型的转变引发了关于其支持临床医学中所需的综合推理能力的根本性问题,在临床医学中,诊断需要综合模糊的患者叙述、实验室数据和多模态影像。本文提供了对GPT-5系列(GPT-5、GPT-5 Mini、GPT-5 Nano)与其前身GPT-4o在多种临床基础任务中的首次受控横断面评估,包括医学教育考试、基于文本的推理基准,以及在神经放射学、数字病理学和乳腺摄影中的视觉问答,采用标准化的零样本思维链协议。GPT-5在专家级文本推理方面表现出显著提升,在MedXpertQA上的绝对改进超过25个百分点。在多模态综合任务中,GPT-5有效利用了这一增强的推理能力,将不确定的临床叙述与具体的影像证据结合起来,在大多数视觉问答基准中实现了最先进或具有竞争力的表现,并在需要细致病变特征描述的乳腺摄影任务中超越了GPT-4o,提升幅度在10-40%之间。然而,在神经放射学中的表现仍然适中(宏观平均准确率为44%),并且在乳腺摄影中落后于领域特定模型,后者的准确率超过80%,而GPT-5的准确率为52-64%。这些发现表明,尽管GPT-5朝向综合多模态临床推理迈出了有意义的一步,呼应了临床医生以客观发现来校准不确定信息的认知过程,但在高度专业化、以感知为关键的任务中,通用模型尚不能替代专门构建的系统。
cs.CV / 17 / 2603.04766

Evaluating and Correcting Human Annotation Bias in Dynamic Micro-Expression Recognition

评估与纠正动态微表情识别中的人类标注偏差
Liu, Feng, Nan, Bingyu, Qian, Xuezhong, Fu, Xiaolan
Abstract
Existing manual labeling of micro-expressions is subject to errors in accuracy, especially in cross-cultural scenarios where deviation in labeling of key frames is more prominent. To address this issue, this paper presents a novel Global Anti-Monotonic Differential Selection Strategy (GAMDSS) architecture for enhancing the effectiveness of spatio-temporal modeling of micro-expressions through keyframe re-selection. Specifically, the method identifies Onset and Apex frames, which are characterized by significant micro-expression variation, from complete micro-expression action sequences via a dynamic frame reselection mechanism. It then uses these to determine Offset frames and construct a rich spatio-temporal dynamic representation. A two-branch structure with shared parameters is then used to efficiently extract spatio-temporal features. Extensive experiments are conducted on seven widely recognized micro-expression datasets. The results demonstrate that GAMDSS effectively reduces subjective errors caused by human factors in multicultural datasets such as SAMM and 4DME. Furthermore, quantitative analyses confirm that offset-frame annotations in multicultural datasets are more uncertain, providing theoretical justification for standardizing micro-expression annotations. These findings directly support our argument for reconsidering the validity and generalizability of dataset annotation paradigms. Notably, this design can be integrated into existing models without increasing the number of parameters, offering a new approach to enhancing micro-expression recognition performance. The source code is available on GitHub[https://github.com/Cross-Innovation-Lab/GAMDSS].
Chinese Translation
现有的微表情手动标注存在准确性错误,尤其是在跨文化场景中,关键帧标注的偏差更加明显。为了解决这个问题,本文提出了一种新颖的全局反单调差异选择策略(Global Anti-Monotonic Differential Selection Strategy, GAMDSS)架构,通过关键帧重新选择来增强微表情的时空建模效果。具体而言,该方法通过动态帧重新选择机制,从完整的微表情动作序列中识别出显著变化的起始(Onset)和顶峰(Apex)帧。然后利用这些帧确定结束(Offset)帧,并构建丰富的时空动态表示。接着,采用共享参数的双分支结构高效提取时空特征。在七个广泛认可的微表情数据集上进行了大量实验。结果表明,GAMDSS有效减少了在多文化数据集中由人类因素引起的主观错误,如SAMM和4DME。此外,定量分析确认多文化数据集中的结束帧标注更具不确定性,为微表情标注的标准化提供了理论依据。这些发现直接支持我们重新考虑数据集标注范式的有效性和普遍性的论点。值得注意的是,该设计可以无缝集成到现有模型中,而不会增加参数数量,为提升微表情识别性能提供了一种新方法。源代码可在GitHub上获取[https://github.com/Cross-Innovation-Lab/GAMDSS]。
cs.CV / 18 / 2603.04770

DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction

DSA-SRGS:动态稀疏视图 DSA 重建的超分辨率高斯点云技术
Zhang, Shiyu, Wu, Zhicong, Zhao, Huangxuan, Liu, Zhentao, Chen, Lei, Luo, Yong, Zhang, Lefei, Cui, Zhiming, Ke, Ziwen, Du, Bo
Abstract
Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model, into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative gaussian kernels. Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.
Chinese Translation
数字减影血管造影(DSA)是一种关键的成像技术,用于辅助诊断和治疗脑血管疾病。最近在高斯点云和动态神经表示方面的进展,使得从稀疏动态输入中实现稳健的三维血管重建成为可能。然而,这些方法在本质上受到输入投影分辨率的限制,简单的上采样以提高渲染分辨率不可避免地会导致严重的模糊和混叠伪影。这种缺乏超分辨率能力的情况使得重建的四维模型无法恢复细致的血管细节和复杂的分支结构,从而限制了其在精准诊断和治疗中的应用。为了解决这个问题,本文提出了 DSA-SRGS,这是第一个用于动态稀疏视图 DSA 重建的超分辨率高斯点云框架。具体而言,我们引入了一个多保真度纹理学习模块,该模块将来自经过微调的 DSA 特定超分辨率模型的高质量先验整合到四维重建优化中。为了减轻伪标签可能产生的幻觉伪影,该模块采用了一种基于置信度的策略,自适应地加权原始低分辨率投影与生成的高分辨率伪标签之间的监督信号。此外,我们开发了辐射子像素密集化,这是一种自适应策略,利用来自高分辨率子像素采样的梯度累积来细化四维辐射高斯核。在两个临床 DSA 数据集上的大量实验表明,DSA-SRGS 在定量指标和定性视觉保真度方面显著优于现有的最先进方法。
cs.CV / 19 / 2603.04771

MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement

MADCrowner:基于边缘感知的牙冠设计与模板变形和精细化
Wei, Linda, Liu, Chang, Zhang, Wenran, Hu, Yuxuan, Li, Ruiyang, Qi, Feng, Tian, Changyao, Wang, Ke, Wang, Yuanyuan, Zhang, Shaoting, Metaxas, Dimitris, Li, Hongsheng
Abstract
Dental crown restoration is one of the most common treatment modalities for tooth defects, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinical workflow. Recent studies have explored the application of learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches were challenged by inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction. To address these limitations, we propose MADCrowner, a margin-aware mesh generation framework comprising CrownDeformR and CrownSegger. Inspired by the clinical manual workflow of dental crown design, we designed CrownDeformR to deform an initial template to the target crown based on anatomical context, which is extracted by a multi-scale intraoral scan encoder. Additionally, we introduced CrownSegger, a novel margin segmentation network, to extract the cervical margin of the target tooth. The performance of CrownDeformR improved with the cervical margin as an extra constraint. The margin was also utilized as the boundary condition for the tailored postprocessing method, which removed the overextended area of the reconstructed surface. We constructed a large-scale intraoral scan dataset and performed extensive experiments. The proposed method significantly outperformed existing approaches in both geometric accuracy and clinical feasibility.
Chinese Translation
牙冠修复是牙齿缺损最常见的治疗方式之一,个性化的牙冠设计至关重要。尽管计算机辅助设计(CAD)系统显著提高了牙冠设计的效率,但在临床工作流程中仍需进行大量手动调整。近期研究探讨了基于学习的方法在自动生成修复性牙冠中的应用。然而,这些方法面临着空间分辨率不足、输出噪声和表面重建过度扩展等挑战。为了解决这些局限性,我们提出了MADCrowner,一个边缘感知的网格生成框架,包括CrownDeformR和CrownSegger。受到牙冠设计临床手动工作流程的启发,我们设计了CrownDeformR,以根据解剖上下文将初始模板变形为目标牙冠,该上下文由多尺度口内扫描编码器提取。此外,我们引入了CrownSegger,一个新颖的边缘分割网络,用于提取目标牙齿的颈缘。CrownDeformR的性能在颈缘作为额外约束的情况下得到了提升。同时,颈缘也被用作定制后处理方法的边界条件,以去除重建表面过度扩展的区域。我们构建了一个大规模的口内扫描数据集并进行了广泛的实验。所提出的方法在几何精度和临床可行性方面显著优于现有方法。
cs.CV / 20 / 2603.04775

Privacy-Aware Camera 2.0 Technical Report

隐私意识相机 2.0 技术报告
Song, Huan, Tian, Shuyu, Long, Ting, Liu, Jiang, Yuan, Cheng, Jia, Zhenyu, Shao, Jiawei, Li, Xuelong
Abstract
With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a "dynamic contour" visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.
Chinese Translation
随着智能感知技术在卫生间和更衣室等高度敏感环境中的广泛部署,视觉监控系统面临着深刻的隐私与安全悖论。现有的隐私保护方法,包括物理脱敏、加密和模糊处理,往往会损害语义理解,或无法确保数学上可证明的不可逆性。尽管隐私相机 1.0 在源头消除视觉数据以防止泄露,但它仅提供文本判断,导致在争议中出现证据盲点。为了解决这些局限性,本文提出了一种基于 AI Flow 范式和边缘-云协同架构的新型隐私保护感知框架。通过在边缘部署视觉脱敏器,原始图像依据信息瓶颈原则,经由非线性映射和随机噪声注入被实时转化为抽象特征向量,确保身份敏感信息被剥离,且原始图像在数学上不可重构。抽象表示被传输到云端,借助“动态轮廓”视觉语言进行行为识别和语义重构,在感知与隐私之间取得关键平衡,同时在不暴露原始图像的情况下提供说明性的视觉参考。
cs.CV / 21 / 2603.04793

RMK RetinaNet: Rotated Multi-Kernel RetinaNet for Robust Oriented Object Detection in Remote Sensing Imagery

RMK RetinaNet:用于遥感影像中稳健定向目标检测的旋转多核RetinaNet
Sun, Huiran
Abstract
Rotated object detection in remote sensing imagery is hindered by three major bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. To address these issues, we propose Rotated Multi-Kernel RetinaNet (RMK RetinaNet). First, we design a Multi-Scale Kernel (MSK) Block to strengthen adaptive multi-scale feature extraction. Second, we incorporate a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism into the feature pyramid to enhance contextual modeling across scales and orientations. Third, we introduce a Bottom-up Path to preserve fine-grained spatial details that are often degraded during downsampling. Finally, we develop an Euler Angle Encoding Module (EAEM) to enable continuous and stable angle regression. Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.
Chinese Translation
遥感影像中的旋转目标检测受到三个主要瓶颈的限制:非自适应的感受野利用、不充分的长距离多尺度特征融合以及角度回归中的不连续性。为了解决这些问题,我们提出了旋转多核RetinaNet(RMK RetinaNet)。首先,我们设计了一个多尺度核(MSK)模块,以增强自适应多尺度特征提取。其次,我们将多方向上下文锚点注意力(MDCAA)机制融入特征金字塔,以增强跨尺度和方向的上下文建模。第三,我们引入了自下而上的路径,以保留在下采样过程中常常退化的细粒度空间细节。最后,我们开发了一个欧拉角编码模块(EAEM),以实现连续和稳定的角度回归。在DOTA-v1.0、HRSC2016和UCAS-AOD上的大量实验表明,RMK RetinaNet在多尺度和多方向场景中提高了稳健性,同时其性能与最先进的旋转目标检测器相当。
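The angle-regression discontinuity the abstract mentions is the classic periodicity problem: a box at angle π−ε and one at −π+ε are almost identical, yet their raw angle targets differ by nearly 2π. The standard remedy, encoding the angle as a (cos, sin) pair, is sketched below as one plausible reading of what an Euler-angle encoding module addresses; the paper's actual EAEM design is not reproduced.

```python
import numpy as np

def encode_angle(theta):
    """Encode an orientation as (cos, sin) so the regression target is
    continuous across the periodic boundary."""
    return np.cos(theta), np.sin(theta)

def decode_angle(c, s):
    """Recover the angle in (-pi, pi] from its encoding."""
    return np.arctan2(s, c)
```

Near-identical orientations now map to near-identical targets, so a smooth regressor is no longer penalised at the ±π wrap-around.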
cs.CV / 22 / 2603.04795

LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation

LAW & ORDER:医学扩散与分割的自适应空间加权
Naman, Anugunj, Singh, Ayushman, Zhang, Gaibo, Zhang, Yaguang
Abstract
Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.
Chinese Translation
医学图像分析依赖于准确的分割,并受益于可控的合成(新训练图像)。然而,循环管道中的这两个任务面临空间不平衡的问题:病灶在广阔背景中占据小区域。特别是,扩散模型已被证明会偏离规定的病灶布局,而高效的分割器在空间不确定区域上表现不佳。自适应空间加权通过学习如何分配计算资源来解决这一问题。本文介绍了一对网络适配器:1)可学习自适应加权器(Learnable Adaptive Weighter, LAW),该加权器从特征和掩模中预测每像素的损失调制,通过归一化、限制和正则化的混合来稳定,以防止退化解;2)高效分辨率的最优区域检测(Optimal Region Detection with Efficient Resolution, ORDER),该方法在解码器的后期阶段应用选择性双向跳跃注意力,以实现高效分割。在息肉和肾肿瘤数据集上的实验表明,LAW在均匀基线(52.28 vs. 65.60)上实现了20%的FID生成改进,合成数据随后使下游分割提高了4.9%的Dice系数(83.2% vs. 78.3%)。ORDER在MK-UNet上实现了6.0%的Dice改进(81.3% vs. 75.3%),其计算量为0.56 GFLOPs,仅有42K参数,仍比标准nnUNet小730倍。
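The stabilisation recipe LAW describes (normalisation, clamping, regularisation) can be sketched as a per-pixel loss modulation step. This is a minimal illustration under assumed defaults: the weight-predicting network is omitted, `raw_weights` stands in for its output, and the clamp range and regulariser strength are placeholders rather than the paper's values.

```python
import numpy as np

def modulate_loss(per_pixel_loss, raw_weights, wmin=0.5, wmax=2.0, reg=1e-2):
    """Apply learned per-pixel weights to a diffusion loss map.

    Weights are made positive, normalised to unit mean, and clamped so
    the model cannot collapse to a degenerate 'ignore hard pixels'
    solution; a small penalty further pulls weights toward uniform.
    """
    w = np.exp(raw_weights)                  # positivity
    w = w / w.mean()                         # unit-mean normalisation
    w = np.clip(w, wmin, wmax)               # clamp extreme weights
    weighted = (w * per_pixel_loss).mean()   # modulated training loss
    penalty = reg * ((w - 1.0) ** 2).mean()  # regularise toward uniform
    return weighted + penalty
```

With uniform raw weights the modulated loss reduces to the plain mean; weights that up-rank high-loss (lesion) pixels raise it, which is how extra capacity is steered to small foreground regions.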
cs.CV / 23 / 2603.04796

Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper

传统方法与深度学习在脑胶质瘤影像学中的比较评估:综述
Janardhan, Kiranmayee, Prabhu, Vinay Martin DSa, Bobby, T. Christy
Abstract
Segmentation is crucial for brain gliomas as it delineates the glioma s extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma s size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques post magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.
Chinese Translation
分割对于脑胶质瘤至关重要,因为它勾勒出胶质瘤的范围和位置,有助于精确的治疗规划和监测,从而改善患者的预后。准确的分割确保了对胶质瘤大小和位置的正确识别,将影像转化为可用于分析的数据。脑胶质瘤的分类同样重要,因为不同类型的胶质瘤需要不同的治疗方法。根据大小、位置和侵袭性准确分类脑胶质瘤对于个性化预后预测、后续护理和疾病进展监测至关重要,确保有效的诊断、治疗和管理。在胶质瘤研究中,常常可以观察到不规则的组织,但无误且可重复的分割却具有挑战性。许多研究者对脑胶质瘤的分割进行了调查,提出了全自动和半自动技术。这些方法的采用取决于放射科医生的使用便利性和监督需求,半自动技术因其对准确评估的需求而更受欢迎。本综述评估了磁共振成像获取后有效的分割和分类技术,强调卷积神经网络架构在这些任务中优于传统技术。
cs.CV / 24 / 2603.04800

MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models

MASQuant:面向模态的平滑量化用于多模态大型语言模型
Hu, Lulu, Xiao, Wenhu, Chen, Xin, Xu, Xinhua, Xu, Bowen, Li, Kun, Tao, Yongliang
Abstract
Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
Chinese Translation
具有计算不变性的后训练量化(PTQ)在大型语言模型(LLMs)上已取得显著进展,然而其在多模态大型语言模型(MLLMs)中的应用面临重大挑战。本文以SmoothQuant为案例进行分析,并识别出两个关键问题:平滑不对齐和跨模态计算不变性。为了解决这些问题,我们提出了面向模态的平滑量化(MASQuant),这是一个新颖的框架,引入了(1)面向模态的平滑(MAS),它学习独立的、特定于模态的平滑因子,以防止平滑不对齐,以及(2)跨模态补偿(CMC),通过使用奇异值分解(SVD)白化将多模态激活差异转化为低秩形式,从而解决跨模态计算不变性问题,实现跨模态的统一量化。MASQuant在双模态和三模态MLLMs上均展示了稳定的量化性能。实验结果表明,MASQuant在最先进的PTQ算法中具有竞争力。源代码:https://github.com/alibaba/EfficientAI。
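The starting point, SmoothQuant-style smoothing, computes a per-channel factor s = a^α / w^(1-α) from activation and weight statistics and folds it into the weights so X·W = (X/s)·(s·W). The sketch below illustrates the modality-aware variant in spirit: each modality gets its own factors from its own activation statistics. It is a toy under assumed shapes, not MASQuant's actual recipe (its learned factors and the CMC/SVD step are omitted), and it makes the misalignment concrete: each modality needs a differently folded weight, which a single shared weight matrix cannot provide.

```python
import numpy as np

def smoothing_factors(a_absmax, w_absmax, alpha=0.5):
    """SmoothQuant-style per-channel factor s = a^alpha / w^(1-alpha)."""
    s = a_absmax ** alpha / np.maximum(w_absmax, 1e-8) ** (1 - alpha)
    return np.maximum(s, 1e-8)

def modality_aware_smooth(X, W, modality_ids, alpha=0.5):
    """Per-modality smoothing of a linear layer y = X @ W.

    X:            (T, C_in) token activations
    W:            (C_in, C_out) weight matrix
    modality_ids: (T,) modality of each token, e.g. 0=text, 1=image
    Returns smoothed activations and a per-modality folded weight.
    """
    w_absmax = np.abs(W).max(axis=1)            # per input channel
    Xs = X.astype(float).copy()
    folded = {}
    for m in np.unique(modality_ids):
        rows = modality_ids == m
        s = smoothing_factors(np.abs(X[rows]).max(axis=0), w_absmax, alpha)
        Xs[rows] = X[rows] / s                  # smooth this modality only
        folded[m] = W * s[:, None]              # fold s into the weights
    return Xs, folded
```

Each modality's smoothed activations reproduce X @ W exactly only with that modality's own folded weights, which is the invariance-vs-misalignment tension the paper targets.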
cs.CV / 25 / 2603.04803

Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

利用对比信号引导基于扩散的重建以实现平衡的视觉表征
Han, Boyu, Xu, Qianqian, Bao, Shilong, Yang, Zhiyong, Cui, Ruochen, Zhao, Xilin, Huang, Qingming
Abstract
The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.
Chinese Translation
在对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)中,视觉编码器的理解能力有限已成为下游性能的关键瓶颈。这种能力包括区分能力(Discriminative Ability, D-Ability),反映类别可分性,以及细节感知能力(Detail Perceptual Ability, P-Ability),关注细粒度视觉线索。最近的解决方案使用扩散模型通过将图像重建与CLIP视觉标记相结合来增强表征。我们认为,这种范式可能会妨碍D-Ability,从而未能有效解决CLIP的表征局限性。为了解决这一问题,我们将对比信号整合到基于扩散的重建中,以追求更全面的视觉表征。我们从一个简单的设计开始,通过对输入图像进行对比学习来增强扩散过程。然而,实证结果表明,简单的组合存在梯度冲突,导致性能不佳。为了平衡优化,我们引入了扩散对比重建(Diffusion Contrastive Reconstruction, DCR),统一学习目标。关键思想是将来自每个重建图像的对比信号注入到扩散过程中,而不是来自原始输入。我们的理论分析表明,DCR损失可以共同优化D-Ability和P-Ability。在各种基准和多模态大型语言模型上的大量实验验证了我们方法的有效性。代码可在 https://github.com/boyuh/DCR 获取。
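The contrastive signal injected into the reconstruction objective is, in standard form, an InfoNCE loss over paired embeddings. The sketch below shows that standard term; in a DCR-style objective it would be computed on features of the *reconstructed* images (rather than the original inputs) and added to the diffusion loss. This is an illustration of the generic loss only, not the paper's exact formulation.

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE with row-aligned positive pairs.

    z_a, z_b: (N, D) embeddings; row i of z_a and row i of z_b form a
    positive pair, all other rows serve as negatives.
    """
    za = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    zb = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = za @ zb.T / tau                      # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                # positives on diagonal
```

Well-aligned pairs drive the loss toward zero, while mismatched pairs are heavily penalised, which is the class-separability (D-Ability) pressure the paper combines with reconstruction.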
cs.CV / 26 / 2603.04811

Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation

Meta-D:用于脑肿瘤分析和缺失模态分割的元数据感知架构
Kim, SangHyuk, Haehn, Daniel, Rampersad, Sumientra
Abstract
We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.
Chinese Translation
我们提出了Meta-D,这是一种明确利用分类扫描仪元数据(例如MRI序列和平面方向)来指导脑肿瘤分析特征提取的架构。我们的目标是通过整合明确的元数据来提高医学图像深度学习管道的性能,以稳定特征表示。我们首先在二维肿瘤检测中评估这一点,其中注入序列(例如,T1、T2)和平面(例如,轴向)元数据动态调节卷积特征,相较于仅使用图像的基线,F1-score的绝对提升可达2.62%。由于元数据在数据可用时为特征提取提供基础,我们假设它可以在数据缺失时作为一个稳健的锚点。我们将其应用于三维缺失模态肿瘤分割。我们的Transformer Maximizer利用基于元数据的交叉注意力来隔离和引导可用模态,确保网络关注有效切片。在极端模态稀缺的情况下,这种针对性的注意力使脑肿瘤分割的Dice分数提高了最多5.12%,同时减少了模型参数24.1%。
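The metadata-driven modulation of convolutional features described above resembles FiLM-style conditioning; the sketch below illustrates that general mechanism under this assumption (the weight names and exact modulation rule are hypothetical, not Meta-D's).

```python
import numpy as np

def film_modulate(features, meta_onehot, W_gamma, W_beta):
    """FiLM-style modulation: categorical scanner metadata (e.g. a
    multi-hot vector over {T1, T2, axial, sagittal}) is projected to
    per-channel scale/shift applied to a conv feature map.

    features:    (C, H, W) feature map
    meta_onehot: (M,) one-hot/multi-hot metadata vector
    W_gamma, W_beta: (M, C) learned projections (illustrative names)
    """
    gamma = meta_onehot @ W_gamma                # (C,) per-channel scale
    beta = meta_onehot @ W_beta                  # (C,) per-channel shift
    return features * (1.0 + gamma)[:, None, None] + beta[:, None, None]
```

With zero-initialized projections the modulation is the identity, so the metadata pathway can be added to a pretrained backbone without disturbing it at the start of training.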
cs.CV / 27 / 2603.04817

Revisiting Shape from Polarization in the Era of Vision Foundation Models

在视觉基础模型时代重新审视基于偏振的形状重建
Li, Chenhao, Ono, Taishi, Uemori, Takeshi, Moriuchi, Yusuke
Abstract
We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs. Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.
Chinese Translation
我们展示了,利用偏振线索,基于小型数据集训练的轻量级模型在单次拍摄的物体级表面法线估计中可以超越仅基于RGB的视觉基础模型(VFM)。基于偏振的形状重建(SfP)因其偏振与表面几何之间的强物理关系而长期受到研究。同时,受规模法则的驱动,基于大型数据集训练的仅RGB VFM最近取得了令人印象深刻的性能,并超越了现有的SfP方法。这种情况引发了关于偏振线索必要性的质疑,因为它们需要专用硬件且训练数据有限。我们认为,之前SfP方法性能较弱并非源于偏振模态本身,而是由于领域间的差距。这些领域间的差距主要来自两个方面。首先,现有的合成数据集使用有限且不现实的3D物体,几何形状简单且随机纹理图与基础形状不匹配。其次,现实世界中的偏振信号常常受到传感器噪声的影响,而这种噪声在训练过程中未得到良好建模。为了解决第一个问题,我们使用1954个3D扫描的真实世界物体渲染了一个高质量的偏振数据集。我们进一步结合预训练的DINOv3先验,以提高对未见物体的泛化能力。为了解决第二个问题,我们引入了偏振传感器感知的数据增强,更好地反映现实世界的条件。仅用40K训练场景,我们的方法显著超越了最先进的SfP方法和仅RGB的VFM。大量实验表明,偏振线索使得训练数据减少33倍或模型参数减少8倍,同时仍实现比仅RGB对应物更好的性能。
cs.CV / 28 / 2603.04825

Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

缓解实例依赖的部分标签学习中的实例纠缠
Zhao, Rui, Shi, Bin, Sun, Kai, Dong, Bo
Abstract
Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.
Chinese Translation
部分标签学习是一种重要的弱监督分类任务,其中每个训练实例都模糊地标记为一组候选标签。在现实场景中,候选标签通常受到实例特征的影响,从而导致实例依赖的部分标签学习(Instance-Dependent Partial Label Learning, ID-PLL)的出现,这种设置更准确地反映了这种关系。ID-PLL中的一个重大挑战是实例纠缠,即来自相似类别的实例共享重叠的特征和候选标签,导致类别混淆加剧。为了解决这个问题,我们提出了一种新颖的基于类别特定增强的解缠框架(Class-specific Augmentation based Disentanglement, CAD),该框架通过类内和类间的调节来应对实例纠缠。在类内调节方面,CAD放大类别特定特征以生成类别增强,并对同类增强进行对齐。在类间调节方面,CAD引入了一种加权惩罚损失函数,对更模糊的标签施加更强的惩罚,从而鼓励更大的类间距离。通过联合应用类内和类间调节,CAD提高了类别边界的清晰度,并减少了因纠缠造成的类别混淆。大量实验结果证明了CAD在缓解纠缠问题和提升ID-PLL性能方面的有效性。代码可在 https://github.com/RyanZhaoIc/CAD.git 获取。
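The inter-class regulation above rests on a penalty that hits more ambiguous candidate labels harder. A toy numpy version of that idea follows; the specific weighting rule (probability-proportional) and all names are illustrative guesses, not CAD's exact loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def weighted_penalty_loss(logits, candidate_mask):
    """Candidate labels other than the current best candidate are
    treated as 'ambiguous'; each is pushed down with a weight
    proportional to its own predicted probability, so the most
    confusing labels receive the strongest penalty.

    logits:         (N, K) classifier outputs
    candidate_mask: (N, K) boolean, True for candidate labels
    """
    p = softmax(logits)
    p_cand = np.where(candidate_mask, p, 0.0)
    best = p_cand.argmax(axis=1)                 # provisional true label
    penalty = p_cand.copy()
    penalty[np.arange(len(p)), best] = 0.0       # spare the best candidate
    # probability-weighted penalty => quadratic push on ambiguous labels
    return (penalty * p_cand).sum(axis=1).mean()
```

When an instance has a single candidate label there is nothing ambiguous to penalize and the loss vanishes, matching the intuition that the penalty only acts on entangled candidates.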
cs.CV / 29 / 2603.04839

Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

面向高度可迁移的视觉-语言攻击:基于语义增强的动态对比交互
Li, Yuanbo, Xu, Tianyang, Hu, Cong, Zhou, Tao, Wu, Xiao-Jun, Kittler, Josef
Abstract
With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, adversarial examples can be designed to exhibit transferable power, attacking not only different models but also transferring across diverse tasks. However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. SADCA accomplishes this by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
Chinese Translation
随着视觉-语言预训练(VLP)模型的快速发展和广泛应用,它们对对抗攻击的脆弱性已成为一个关键问题。一般而言,对抗样本通常可以设计为具有可迁移性,不仅攻击不同的模型,还可跨越多种任务迁移。然而,现有对语言-视觉模型的攻击主要依赖于静态的跨模态交互,并且仅关注于破坏正向图像-文本对,导致跨模态干扰有限且可迁移性差。为了解决这个问题,我们提出了一种语义增强的动态对比攻击(SADCA),通过渐进的、语义引导的扰动增强对抗可迁移性。SADCA通过对抗图像和文本之间的动态交互,逐步破坏跨模态对齐。具体而言,SADCA通过建立一个涉及对抗样本、正样本和负样本的对比学习机制来实现这一点,从而增强所获得扰动的语义不一致性。此外,我们实证发现,传统迁移攻击中常用的输入变换也对VLP有益,这促使我们设计了一个语义增强模块,以增加对抗样本的多样性和泛化性。在多个数据集和模型上的大量实验表明,SADCA显著提高了对抗可迁移性,并始终超越了最先进的方法。代码已发布在 https://github.com/LiYuanBoJNU/SADCA。
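The contrastive adversarial optimisation described above can be sketched as a PGD-style loop that pushes an image embedding away from its paired (positive) text and toward negative texts. The toy linear encoder, finite-difference gradient, and all hyperparameters below are illustrative assumptions, not SADCA's actual pipeline.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def attack_loss(x, W, pos_txt, neg_txts):
    """Contrastive objective to *minimize*: similarity to the paired
    text minus mean similarity to negative texts."""
    z = W @ x                                    # toy linear image encoder
    return cosine(z, pos_txt) - np.mean([cosine(z, t) for t in neg_txts])

def pgd_contrastive_attack(x, W, pos_txt, neg_txts, eps=0.1, alpha=0.02, steps=10):
    """Sign-gradient (PGD-style) descent on the contrastive loss,
    using a finite-difference gradient for this self-contained demo."""
    x_adv = x.copy()
    for _ in range(steps):
        g = np.zeros_like(x_adv)
        for i in range(len(x_adv)):              # central finite differences
            d = np.zeros_like(x_adv)
            d[i] = 1e-4
            g[i] = (attack_loss(x_adv + d, W, pos_txt, neg_txts)
                    - attack_loss(x_adv - d, W, pos_txt, neg_txts)) / 2e-4
        x_adv = x_adv - alpha * np.sign(g)       # descend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps) # project into eps-ball
    return x_adv
```

In a real attack the gradient would come from backpropagation through the VLP encoders, and the positive/negative texts would be drawn dynamically, per the paper's dynamic-interaction scheme.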
cs.CV / 30 / 2603.04846

Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models

针对多模态大型语言模型的多范式协同对抗攻击
Li, Yuanbo, Xu, Tianyang, Hu, Cong, Zhou, Tao, Wu, Xiao-Jun, Kittler, Josef
Abstract
The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. Existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at https://github.com/LiYuanBoJNU/MPCAttack.
Chinese Translation
多模态大型语言模型(MLLMs)的快速进展显著推动了下游应用的发展。然而,这一进展也暴露了严重的可转移对抗脆弱性。现有针对MLLMs的对抗攻击通常依赖于在单一学习范式下训练的替代模型,并在各自的特征空间中进行独立优化。这种简单的设置自然限制了特征表示的丰富性,限制了搜索空间,从而阻碍了对抗扰动的多样性。为了解决这一问题,我们提出了一种新颖的多范式协同攻击(MPCAttack)框架,以增强针对MLLMs的对抗样本的可转移性。原则上,MPCAttack聚合来自视觉图像和语言文本的语义表示,通过多范式协同优化(MPCO)策略促进对聚合特征的联合对抗优化。通过对多范式特征进行对比匹配,MPCO自适应地平衡不同范式表示的重要性,并引导全局扰动优化,有效缓解表示偏差。在多个基准上的大量实验结果证明了MPCAttack的优越性,表明我们的解决方案在针对开源和闭源MLLMs的有目标和无目标攻击中始终优于最先进的方法。代码已发布在 https://github.com/LiYuanBoJNU/MPCAttack。
cs.CV / 31 / 2603.04847

GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction

GloSplat:更快更准确的3D重建的联合姿态-外观优化
Xiong, Tianyu, Li, Rui, Li, Linjie, Yang, Jiaqi
Abstract
Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as optimizable parameters separate from the Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art results among COLMAP-free methods, while GloSplat-A surpasses all COLMAP-based baselines.
Chinese Translation
特征提取、匹配、运动结构(SfM)和新视图合成(NVS)传统上被视为具有独立优化目标的单独问题。我们提出了GloSplat,一个在3D高斯点云训练过程中执行联合姿态-外观优化的框架。与之前依赖于光度梯度进行姿态优化的联合优化方法(如BARF、NeRF--、3RGS)不同,GloSplat在整个训练过程中将显式SfM特征轨迹作为第一类实体进行保留:3D点轨迹作为与高斯原语分开的可优化参数进行维护,通过与光度监督并行操作的重投影损失提供持久的几何锚点。这一架构选择防止了早期阶段的姿态漂移,同时实现了细粒度的优化——这是仅依赖光度的方法所缺乏的能力。我们介绍了两种管道变体:(1)GloSplat-F,一种不使用COLMAP的变体,采用基于检索的配对选择以实现高效重建;(2)GloSplat-A,一种全面匹配的变体以实现最大质量。两者都采用全局SfM初始化,随后在3D高斯点云训练过程中进行联合光度-几何优化。实验表明,GloSplat-F在不使用COLMAP的方法中实现了最先进的性能,而GloSplat-A超越了所有基于COLMAP的基线。
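The reprojection loss that anchors the explicit SfM tracks can be sketched as follows. The pinhole model is standard; the exact loss GloSplat uses (robust kernel, per-track weighting) is not stated in the abstract, so treat this plain squared-error version as an assumption.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of 3D points X (N, 3) into pixels (N, 2)
    for a camera with intrinsics K and pose (R, t)."""
    Xc = X @ R.T + t                             # world -> camera frame
    uvw = Xc @ K.T                               # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]              # perspective divide

def reprojection_loss(K, R, t, track_points, observations):
    """Geometric anchor used alongside the photometric loss: SfM track
    3D points are reprojected into a view and compared against their
    2D feature observations (mean squared pixel error)."""
    pred = project(K, R, t, track_points)
    return np.mean(np.sum((pred - observations) ** 2, axis=1))
```

During joint optimization, gradients of this term flow into both the camera pose and the track points, which is what keeps the pose from drifting while the photometric term refines appearance.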
cs.CV / 32 / 2603.04864

Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video

基于广播视频的可扩展棒球投球伤害风险筛查
Bright, Jerrin, Mende, Justin, Zelek, John
Abstract
Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE $< 1^{\circ}$). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.
Chinese Translation
投球中的伤害预测依赖于精确的生物力学信号,然而金标准测量来自昂贵的、安装在体育场的多摄像头系统,这些系统在专业场馆之外是不可用的。我们提出了一种单目视频处理管道,从广播录像中恢复18个临床相关的生物力学指标,将基于姿态的运动学定位为伤害风险建模的可扩展来源。基于DreamPose3D,我们的方法引入了一个漂移控制的全局提升模块,通过基于速度的参数化和滑动窗口推理恢复骨盆轨迹,将基于骨盆的姿态提升到全局空间。为了解决运动模糊、压缩伪影和极端投球姿态的问题,我们结合了一个运动学精炼管道,采用骨长约束、关节限制的逆运动学、平滑处理和对称性约束,以确保时间上稳定和物理上合理的运动学。在13名职业投手(156个配对投球)上,16/18个指标达到了亚度数的一致性(MAE < 1°)。使用这些指标进行伤害预测,自动化筛查模型在7,348名投手中对汤米·约翰手术的AUC为0.811,对显著手臂伤害的AUC为0.825。由此产生的基于姿态的指标支持可扩展的伤害风险筛查,确立了单目广播视频作为生物力学的可行替代方案,取代体育场规模的运动捕捉。
cs.CV / 33 / 2603.04869

SURE: Semi-dense Uncertainty-REfined Feature Matching

SURE:半稠密不确定性精炼特征匹配
Li, Sicheng, Gu, Zaiwang, Zhang, Jie, Guo, Qing, Jiang, Xudong, Cheng, Jun
Abstract
Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. Our code will be available at https://github.com/LSC-ALAN/SURE.
Chinese Translation
建立可靠的图像对应关系对于许多机器人视觉问题至关重要。然而,现有方法在大视角变化或无纹理区域等具有挑战性的场景中往往表现不佳,在这些情况下,不正确的对应关系仍可能获得高相似度分数。这主要是因为传统模型仅依赖于特征相似性,缺乏明确的机制来估计预测匹配的可靠性,从而导致过于自信的错误。为了解决这个问题,我们提出了SURE,一个半稠密不确定性精炼匹配框架,该框架通过建模随机不确定性(aleatoric)和认知不确定性(epistemic)共同预测对应关系及其置信度。我们的方法引入了一种新颖的证据头(evidential head)用于可信的坐标回归,并配备了一个轻量级空间融合模块,以最小的开销增强局部特征的精度。我们在多个标准基准上评估了我们的方法,结果显示其在准确性和效率上始终优于现有的最先进半稠密匹配模型。我们的代码将发布在 https://github.com/LSC-ALAN/SURE。
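An evidential head for coordinate regression typically follows deep evidential regression: the network predicts Normal-Inverse-Gamma parameters per coordinate, from which both uncertainty types fall out in closed form. The sketch below shows that standard decomposition; whether SURE uses these exact moments is an assumption on our part.

```python
import numpy as np

def nig_uncertainties(gamma, nu, alpha, beta):
    """Given Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta)
    predicted by an evidential head for one coordinate, return the
    point prediction and its aleatoric / epistemic uncertainties
    (standard deep-evidential-regression moments).

    Requires nu > 0, alpha > 1, beta > 0.
    """
    prediction = gamma                           # predicted coordinate
    aleatoric = beta / (alpha - 1.0)             # E[sigma^2]: data noise
    epistemic = beta / (nu * (alpha - 1.0))      # Var[mu]: model doubt
    return prediction, aleatoric, epistemic
```

A match can then be down-weighted or rejected when either uncertainty is high, which is the explicit reliability mechanism the abstract says plain similarity-based matchers lack.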
cs.CV / 34 / 2603.04870

Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning

基于扩散的sRGB真实噪声生成通过提示驱动的噪声表示学习
Ko, Jaekyun, Kim, Dongjin, Lee, Soomin, Wang, Guanghui, Kim, Tae Hyun
Abstract
Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.
Chinese Translation
由于噪声的多变性,在sRGB图像空间中去噪是一项具有挑战性的任务。尽管端到端的方法表现良好,但在真实场景中的有效性受到真实噪声-干净图像对稀缺性的限制,这些图像对的收集既昂贵又困难。为了解决这一限制,已经开发了几种生成方法,从有限的数据中合成真实的噪声图像。这些生成方法通常在训练和测试过程中依赖于相机元数据来合成真实世界的噪声。然而,元数据的缺乏或设备之间的不一致性限制了它们的可用性。因此,我们提出了一种新颖的框架,称为提示驱动噪声生成(Prompt-Driven Noise Generation,PNG)。该模型能够获取高维提示特征,捕捉真实世界输入噪声的特性,并生成与输入噪声分布一致的多种真实噪声图像。通过消除对显式相机元数据的依赖,我们的方法显著增强了噪声合成的普适性和适用性。全面的实验表明,我们的模型有效地生成了真实的噪声图像,并展示了这些生成图像在去除各种基准数据集中的真实世界噪声方面的成功应用。
cs.CV / 35 / 2603.04874

Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics

可解释的出手前棒球投球类型预测:基于广播3D运动学
Bright, Jerrin, Lu, Michelle, Zelek, John
Abstract
How much can a pitcher's body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, groundtruth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9% of the predictive signal versus 35.1% for the lower body, with wrist position (14.8%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs. two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80% and delineating where kinematic information ends and ball-flight information begins.
Chinese Translation
投手的身体能揭示多少关于即将投出的球的信息?我们通过对单目3D姿态序列进行分类,研究了这一问题,分类了八种投球类型,而无需访问球的飞行数据。我们的流程将基于扩散的3D姿态骨干网络与自动投球事件检测、经过真实数据验证的生物力学特征提取以及对229个运动学特征的梯度提升分类相结合。在119,561个职业投球的评估中,这是迄今为止最大的基准,我们仅使用身体运动学就达到了80.4%的准确率。系统的重要性分析表明,上半身的力学特征贡献了64.9%的预测信号,而下半身则为35.1%,其中手腕位置(14.8%)和躯干侧倾被确定为最具信息量的关节组和生物力学特征。我们进一步展示了以握法定义的变体(四缝与二缝快速球)无法从姿态中分离,建立了一个接近80%的经验上限,并划定了运动学信息结束与球飞行信息开始的界限。
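To make the feature pipeline concrete, below are hypothetical versions of two of the abstract's headline features, trunk lateral tilt and root-relative wrist position, computed from 3D keypoints. The coordinate convention and exact definitions are illustrative assumptions, not the paper's specification of its 229 features.

```python
import numpy as np

def trunk_lateral_tilt_deg(mid_hip, mid_shoulder):
    """Lateral trunk tilt in degrees: the hip->shoulder axis measured
    against vertical in the frontal (x, y) plane, assuming coordinates
    (x = right, y = up, z = toward home plate). Illustrative only."""
    v = np.asarray(mid_shoulder, dtype=float) - np.asarray(mid_hip, dtype=float)
    return np.degrees(np.arctan2(v[0], v[1]))    # sideways lean vs. upright

def wrist_position_rel(wrist, pelvis):
    """Wrist position expressed relative to the pelvis root, a simple
    stand-in for the wrist-position feature group the paper reports
    as most informative (14.8% of importance)."""
    return np.asarray(wrist, dtype=float) - np.asarray(pelvis, dtype=float)
```

Features like these, stacked per pitching event, would then feed the gradient-boosted classifier described in the abstract.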
cs.CV / 36 / 2603.04878

Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

基于结构观察驱动的图像-文本对比学习用于计算机断层扫描报告生成
Liu, Hong, Wei, Dong, Peng, Qiong, Huang, Yawen, Wu, Xian, Zheng, Yefeng, Wang, Liansheng
Abstract
Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.
Chinese Translation
计算机断层扫描报告生成(CTRG)旨在自动化临床放射学报告的撰写过程,从而减轻报告撰写的工作负担并促进患者护理。尽管深度学习方法在X光报告生成方面取得了显著进展,但由于CT图像的数据量更大以及描述所需的细节更为复杂,其在CTRG中的有效性可能受到限制。本研究提出了一种新颖的两阶段(结构学习和报告学习)框架,专为CTRG量身定制,具有有效的结构级图像-文本对比。在第一阶段,一组可学习的结构特定视觉查询观察CT图像中的相应结构。生成的观察令牌与从附带的放射学报告中提取的结构特定文本特征通过结构级图像-文本对比损失进行对比。此外,提出基于文本-文本相似性的软伪目标,以减轻假阴性的影响,即来自非配对图像和报告的语义相同的图像结构和文本。因此,模型学习CT图像与报告之间的结构级语义对应关系。此外,提出了一种动态的、多样性增强的负样本队列,以指导网络学习区分各种异常。在第二阶段,视觉结构查询被冻结,并用于选择描绘每个解剖结构的关键图像补丁嵌入,最小化来自无关区域的干扰,同时减少内存消耗。同时,增加了一个文本解码器并进行训练以生成报告。我们在两个公共数据集上的广泛实验表明,我们的框架在临床效率方面建立了CTRG的新最先进性能,其各个组件也表现出有效性。
cs.CV / 37 / 2603.04882

DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization

DeformTrace:一种具有中继令牌的可变形状态空间模型用于时间伪造定位
Zhu, Xiaodong, Wang, Suting, Zheng, Yuanming, Yang, Junqi, Liao, Yangxu, Yang, Yuhong, Tu, Weiping, Wang, Zhongyuan
Abstract
Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. Besides, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.
Chinese Translation
时间伪造定位(TFL)旨在准确识别视频和音频中被操纵的片段,为安全和法医学提供强有力的可解释性。尽管近期的状态空间模型(SSMs)在精确的时间推理中显示出良好的前景,但其在TFL中的应用受到模糊边界、稀疏伪造和有限长程建模的限制。我们提出了DeformTrace,通过引入可变形动态和中继机制来增强SSMs,以应对这些挑战。具体而言,可变形自状态空间模型(DS-SSM)为SSMs引入动态感受野,以实现精确的时间定位。为了进一步增强其时间推理能力并减轻长程衰减,DS-SSM中集成了中继令牌机制。此外,可变形交叉状态空间模型(DC-SSM)将全局状态空间划分为特定查询的子空间,从而减少非伪造信息的积累并提高对稀疏伪造的敏感性。这些组件被集成到一个混合架构中,该架构结合了变换器的全局建模能力和SSMs的高效性。大量实验表明,DeformTrace在参数更少、推理更快和鲁棒性更强的情况下实现了最先进的性能。
cs.CV / 38 / 2603.04887

Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

联邦特定模态编码器与部分个性化融合解码器用于多模态脑肿瘤分割
Liu, Hong, Wei, Dong, Dai, Qian, Wu, Xian, Zheng, Yefeng, Wang, Liansheng
Abstract
Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs -- using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.
Chinese Translation
现有的大多数用于医学图像分析的联邦学习(FL)方法仅考虑了模态内部的异质性,这限制了它们在多模态成像应用中的适用性。在实际应用中,一些FL参与者可能仅拥有完整成像模态的子集,这使得模态间的异质性成为有效训练全球模型的挑战。同时,每个参与者期望在FL中获得一个针对其本地数据特征量身定制的个性化模型。本研究提出了一种新的FL框架,具有联邦特定模态编码器和部分个性化多模态融合解码器(FedMEPD),以解决这两个并发问题。具体而言,FedMEPD为每种模态采用独立的编码器,以考虑模态间的异质性。虽然这些编码器是完全联邦的,但解码器是部分个性化的,以满足个体需求——利用全球与本地参数更新之间的差异动态确定哪些解码器过滤器是个性化的。在实现方面,拥有完整模态数据的服务器使用融合解码器融合来自所有特定模态编码器的表示,从而连接模态,通过反向传播优化编码器。此外,从融合的多模态表示中提取多个锚点,并将其与模型参数一起分发给客户端。相反,具有不完整模态的客户端通过缩放点积交叉注意力将其缺失模态的表示校准到全球全模态锚点,以弥补因缺失模态而导致的信息损失。FedMEPD在BraTS 2018和2020多模态脑肿瘤分割基准上进行了验证。结果表明,它在多模态和个性化FL方面优于多种最新方法,其新颖设计也被证明是有效的。
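The client-side calibration step above is plain scaled dot-product cross-attention with the global anchors as keys and values. The sketch below shows that mechanism; shapes and names are illustrative, and FedMEPD's surrounding machinery (anchor extraction, personalization) is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def calibrate_with_anchors(local_feats, anchors):
    """Scaled dot-product cross-attention: a client's missing-modal
    representations (queries) attend over the server's full-modal
    anchors (keys = values), pulling each representation toward a
    data-dependent mixture of global anchors.

    local_feats: (N, D) client representations
    anchors:     (M, D) global full-modal anchors
    """
    d = local_feats.shape[1]
    attn = softmax(local_feats @ anchors.T / np.sqrt(d), axis=1)  # (N, M)
    return attn @ anchors                        # anchor-calibrated features
```

Because each output row is a convex combination of anchors, the calibrated features stay inside the span of the full-modal statistics, which is the sense in which missing-modal information is "made up for".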
cs.CV / 39 / 2603.04892

Locality-Attending Vision Transformer

局部关注视觉变换器
Hajimiri, Sina, Beizaee, Farzad, Shakeri, Fereshteh, Desrosiers, Christian, Ayed, Ismail Ben, Dolz, Jose
Abstract
Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
Chinese Translation
视觉变换器通过利用全局自注意力来捕捉长距离依赖关系,在分类任务中取得了显著成功。然而,这种机制可能会掩盖对于分割等任务至关重要的细粒度空间细节。在本研究中,我们旨在提升视觉变换器在标准图像级分类训练后的分割性能。更具体地说,我们提出了一种简单而有效的附加模块,能够在保持视觉变换器图像级识别能力的同时,改善分割任务的性能。在我们的方法中,我们使用可学习的高斯核来调节自注意力,使其偏向于邻近的图块。我们进一步优化图块表示,以便在图块位置学习更好的嵌入。这些修改鼓励标记关注局部环境,并确保在空间位置上有意义的表示,同时仍然保留模型整合全局信息的能力。实验结果证明了我们修改的有效性,在三个基准测试上(例如,ViT Tiny 和 Base 在 ADE20K 上分别提高超过 6% 和 4% 的分割性能),而不改变训练机制或牺牲分类性能。代码可在 https://github.com/sinahmr/LocAtViT/ 获取。
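The locality modulation described above amounts to adding a Gaussian log-space bias to the attention logits so each patch favours its spatial neighbours. In the paper the kernel width is learnable; the sketch below fixes sigma for illustration, and all names are ours.

```python
import numpy as np

def gaussian_attention_bias(h, w, sigma):
    """(h*w, h*w) bias between patch-grid positions:
    -dist^2 / (2 sigma^2), added to attention logits."""
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    return -d2 / (2.0 * sigma ** 2)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(scores, h, w, sigma=1.5):
    """Modulate self-attention logits with the Gaussian locality bias
    before the softmax, biasing tokens toward neighbouring patches."""
    return softmax(scores + gaussian_attention_bias(h, w, sigma), axis=-1)
```

Since the bias is additive in logit space, global information is attenuated rather than masked out, matching the paper's goal of keeping image-level recognition intact.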
cs.CV / 40 / 2603.04899

FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation

FC-VFI:高保真且一致的视频帧插值用于高帧率慢动作视频生成
Ding, Ganggui, Chen, Hao, Xu, Xiaogang
Abstract
Large pre-trained video diffusion models excel at video frame interpolation but struggle to generate high-fidelity frames due to their reliance on intrinsic generative priors, which limits detail preservation from the start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting 4x and 8x interpolation and boosting frame rates from 30 FPS to 120 and 240 FPS at 2560×1440 resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from the start and end frames, and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
Chinese Translation
大型预训练视频扩散模型在视频帧插值方面表现出色,但由于依赖内在生成先验,难以生成高保真度的帧,从而限制了起始帧和结束帧的细节保留。现有方法通常依赖于运动控制以实现时间一致性,但密集光流易出错,而稀疏点缺乏结构上下文。本文提出FC-VFI,用于高保真且一致的视频帧插值,支持4倍和8倍插值,将帧率从30 FPS提升至120 FPS和240 FPS,分辨率为2560×1440,同时保持视觉保真度和运动一致性。我们引入了一种时间建模策略,利用潜在序列从起始帧和结束帧继承保真度线索,并利用语义匹配线进行结构感知的运动引导,从而改善运动一致性。此外,我们提出了一种时间差损失,以减轻时间不一致性。大量实验表明,FC-VFI在多种场景下实现了高性能和结构完整性。
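A temporal difference loss is commonly written as a penalty on the mismatch between frame-to-frame differences of the predicted and reference sequences, suppressing flicker without constraining per-frame appearance. The L1 form below is a sketch; FC-VFI's exact formulation may differ.

```python
import numpy as np

def temporal_difference_loss(pred, target):
    """Penalize mismatch between the temporal differences of the
    predicted and reference frame sequences.

    pred, target: (T, H, W) frame sequences
    """
    dp = pred[1:] - pred[:-1]        # predicted frame-to-frame changes
    dt = target[1:] - target[:-1]    # reference frame-to-frame changes
    return np.mean(np.abs(dp - dt))
```

Note the loss is zero whenever the two sequences share the same motion, even under a constant brightness offset, so it specifically targets temporal inconsistency rather than per-frame fidelity.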
cs.CV / 41 / 2603.04908

AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

AdaIAT:自适应增加对生成文本的关注以缓解大型视觉语言模型中的幻觉
Zhong, Li'an, He, Ziqiang, Zheng, Jibin, Li, Jin, Wang, Z. Jane, Kang, Xiangui
Abstract
Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
Chinese Translation
幻觉一直是当前大型视觉语言模型(LVLM)发展和应用的重大障碍。为了减轻幻觉,直接在推理过程中增加对图像标记的注意力权重是一种直观且有效的方法。尽管这有效降低了幻觉率,但往往会导致重复描述。为了解决这个问题,我们首先对注意力模式进行了分析,发现真实物体标记相较于幻觉标记更倾向于对生成文本分配更高的注意力。这启发我们利用生成文本中包含的与指令相关的视觉信息和上下文知识,以缓解幻觉,同时保持语言的连贯性。因此,我们提出了对生成文本的注意力(IAT),并证明它显著降低了幻觉率,同时避免了重复描述。为了防止简单的放大影响LVLM的固有预测能力,我们进一步探索了自适应IAT(AdaIAT),该方法采用逐层阈值控制干预时间和针对每个注意力头特征的细粒度放大幅度。分析和实验均表明AdaIAT的有效性。多个LVLM的结果显示,AdaIAT有效减轻了幻觉(在LLaVA-1.5上幻觉率$C_S$和$C_I$分别降低了35.8%和37.1%),同时保持了语言表现和预测能力,实现了良好的平衡。
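The attention re-weighting idea described in the abstract can be sketched as follows. This is an illustrative stand-in only: the function name, the gate threshold `tau`, and the amplification factor `alpha` are assumptions, not AdaIAT's actual parameterization; the real method adapts per layer and per attention head.

```python
import numpy as np

def amplify_text_attention(attn, text_idx, alpha=1.2, tau=0.2):
    """Re-weight one attention head's distribution (illustrative sketch).

    attn:     1-D array, attention over all tokens (sums to 1).
    text_idx: indices of previously generated text tokens.
    alpha:    amplification factor applied to text-token attention.
    tau:      threshold; heads already attending strongly to the
              generated text are left untouched (a stand-in for the
              paper's adaptive, layer-wise gate).
    """
    attn = np.asarray(attn, dtype=float)
    if attn[text_idx].sum() >= tau:   # adaptive gate: skip confident heads
        return attn
    out = attn.copy()
    out[text_idx] *= alpha            # boost attention to generated text
    return out / out.sum()            # renormalize to a distribution
```

Renormalizing after the boost keeps the output a valid attention distribution, which is what lets the intervention run at inference time without retraining.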
cs.CV / 42 / 2603.04938

Person Detection and Tracking from an Overhead Crane LiDAR

基于吊车激光雷达的人体检测与追踪
Jayawickrama, Nilusha, Toikka, Henrik, Ojala, Risto
Abstract
This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and suitable public training data is scarce. Accordingly, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute to bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations on GitHub to support further research.
Chinese Translation
本文研究了在工业室内工作空间中使用安装在吊车上的激光雷达进行人体检测与追踪。吊车的高空视角引入了与常见的以车辆为中心的激光雷达基准测试之间的显著领域转变,并且适合的公共训练数据的可用性有限。因此,我们策划了一个特定场地的高空激光雷达数据集,包含3D人体边界框标注,并在统一的训练与评估协议下调整选定的候选3D检测器。我们进一步集成了基于检测的轻量级追踪方法,使用AB3DMOT和SimpleTrack来保持人体身份的连续性。通过距离切片评估报告检测性能,以量化传感器设置的实际操作范围。最佳适配的检测器配置在5.0米的水平半径内实现了高达0.84的平均精度(AP),在1.0米时提高至0.97,其中VoxelNeXt和SECOND是这一范围内最可靠的骨干网络。这些结果有助于缩小标准驾驶数据集与高空传感器在人体检测与追踪中的领域差距。我们还报告了延迟测量,突出了实际实时可行性。最后,我们在GitHub上发布了我们的数据集和实现,以支持进一步的研究。
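The distance-sliced evaluation mentioned above can be illustrated with a minimal sketch. This computes per-slice recall rather than the paper's full AP, and the slice boundaries and data layout are assumptions chosen for illustration.

```python
import math

def distance_sliced_recall(gt, matched, slices=((0.0, 1.0), (1.0, 5.0))):
    """Per-slice detection recall (illustrative stand-in for the paper's
    distance-sliced AP). `gt` is a list of (x, y) ground-truth positions
    relative to the sensor; `matched` is a parallel list of booleans
    saying whether each ground truth was detected."""
    out = {}
    for lo, hi in slices:
        # keep only ground truths whose horizontal radius falls in [lo, hi)
        in_slice = [m for (x, y), m in zip(gt, matched)
                    if lo <= math.hypot(x, y) < hi]
        out[(lo, hi)] = sum(in_slice) / len(in_slice) if in_slice else None
    return out
```

Slicing by horizontal radius is what lets a single number per band describe the practical operating envelope of the overhead sensor.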
cs.CV / 43 / 2603.04947

Adaptive Prototype-based Interpretable Grading of Prostate Cancer

基于原型的自适应可解释前列腺癌分级
Bhattacharyya, Riddhasree, Dutta, Pallabi, Mitra, Sushmita
Abstract
As prostate cancer is one of the most frequently diagnosed malignancies in men, the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stakes applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for an interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.
Chinese Translation
前列腺癌是男性中最常被诊断的恶性肿瘤之一,活检的需求日益增加,给病理学家带来了巨大的工作负担。分级过程繁琐且主观,这促使了自动化系统的发展。尽管深度学习在性能方面取得了一定进展,但其有限的可解释性在医学等高风险应用中的广泛采用面临挑战。现有的前列腺癌分类器的可解释性技术提供了粗略的解释,但未能揭示突出区域的重要性。在这种情况下,我们提出了一种新颖的基于原型的弱监督框架,用于从组织病理学图像中进行前列腺癌的可解释分级。这些网络可以被认为更为可靠,因为它们的显式推理过程反映了病理学家在比较可疑区域与临床验证示例时的工作流程。网络最初在补丁级别进行预训练,以学习与每个分级相关的稳健原型特征。为了使其适应前列腺癌分级的弱监督设置,网络通过新的原型感知损失函数进行微调。最后,引入了一种基于注意力的动态剪枝机制,以处理样本间的异质性,同时有选择地强调相关原型以实现最佳性能。在基准数据集PANDA和SICAP上的广泛验证确认该框架可以作为病理学家日常诊断工作流程中的可靠辅助工具。
cs.CV / 44 / 2603.04950

Location-Aware Pretraining for Medical Difference Visual Question Answering

基于位置感知的医学差异视觉问答预训练
Musinguzi, Denis, Han, Caren, Mitra, Prasenjit
Abstract
Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
Chinese Translation
与传统的单图像模型不同,差异医学视觉问答(VQA)框架处理多幅图像以识别差异,反映了放射科医生的比较诊断工作流程。然而,基于对比或分类目标训练的标准视觉编码器往往无法捕捉到区分疾病进展与获取差异所需的微妙视觉变化。为了解决这一局限性,我们提出了一种预训练框架,结合了位置感知任务,包括自动指称表达(AREF)、基础字幕生成(GCAP)和条件自动指称表达(CAREF)。这些特定任务使视觉编码器能够学习细粒度、空间上扎根的视觉表征,而这些表征通常被传统的预训练方法所忽视。随后,我们将这一增强的视觉编码器与语言模型集成,以执行医学差异视觉问答。实验结果表明,我们的方法在检测和推理胸部X光图像中临床相关变化方面达到了最先进的性能。
cs.CV / 45 / 2603.04957

VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

VisionPangu:一个具有17亿参数的紧凑型细粒度多模态助手
Fan, Jiaxin, Song, Wenpo
Abstract
Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
Chinese Translation
大型多模态模型(LMMs)在视觉语言理解方面取得了强大的性能,但许多现有方法依赖于大规模架构和粗略监督,这限制了它们生成详细图像描述的能力。在本研究中,我们提出了VisionPangu,一个紧凑型的17亿参数多模态模型,旨在通过高效的多模态对齐和高质量的监督来改善详细图像描述。我们的模型结合了基于InternVL的视觉编码器和OpenPangu-Embedded语言骨干,通过轻量级的多层感知器(MLP)投影器连接,并采用了受LLaVA启发的指令调优流程。通过整合来自DOCCI数据集的密集人类撰写描述,VisionPangu在不依赖激进模型扩展的情况下,提高了语义连贯性和描述丰富性。实验结果表明,紧凑型多模态模型能够在生成更结构化和详细的描述的同时,达到具有竞争力的性能。代码和模型权重将公开发布在 https://www.modelscope.cn/models/asdfgh007/visionpangu。
cs.CV / 46 / 2603.04958

Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression

重新审视单目3D可变形模型回归的旧视角投影
Chong, Toby, Nakajima, Ryota
Abstract
We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow finetuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
Chinese Translation
我们提出了一种新颖的相机模型,用于单目3D可变形模型(3DMM)回归方法,有效捕捉在近距离面部图像中常见的透视畸变效应。将3D可变形模型拟合到视频中是内容创作中的一项关键技术。特别是,基于回归的方法通过将可变形模型的渲染输出与目标图像匹配,产生了快速且准确的结果。这些方法通常在正交投影下实现稳定的性能,正交投影消除了焦距与物体距离之间的模糊性。然而,这种简化使得它们不适用于近距离拍摄的画面,例如使用头戴式相机捕获的画面。我们通过引入一个新的收缩参数扩展了正交投影,结合伪透视效果,同时保持原始投影的稳定性。我们提出了几种技术,允许对现有模型进行微调,并通过使用与头戴式相机录制的自定义数据集进行定量和定性比较,展示了我们修改的有效性。
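The abstract's shrinkage-extended orthographic projection can be sketched as below. This is one plausible parameterization only; the paper's exact formulation of the shrinkage parameter is not given in the abstract, so the function signature and the reference-depth term are assumptions.

```python
def pseudo_perspective(X, Y, Z, scale=1.0, shrink=0.0, z_ref=1.0):
    """Project a 3D point with an orthographic projection extended by a
    shrinkage term (one plausible parameterization; the paper's exact
    formulation may differ). shrink = 0 recovers pure orthographic
    projection; shrink > 0 shrinks points lying beyond the reference
    depth z_ref, mimicking perspective foreshortening without the
    focal-length/distance ambiguity of a full pinhole model."""
    s = scale / (1.0 + shrink * (Z - z_ref))
    return s * X, s * Y
```

With `shrink = 0` the depth coordinate drops out entirely, which is the stability property of orthographic projection the paper aims to preserve.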
cs.CV / 47 / 2603.04975

BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

BiEvLight:面向任务的双层事件精细化学习用于低光照图像增强
Yao, Zishu, Su, Xiang-Xiang, Zhou, Shengning, Chen, Guang-Yong, Fan, Guodong, Chen, Xing
Abstract
Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage-which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective-we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the Real-world noise Dataset SDE demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30dB in PSNR, 2.03dB in PSNR* and 0.047 in SSIM, respectively. The code will be publicly available at https://github.com/iijjlk/BiEvlight.
Chinese Translation
事件相机具有高动态范围,显示出在低光照图像增强(LLIE)方面的巨大潜力。现有研究主要集中在设计有效的模态融合策略。然而,一个关键挑战是事件中的内在背景活动(BA)噪声和图像中的低信噪比(SNR)导致的双重降解,这在模态融合过程中造成严重的噪声耦合,形成了关键的性能瓶颈。因此,我们认为精确的事件去噪是释放基于事件的融合全部潜力的前提。为此,我们提出了BiEvLight,一个分层的、面向任务的框架,通过利用增强和去噪之间的内在相互依赖性,协同优化这两个过程。具体而言,BiEvLight利用图像和事件之间强烈的梯度相关性,构建一个梯度引导的事件去噪先验,以缓解在高噪声区域的去噪不足。此外,我们不再将事件去噪视为一个静态的预处理阶段——这不可避免地导致过度去噪和不足去噪之间的权衡,并且无法适应特定增强目标的需求——而是将其重新表述为一个受增强任务约束的双层优化问题。通过跨任务交互,上层去噪问题学习到针对下层增强目标的事件表示,从而显著提高整体增强质量。在真实噪声数据集SDE上的大量实验表明,我们的方法显著优于最先进的(SOTA)方法,PSNR平均提升1.30dB,PSNR*提升2.03dB,SSIM提升0.047。代码将公开发布在 https://github.com/iijjlk/BiEvlight。
cs.CV / 48 / 2603.04976

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

3D-RFT:基于视频的3D场景理解的强化微调
Linghu, Xiongkun, Huang, Jiangyong, Jia, Baoxiong, Huang, Siyuan
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
Chinese Translation
可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)已成为提升大型语言模型(Large Language Models, LLMs)推理能力的变革性范式,但其在3D场景理解中的潜力仍未得到充分探索。现有方法主要依赖于监督微调(Supervised Fine-Tuning, SFT),其中基于标记的交叉熵损失作为优化的间接代理,导致训练目标与任务表现之间的不一致。为了解决这一问题,我们提出了基于视频的3D场景理解的强化微调(3D-RFT),这是第一个将RLVR扩展到基于视频的3D感知和推理的框架。3D-RFT通过直接优化模型以满足评估指标,改变了这一范式。3D-RFT首先通过SFT激活3D感知的多模态大型语言模型(Multi-modal Large Language Models, MLLMs),然后使用具有严格可验证奖励函数的群体相对策略优化(Group Relative Policy Optimization, GRPO)进行强化微调。我们从3D IoU和F1分数等指标直接设计任务特定的奖励函数,以提供更有效的信号来指导模型训练。大量实验表明,3D-RFT-4B在各种基于视频的3D场景理解任务中实现了最先进的性能。值得注意的是,3D-RFT-4B在3D视频检测、3D视觉定位和空间推理基准测试中显著超越了更大的模型(例如,VG LLM-8B)。我们进一步揭示了3D-RFT的良好特性,如稳健的有效性,以及对训练策略和数据影响的宝贵见解。我们希望3D-RFT能够作为未来3D场景理解发展的稳健和有前景的范式。
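A verifiable reward built directly from 3D IoU, as the abstract describes, can be sketched for the axis-aligned case. This is a simplification: the paper likely handles rotated boxes, and the box encoding here (min/max corners) is an assumption.

```python
def iou_3d_reward(box_a, box_b):
    """Axis-aligned 3D IoU used directly as a verifiable reward
    (a simplified sketch; 3D-RFT's reward for rotated boxes would
    differ). Boxes are (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = vol_a = vol_b = 1.0
    for i in range(3):                       # accumulate per-axis overlap
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        inter *= max(0.0, hi - lo)
        vol_a *= box_a[i + 3] - box_a[i]
        vol_b *= box_b[i + 3] - box_b[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0
```

Because the reward is a deterministic function of the predicted and ground-truth boxes, it is strictly verifiable, which is the property GRPO-style training relies on here.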
cs.CV / 49 / 2603.04977

Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

思考后再验证:一种用于长视频理解的假设验证多智能体框架
Wang, Zheng, Chen, Haoran, Qin, Haoxuan, Wei, Zhipeng, Qian, Tianwen, Bai, Cong
Abstract
Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.
Chinese Translation
长视频理解面临着密集的视觉冗余、长时间范围的依赖关系,以及基于链式思维和检索的智能体倾向于积累语义漂移和关联驱动错误的问题。我们认为,长视频推理应当从深思熟虑的任务表述开始,而非反应式的检索:模型必须首先阐明对于每个候选答案,视频中必须成立的条件。这个“思考后寻找”的原则激励了VideoHV-Agent的提出,该框架将视频问答重新构建为一个结构化的假设验证过程。基于视频摘要,思考者(Thinker)将答案候选重写为可测试的假设,评判者(Judge)推导出一个区分性线索,指定必须检查的证据,验证者(Verifier)利用局部的、细粒度的视频内容来验证和测试该线索,而答案智能体(Answer agent)则整合经过验证的证据以生成最终答案。在三个长视频理解基准上的实验表明,VideoHV-Agent在提供增强的可解释性、改善的逻辑严密性和较低的计算成本的同时,达到了最先进的准确性。我们的代码已公开可用,链接为:https://github.com/Haorane/VideoHV-Agent。
cs.CV / 50 / 2603.04980

A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

通过简单的下一标记预测统一理解、生成和编辑的基线
Zhu, Jie, Ma, Hanghang, Wang, Jia, Guan, Yayong, Zeng, Yanbing, Gao, Lishuai, Wu, Junqiang, Hu, Jie, Wang, Leye
Abstract
In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model's capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.
Chinese Translation
在这项工作中,我们介绍了Wallaroo,这是一种简单的自回归基线,利用下一标记预测同时统一多模态理解、图像生成和编辑。此外,Wallaroo支持多分辨率的图像输入和输出,并且支持中英文双语。我们将视觉编码解耦为独立的路径,并应用四阶段训练策略来重塑模型的能力。在各种基准测试中进行的实验表明,Wallaroo的性能具有竞争力,甚至超过其他统一模型,表明自回归模型在统一多模态理解和生成方面具有巨大的潜力。我们的代码可在https://github.com/JiePKU/Wallaroo获取。
cs.CV / 51 / 2603.04989

TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

TAPFormer:通过瞬态异步融合帧和事件实现稳健的任意点跟踪
Liu, Jiaxiong, Tan, Zhen, Zhang, Jinpu, Zhou, Yi, Shen, Hui, Chen, Xieyuanli, Hu, Dewen
Abstract
Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
Chinese Translation
任意点跟踪(TAP)是计算机视觉中的一项基本但具有挑战性的任务,要求高精度和长期运动推理。近期尝试将RGB帧和事件流结合的研究显示出良好的前景,但通常依赖于同步或非自适应的融合方式,导致时间对齐问题,并在某一模态失效时严重退化。我们提出了TAPFormer,一个基于变换器的框架,能够异步地进行时间一致的帧和事件融合,以实现稳健且高频的任意点跟踪。我们的关键创新是瞬态异步融合(TAF)机制,它通过连续的事件更新显式建模离散帧之间的时间演变,弥合低频率帧与高频率事件之间的差距。此外,交叉模态局部加权融合(CLWF)模块根据模态可靠性自适应调整空间注意力,即使在模糊或低光照条件下也能产生稳定且具有区分性的特征。为了在现实条件下评估我们的方法,我们构建了一个新的真实世界帧-事件TAP数据集,涵盖多种光照和运动条件。我们的方法在现有点跟踪器中表现优越,在阈值内实现了28.2%的平均像素误差改进。此外,在标准点跟踪基准测试中,我们的跟踪器始终表现出最佳性能。项目网站:tapformer.github.io
cs.CV / 52 / 2603.04993

MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration

MultiGO++:通过几何-纹理协作实现单目3D穿衣人类重建
Yao, Nanjie, Zhang, Gangjian, Shen, Wenhao, Shu, Jian, Feng, Yu, Wang, Hao
Abstract
Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve the performance on texture quality estimation in challenging scenarios; (2) A region-aware shape extraction module that extracts features of each body region and models their interactions to obtain geometry information, and a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.
Chinese Translation
单目3D穿衣人类重建旨在从单张图像生成完整且逼真的纹理3D头像。现有方法通常在多视角监督下进行训练,并依赖带注释的几何先验,而在推理过程中,这些先验由预训练网络从单目输入中估计。这些方法受到三个关键限制的制约:在纹理方面,训练数据的缺乏;在几何方面,外部先验的不准确;在系统性方面,偏向单一模态的监督,所有这些都导致了次优重建。为了解决这些问题,我们提出了一种新颖的重建框架,命名为MultiGO++,该框架实现了有效的系统性几何-纹理协作。它由三个核心部分组成:(1) 一种多源纹理合成策略,构建了超过15,000个3D纹理人类扫描,以提高在挑战场景下的纹理质量估计性能;(2) 一种区域感知形状提取模块,提取并交互每个身体区域的特征,以获取几何信息,以及一种傅里叶几何编码器,减小模态差距以实现有效的几何学习;(3) 一种双重重建U-Net,利用几何-纹理协作特征来优化和生成高保真纹理3D人类网格。在两个基准测试和多个实际案例上的广泛实验表明,我们的方法优于最先进的技术。
cs.CV / 53 / 2603.04999

Physics-consistent deep learning for blind aberration recovery in mobile optics

物理一致性深度学习在移动光学中的盲像差恢复
Jhawar, Kartik, Tandoc, Tamo Sancho Miguel, Xuan, Khoo Jun, Lipo, Wang
Abstract
Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these "black-box" models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.
Chinese Translation
移动摄影常常受到复杂的、特定于镜头的光学像差的限制。尽管最近的深度学习方法将其视为一个端到端的去模糊任务,但这些“黑箱”模型缺乏明确的光学建模,可能会幻觉出细节。相反,经典的盲去卷积方法仍然高度不稳定。为了填补这一空白,我们提出了Lens2Zernike,这是一个深度学习框架,可以从单个模糊图像中盲恢复物理光学参数。据我们所知,之前的工作没有同时在三个不同的光学领域中整合监督。我们引入了一种新颖的物理一致性策略,通过直接的Zernike系数回归(z)、涵盖波前和点扩散函数推导的可微物理约束(p),以及辅助的多任务空间映射预测(m)显式最小化误差。通过对ResNet-18骨干网络的消融研究,我们证明我们的完整多任务框架(z+p+m)比仅使用系数的基线提高了35%。至关重要的是,比较分析显示,我们的方法在回归误差上显著优于文献中两种已建立的深度学习方法。最终,我们证明这些恢复的物理参数能够实现稳定的非盲去卷积,在专利的数字分子分析与科学研究所(IDMxS)移动相机镜头数据库中提供了显著的领域内改进,从严重像差的移动捕捉中恢复衍射极限细节。
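The Zernike-coefficient representation the abstract regresses can be illustrated by evaluating a wavefront from a few low-order terms. Only a small, hand-picked subset of Zernike modes is shown here; the term names and normalization follow common convention but are not taken from the paper.

```python
import numpy as np

def wavefront(rho, theta, coeffs):
    """Evaluate a wavefront W(rho, theta) on the unit pupil from a few
    low-order Zernike coefficients (illustrative subset only; the
    paper regresses a fuller coefficient vector). `coeffs` maps a term
    name to its coefficient; omitted terms contribute zero."""
    rho = np.asarray(rho, dtype=float)
    basis = {
        "piston":  np.ones_like(rho),
        "tilt_x":  2.0 * rho * np.cos(theta),
        "tilt_y":  2.0 * rho * np.sin(theta),
        "defocus": np.sqrt(3.0) * (2.0 * rho**2 - 1.0),
    }
    w = np.zeros_like(rho)
    for name, c in coeffs.items():
        w = w + c * basis[name]   # linear combination of Zernike modes
    return w
```

Because the wavefront is linear in the coefficients, regressing the coefficient vector and then deriving the point spread function from it can be made end-to-end differentiable, which is what the (z) and (p) supervision terms exploit.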
cs.CV / 54 / 2603.05010

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

生成图像修复的进展如何?关于其能力、局限性和评估实践的研究
Yin, Xiang, Hu, Jinfan, You, Zhiyuan, Yan, Kainan, Tang, Yu, Dong, Chao, Gu, Jinjin
Abstract
Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.
Chinese Translation
生成图像修复(Generative Image Restoration, GIR)在感知真实感方面取得了令人瞩目的成就,但与以往的方法相比,其实际能力究竟进展了多少?为了解答这个问题,我们提出了一项基于新的多维评估流程的大规模研究,该流程评估模型在细节、清晰度、语义正确性和整体质量等方面的表现。我们的分析涵盖了多种架构,包括基于扩散的、基于生成对抗网络(GAN)的、以峰值信噪比(PSNR)为导向的以及通用生成模型,揭示了关键的性能差异。此外,我们的分析还揭示了失败模式的关键演变,这标志着面向感知的低级视觉领域的范式转变。中心挑战是从以往的细节稀缺(生成不足)问题,演变到细节质量和语义控制的新前沿(防止生成过度)。我们还利用我们的基准训练了一个新的图像质量评估(IQA)模型,使其更好地与人类的感知判断相一致。最终,这项工作提供了对现代生成图像修复模型的系统研究,提供了重新定义我们对其真实状态理解的重要见解,并为未来的发展指明了方向。
cs.CV / 55 / 2603.05012

Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

Tell2Adapt:一种通过视觉基础模型实现源无关无监督领域适应的统一框架
Shi, Yulong, Li, Shijie, Li, Ziyi, Qi, Lin
Abstract
Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize to a unified, multi-modality, multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code is available at https://github.com/derekshiii/Tell2Adapt.
Chinese Translation
源无关无监督领域适应(SFUDA)对于在多样化临床环境中部署深度学习模型至关重要。然而,现有方法通常针对低差距、特定领域的迁移而设计,无法推广到统一的多模态和多目标框架,这对实际应用构成了重大障碍。为了解决这个问题,我们提出了Tell2Adapt,一种新颖的SFUDA框架,利用视觉基础模型(VFM)广泛且可推广的知识。我们的方法通过上下文感知提示正则化(CAPR)确保高保真度的VFM提示,稳健地将多样的文本提示转换为规范指令。这使得能够生成高质量的伪标签,以高效地将轻量级学生模型适应于目标领域。为了确保临床可靠性,该框架结合了视觉合理性精炼(VPR),利用VFM的解剖知识将适应模型的预测重新定位于目标图像的低级视觉特征,有效去除噪声和假阳性。我们进行了迄今为止最广泛的SFUDA评估之一,验证了我们的框架在10个领域适应方向和22个解剖目标(包括脑部、心脏、息肉和腹部目标)上的表现。我们的结果表明,Tell2Adapt在医学图像分割的统一SFUDA框架中始终优于现有方法,达到了最新的技术水平(SOTA)。代码可在 https://github.com/derekshiii/Tell2Adapt 获取。
cs.CV / 56 / 2603.05037

Generalizable Multiscale Segmentation of Heterogeneous Map Collections

异质地图集合的可推广多尺度分割
Petitpierre, Remi
Abstract
Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives to historical geographic studies.
Chinese Translation
历史地图集合在风格、尺度和地理焦点上高度多样,通常由许多单页文档组成。然而,大多数地图识别的研究集中在针对同质地图系列的专业模型上。相比之下,本文旨在开发可推广的语义分割模型和本体。首先,我们介绍了Semap,一个新的开放基准数据集,包含1,439个手动标注的补丁,旨在反映历史地图文档的多样性。其次,我们提出了一种分割框架,该框架结合了程序化数据合成和多尺度集成,以提高鲁棒性和可转移性。该框架在HCMSSD和Semap数据集上实现了最先进的性能,表明以多样性驱动的地图识别方法不仅可行,而且有益。结果表明,分割性能在地图集合、尺度、地理区域和出版背景之间保持相对稳定。通过提出基准数据集和历史地图的通用分割方法,本研究为将制图档案的长尾整合到历史地理研究中开辟了道路。
cs.CV / 57 / 2603.05041

Exploiting Intermediate Reconstructions in Optical Coherence Tomography for Test-Time Adaption of Medical Image Segmentation

利用光学相干断层成像中的中间重建进行医学图像分割的测试时适应
Pinetz, Thomas, Hucke, Veit, Bogunovic, Hrvoje
Abstract
Primary health care frequently relies on low-cost imaging devices, which are commonly used for screening purposes. To ensure accurate diagnosis, these systems depend on advanced reconstruction algorithms designed to approximate the performance of high-quality counterparts. Such algorithms typically employ iterative reconstruction methods that incorporate domain-specific prior knowledge. However, downstream task performance is generally assessed using only the final reconstructed image, thereby disregarding the informative intermediate representations generated throughout the reconstruction process. In this work, we propose IRTTA to exploit these intermediate representations at test-time by adapting the normalization-layer parameters of a frozen downstream network via a modulator network that conditions on the current reconstruction timescale. The modulator network is learned during test-time using an averaged entropy loss across all individual timesteps. Variation among the timestep-wise segmentations additionally provides uncertainty estimates at no extra cost. This approach enhances segmentation performance and enables semantically meaningful uncertainty estimation, all without modifying either the reconstruction process or the downstream model.
Chinese Translation
初级医疗保健通常依赖于低成本成像设备,这些设备通常用于筛查目的。为了确保准确诊断,这些系统依赖于先进的重建算法,旨在近似高质量设备的性能。这类算法通常采用迭代重建方法,结合领域特定的先验知识。然而,下游任务的性能通常仅通过最终重建图像进行评估,从而忽视了在重建过程中生成的有信息的中间表示。在本研究中,我们提出了IRTTA,通过一个调制网络在测试时适应冻结下游网络的归一化层参数,以利用这些中间表示,该调制网络根据当前重建时间尺度进行调节。调制网络在测试时通过对所有单个时间步的平均熵损失进行学习。时间步分割之间的变化还提供了不增加额外成本的不确定性估计。这种方法增强了分割性能,并使得语义上有意义的不确定性估计成为可能,且无需修改重建过程或下游模型。
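The two quantities the abstract describes, an entropy objective averaged over reconstruction timesteps and a free uncertainty estimate from the variation among timestep-wise predictions, can be sketched as below. The tensor layout `(T, C, H, W)` and the variance-based uncertainty are assumptions for illustration.

```python
import numpy as np

def entropy_tta_loss(probs_per_step):
    """Test-time objective (sketch): mean pixelwise prediction entropy,
    averaged across all reconstruction timesteps.
    probs_per_step: array (T, C, H, W) of per-step softmax outputs."""
    p = np.clip(np.asarray(probs_per_step, dtype=float), 1e-12, 1.0)
    ent = -(p * np.log(p)).sum(axis=1)   # (T, H, W) entropy per step/pixel
    return ent.mean()                    # averaged over steps and pixels

def timestep_uncertainty(probs_per_step):
    """Free uncertainty map (sketch): variance of class probabilities
    across timesteps, averaged over classes -> (H, W)."""
    p = np.asarray(probs_per_step, dtype=float)
    return p.var(axis=0).mean(axis=0)
```

Minimizing the averaged entropy only updates the modulator's normalization-layer parameters in the paper, so the frozen downstream network itself is never modified.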
cs.CV / 58 / 2603.05042

CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

CoIn3D:重新审视配置不变的多摄像头三维物体检测
Kuang, Zhaonian, Ding, Rui, Wang, Haotian, Zheng, Xinhu, Yang, Meng, Hua, Gang
Abstract
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
Chinese Translation
多摄像头三维物体检测(MC3D)随着多传感器物理代理(如机器人和自动驾驶车辆)的日益普及而受到越来越多的关注。然而,MC3D模型在面对新多摄像头配置的未见平台时仍然难以泛化。目前的解决方案仅采用元摄像头进行统一表示,但缺乏全面的考虑。本文重新审视了这一问题,并指出问题的关键在于源配置和目标配置之间的空间先验差异,包括不同的内参、外参和阵列布局。为此,我们提出了CoIn3D,一个可泛化的MC3D框架,能够实现从源配置到未见目标配置的强转移能力。CoIn3D通过空间感知特征调制(SFM)和摄像头感知数据增强(CDA)分别将所有识别出的空间先验显式地融入特征嵌入和图像观察中。SFM通过整合四种空间表示(如焦距、地面深度、地面梯度和普鲁克坐标)来丰富特征空间。CDA通过一种无训练的动态新视图图像合成方案,在各种配置下提高观察的多样性。大量实验表明,CoIn3D在NuScenes、Waymo和Lyft等地标数据集上,在BEVDepth、BEVFormer和PETR三种主流MC3D范式下,达到了强大的跨配置性能。
cs.CV / 59 / 2603.05053

CLIP-driven Zero-shot Learning with Ambiguous Labels

基于CLIP的模糊标签零样本学习
Fan, Jinfu, Li, Jiangnan, Yan, Xiaowen, Zhong, Xiaohui, Lu, Wenpeng, Huang, Linqing
Abstract
Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As the training goes on, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
Chinese Translation
零样本学习(ZSL)旨在通过利用已见类别的语义信息来识别未见类别,但大多数现有方法假设训练实例具有准确的类别标签。然而,在现实场景中,噪声和模糊标签会显著降低ZSL的性能。为此,我们提出了一种新的基于CLIP的部分标签零样本学习(CLIP-PZSL)框架,以处理标签模糊性。首先,我们使用CLIP提取实例和标签特征。然后,语义挖掘模块融合这些特征以提取区分性的标签嵌入。我们还引入了一种部分零样本损失,根据候选标签与实例的相关性为其分配权重,并对实例和标签嵌入进行对齐,以最小化语义不匹配。随着训练的进行,真实标签逐渐被识别,经过精炼的标签和标签嵌入反过来有助于改善实例和标签特征的语义对齐。在多个数据集上的全面实验表明了CLIP-PZSL的优势。
cs.CV / 60 / 2603.05058

A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset

基于颜色注意力RT-DETR和ABLDataset的360度多摄像头系统用于蓝色紧急灯检测
Vacalebri-Lloret, Francisco, Banchero, Lucas, Lopez, Jose J., Mossi, Jose M.
Abstract
This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.
Chinese Translation
本研究提出了一种先进的系统,用于检测紧急车辆上的蓝色灯光,该系统基于ABLDataset开发,该数据集包含了在各种气候和地理条件下的欧洲紧急车辆图像。该系统采用四个鱼眼摄像头的配置,每个摄像头具有180度的水平视场,安装在车辆的侧面。通过校准过程实现了检测的方位定位。此外,还对主要深度神经网络算法进行了比较分析,包括YOLO(v5、v8和v10)、RetinaNet、Faster R-CNN和RT-DETR。选择RT-DETR作为基础模型,并通过引入颜色注意力模块进行增强,在测试集上实现了94.7%的准确率和94.1%的召回率,现场测试检测距离达到70米。此外,该系统通过几何变换估计紧急车辆相对于汽车中心的接近角度。该系统旨在集成到一个结合视觉和声学数据的多模态系统中,已显示出高效性,为增强高级驾驶辅助系统(ADAS)和道路安全提供了有前景的方法。
cs.CV / 61 / 2603.05071

MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration

MI-DETR:一种基于生物启发的运动集成的红外小目标检测强基线
Liu, Nian, Gao, Jin, Lin, Shubo, Kou, Yutong, Zhang, Sikui, Ge, Fudong, Pu, Zhiqiang, Li, Liang, Wang, Gang, Wang, Yizheng, Hu, Weiming
Abstract
Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, an RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at https://github.com/nliu-25/MI-DETR.
Chinese Translation
红外小目标检测(ISTD)具有挑战性,因为微小且低对比度的目标容易被复杂和动态的背景遮挡。传统的多帧方法通常通过深度神经网络隐式学习运动,往往需要额外的运动监督或显式对齐模块。我们提出了运动集成 DETR(MI-DETR),这是一种生物启发的双通道检测器,它在每个时间步处理一帧红外图像,同时显式建模运动。首先,受视网膜启发的细胞自动机(RCA)将原始帧序列转换为在与外观图像相同像素网格上定义的运动图,从而使得类小细胞(parvocellular)外观和类大细胞(magnocellular)运动通路可以通过一组边界框进行监督,而无需额外的运动标签或对齐操作。其次,类小细胞-类大细胞互连(PMI)模块促进了两个通路之间的双向特征交互,提供了一种生物学上合理的中间互连机制。最后,RT-DETR 解码器在两个通路的特征上运行,以产生检测结果。令人惊讶的是,我们提出的简单而有效的方法在三个常用的 ISTD 基准上表现出色。MI-DETR 在 IRDST-H 上达到了 70.3% mAP@50 和 72.7% F1(比最佳多帧基线提高了 26.35 mAP@50),在 DAUB-R 上达到了 98.0% mAP@50,在 ITSDT-15K 上达到了 88.3% mAP@50,展示了生物启发的运动-外观集成的有效性。代码可在 https://github.com/nliu-25/MI-DETR 获取。
cs.CV / 62 / 2603.05075

UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

UniM:统一的任意对任意交错多模态基准
Li, Yanlin, Guo, Minghui, Zhang, Kaiwen, Zhang, Shize, Zhao, Yiran, Li, Haodong, Zhou, Congyue, Zheng, Weijie, Yan, Yushen, Wu, Shengqiong, Ji, Wei, Cui, Lei, Wei, Furu, Fei, Hao, Lee, Mong-Li, Hsu, Wynne
Abstract
In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
Chinese Translation
在现实世界的多模态应用中,系统通常需要理解用户任意组合和交错的多模态输入,同时以任意交错的多媒体形式生成输出。这种能力定义了在统一理解和生成范式下的任意对任意交错多模态学习的目标,为推动多模态大型语言模型(MLLMs)带来了新的挑战和机遇。为了促进和评估这种能力,本文介绍了UniM基准,这是第一个统一的任意对任意交错多模态数据集。UniM包含来自30个领域的31K高质量实例和7种代表性模态:文本、图像、音频、视频、文档、代码和3D,每种模态都需要多种交织的推理和生成能力。我们进一步介绍了UniM评估套件,该套件从三个维度评估模型:语义正确性与生成质量、响应结构完整性和交错一致性。此外,我们提出了UniMA,一个具备可追溯推理的代理基线模型,用于结构化交错生成。全面的实验表明了UniM的难度,并突出了推动统一任意对任意多模态智能的关键挑战和方向。项目页面为 https://any2any-mllm.github.io/unim。
cs.CV / 63 / 2603.05078

MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer

MoRe:运动感知前馈4D重建变换器
Fang, Juntong, Chen, Zequn, Zhang, Weiqi, Di, Donglin, Zhang, Xuancheng, Yang, Chengmin, Liu, Yu-Shen
Abstract
Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
Chinese Translation
重建动态4D场景仍然具有挑战性,因为移动物体会干扰相机姿态估计。现有的优化方法通过额外的监督来缓解这一问题,但它们大多数计算开销大且在实时应用中不切实际。为了解决这些局限性,我们提出了MoRe,一种前馈4D重建网络,能够高效地从单目视频中恢复动态3D场景。MoRe建立在强大的静态重建骨干网络之上,采用注意力强制策略将动态运动与静态结构分离。为了进一步增强鲁棒性,我们在包含动态和静态场景的大规模多样化数据集上对模型进行了微调。此外,我们的分组因果注意力捕捉时间依赖性,并适应跨帧变化的标记长度,确保时间上连贯的几何重建。在多个基准测试上的广泛实验表明,MoRe以卓越的效率实现了高质量的动态重建。
cs.CV / 64 / 2603.05081

Orthogonal Spatial-temporal Distributional Transfer for 4D Generation

正交时空分布转移用于4D生成
Liu, Wei, Wu, Shengqiong, Li, Bobo, Zhao, Haoyu, Fei, Hao, Lee, Mong-Li, Hsu, Wynne
Abstract
In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
Chinese Translation
在AIGC时代,生成高质量的4D内容引起了越来越多的研究关注。不幸的是,目前的4D合成研究受到缺乏大规模4D数据集的严重限制,阻碍了模型充分学习高质量4D生成所需的关键时空特征,从而妨碍了该领域的发展。为了解决这一问题,我们提出了一种新颖的框架,该框架从现有的3D扩散模型中转移丰富的空间先验,并从视频扩散模型中转移时间先验,以增强4D合成。我们开发了一种时空解耦的4D (STD-4D) 扩散模型,通过解耦的空间和时间潜变量合成具有4D意识的视频。为了促进最佳特征转移,我们设计了一种新颖的正交时空分布转移 (Orster) 机制,在该机制中,时空特征分布被仔细建模并注入到STD-4D扩散中。此外,在4D构建过程中,我们设计了一种时空感知的HexPlane (ST-HexPlane),以整合转移的时空特征,从而改善4D变形和4D高斯特征建模。实验表明,我们的方法显著优于现有方法,实现了更好的时空一致性和更高质量的4D合成。
cs.CV / 65 / 2603.05095

GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement

GEM-TFL:通过EM引导分解和时间精炼桥接弱监督与完全监督的伪造定位
Zhu, Xiaodong, Zheng, Yuanming, Wang, Suting, Yang, Junqi, Yang, Yuhong, Tu, Weiping, Wang, Zhongyuan
Abstract
Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
Chinese Translation
时间伪造定位(TFL)旨在精确识别视频或音频流中的操控片段,为多媒体取证和安全提供可解释的证据。虽然大多数现有的TFL方法依赖于完全监督下的密集帧级标签,但弱监督TFL(WS-TFL)通过仅从二元视频级标签中学习来降低标注成本。然而,当前的WS-TFL方法面临训练和推理目标不匹配、来自二元标签的监督有限、由于不可微分的top-k聚合导致的梯度阻塞,以及缺乏对提案间关系的明确建模等问题。为了解决这些问题,我们提出了GEM-TFL(基于图的EM驱动时间伪造定位),这是一个有效桥接训练与推理之间监督差距的两阶段分类-回归框架。在此基础上,(1)我们通过基于EM的优化过程将二元标签重新构造为多维潜在属性,从而增强弱监督;(2)我们引入了一种无训练的时间一致性精炼方法,以重新对齐帧级预测,实现更平滑的时间动态;(3)我们设计了一个基于图的提案精炼模块,建模提案之间的时间-语义关系,以实现全局一致的置信度估计。在基准数据集上的广泛实验表明,GEM-TFL在时间伪造定位方面实现了更准确和更稳健的结果,显著缩小了与完全监督方法之间的差距。
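The abstract's training-free temporal consistency refinement realigns frame-level predictions into smoother temporal dynamics. One simple instance of that idea is a centered moving average over per-frame forgery scores; the exact GEM-TFL rule is not reproduced here, and `smooth_frame_scores` is a hypothetical name used only for illustration.

```python
def smooth_frame_scores(scores, window=3):
    """Training-free temporal smoothing of per-frame forgery scores.

    Each output score is the mean of its centered window, truncated at
    the sequence boundaries, so isolated single-frame spikes are damped
    toward their temporal neighbourhood.
    """
    half = window // 2
    n = len(scores)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

# An isolated spike is pulled toward its neighbours' scores.
raw = [0.1, 0.1, 0.9, 0.1, 0.1]
```

Because the smoother has no learnable parameters, it can be applied at inference time without touching the trained classification-regression model.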
cs.CV / 66 / 2603.05105

Diff-ES: Stage-wise Structural Diffusion Pruning via Evolutionary Search

Diff-ES:通过进化搜索的分阶段结构化扩散剪枝
Liu, Zongfang, Tang, Shengkun, Wu, Zongliang, Yuan, Xin, Shen, Zhiqiang
Abstract
Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce Diff-ES, a stage-wise structural Diffusion pruning framework via Evolutionary Search, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models including depth and width pruning. Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.
Chinese Translation
扩散模型在高保真图像生成方面取得了显著成功,但由于其多步去噪过程和大型模型尺寸,计算需求仍然很高。尽管先前的研究通过减少采样步骤或压缩模型参数来提高效率,但现有的结构化剪枝方法仍然难以平衡实际加速与图像质量的保持。尤其是,像 MosaicDiff 这样的先前方法依赖于启发式的、手动调整的分阶段稀疏性调度,并在推理过程中拼接多个独立剪枝的模型,这增加了内存开销。然而,扩散步骤的重要性是高度不均匀且依赖于模型的。因此,基于简单启发式或经验观察得出的调度往往无法推广,可能导致次优性能。为此,我们提出了 Diff-ES,一种通过进化搜索的分阶段结构化扩散剪枝框架,旨在优化分阶段稀疏性调度,并通过内存高效的权重路由执行该调度,而无需模型复制。Diff-ES 将扩散轨迹划分为多个阶段,通过进化搜索自动发现最佳的分阶段稀疏性调度,并动态激活阶段条件权重,而无需复制模型参数。我们的框架自然与现有的扩散模型结构化剪枝方法集成,包括深度和宽度剪枝。在 DiT 和 SDXL 上的广泛实验表明,Diff-ES 始终能够实现实际运行时间上的加速,同时在生成质量上仅产生最小的降级,确立了结构化扩散模型剪枝的最新性能。
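The evolutionary search over stage-wise sparsity schedules can be sketched as a tiny (1+λ) loop. This is a toy illustration only: the real Diff-ES evaluates candidate schedules by pruning and running the diffusion model, whereas here `fitness` is any user-supplied scorer (higher is better) and `toy_fitness` is invented for the demo.

```python
import random

def evolve_schedule(num_stages, fitness, budget, generations=30, pop=8, seed=0):
    """Toy evolutionary search for a stage-wise sparsity schedule.

    Keeps a single best schedule, produces `pop` Gaussian mutations per
    generation, and accepts a child only when it improves the fitness.
    `budget` controls the total sparsity of the initial schedule.
    """
    rng = random.Random(seed)

    def random_schedule():
        s = [rng.random() for _ in range(num_stages)]
        scale = budget / sum(s)
        return [min(0.9, v * scale) for v in s]

    def mutate(s):
        child = list(s)
        i = rng.randrange(num_stages)
        # perturb one stage's sparsity, clamped to a valid range
        child[i] = min(0.9, max(0.0, child[i] + rng.gauss(0, 0.05)))
        return child

    best = random_schedule()
    best_fit = fitness(best)
    for _ in range(generations):
        for cand in [mutate(best) for _ in range(pop)]:
            f = fitness(cand)
            if f > best_fit:
                best, best_fit = cand, f
    return best

# Toy fitness: prefer pruning later stages while staying near the budget.
toy_fitness = lambda s: sum(i * v for i, v in enumerate(s)) - abs(sum(s) - 1.5)
```

With a fixed seed the search is deterministic, which makes schedule discovery reproducible across runs.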
cs.CV / 67 / 2603.05110

BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity

BLINK:NK细胞细胞毒性行为潜在建模
Nematollahi, Iman, Villena-Ossa, Jose Francisco, Moter, Alina, Farhadyar, Kiana, Kalweit, Gabriel, Valada, Abhinav, Cathomen, Toni, Ullrich, Evelyn, Kalweit, Maria
Abstract
Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.
Chinese Translation
细胞相互作用动态的机器学习模型有助于理解细胞行为。自然杀伤(NK)细胞的细胞毒性是这种相互作用动态的一个显著例子,通常通过时间分辨的多通道荧光显微镜进行研究。尽管肿瘤细胞死亡事件可以在单帧中进行注释,但NK细胞的细胞毒性结果是随着时间从细胞相互作用中产生的,无法仅通过逐帧分类可靠推断。我们提出了BLINK,这是一种基于轨迹的递归状态空间模型,作为NK-肿瘤相互作用的细胞世界模型。BLINK从部分观察到的NK-肿瘤相互作用序列中学习潜在的相互作用动态,并预测累积成细胞毒性结果的凋亡增量。在长期延时拍摄的NK-肿瘤记录上的实验显示出改进的细胞毒性结果检测,并能够预测未来结果,同时提供了一个可解释的潜在表示,将NK轨迹组织成一致的行为模式和时间结构化的相互作用阶段。BLINK为在单细胞水平上定量评估和结构建模NK细胞毒性行为提供了一个统一的框架。
cs.CV / 68 / 2603.05114

UniPAR: A Unified Framework for Pedestrian Attribute Recognition

UniPAR:一个统一的行人属性识别框架
Xu, Minghe, Wu, Rouying, Xu, Jiarui, Sun, Minhao, Yan, Zikang, Wang, Xiao, Chu, ChiaWei, Li, Yu
Abstract
Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the "one-model-per-dataset" paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model's cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
Chinese Translation
行人属性识别是计算机视觉中的一个基础任务,为下游应用提供了重要支持,包括视频监控中的人物检索和智能零售分析。然而,现有研究通常受到“每个数据集一个模型”范式的限制,并且在处理不同领域之间在模态、属性定义和环境场景方面的显著差异时面临困难。为了解决这些挑战,我们提出了UniPAR,一个基于Transformer的统一框架用于行人属性识别(PAR)。通过结合统一的数据调度策略和动态分类头,UniPAR使单个模型能够同时处理来自异构模态的多样数据集,包括RGB图像、视频序列和事件流。我们还引入了一种创新的分阶段融合编码器,通过后期深度融合策略明确对齐视觉特征与文本属性查询。对广泛使用的基准数据集(包括MSP60K、DukeMTMC和EventPAR)的实验结果表明,UniPAR的性能与专门的最先进方法相当。此外,多数据集联合训练显著增强了模型的跨域泛化能力和在低光照和运动模糊等极端环境下的识别鲁棒性。本文的源代码将发布在https://github.com/Event-AHU/OpenPAR
cs.CV / 69 / 2603.05135

SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning

SRasP:用于跨域少样本学习的自我重定向对抗风格扰动
Li, Wenqian, Fang, Pengfei, Xue, Hui
Abstract
Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima. To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial Style Perturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.
Chinese Translation
跨域少样本学习(CD-FSL)旨在将知识从已见源域转移到未见目标域,是评估模型鲁棒性和可转移性的关键基准。现有的基于风格的扰动方法虽然能够缓解域偏移,但往往面临梯度不稳定和收敛到尖锐极小值的问题。为了解决这些局限性,我们提出了一种新颖的裁剪块-全局风格扰动网络,称为自我重定向对抗风格扰动(SRasP)。具体而言,SRasP利用全局语义指导来识别不连贯的裁剪块,然后在同一图像内对这些裁剪块的风格梯度进行重定向,并与全局风格梯度聚合。此外,我们提出了一种新颖的多目标优化函数,以最大化视觉差异,同时强制全局、裁剪块和对抗特征之间的语义一致性。在训练过程中应用稳定的扰动有助于收敛到更平坦且更具可转移性的解,从而改善对未见域的泛化能力。在多个CD-FSL基准上进行的广泛实验表明,SRasP相较于最先进的方法取得了持续的性能提升。
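Reorienting crop style gradients against the global style gradient can be sketched in the spirit of gradient surgery: when a crop gradient conflicts with the global direction, remove its conflicting component before aggregating. This is only an illustrative reading of the abstract, not the exact SRasP rule, and the function name is hypothetical.

```python
def reorient_and_aggregate(global_grad, crop_grads):
    """Reorient crop style gradients against a global style gradient.

    A crop gradient whose dot product with the global gradient is
    negative has its opposing component projected out; the reoriented
    crop gradients are then averaged together with the global one.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    g_norm_sq = dot(global_grad, global_grad)
    reoriented = []
    for g in crop_grads:
        d = dot(g, global_grad)
        if d < 0 and g_norm_sq > 0:
            # subtract the component that opposes the global direction
            g = [gi - d / g_norm_sq * ggi
                 for gi, ggi in zip(g, global_grad)]
        reoriented.append(g)
    n = len(reoriented) + 1
    return [(gg + sum(g[i] for g in reoriented)) / n
            for i, gg in enumerate(global_grad)]
```

By construction the aggregated gradient never points against the global style gradient, which is one way to obtain the stabilized perturbations the abstract describes.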
cs.CV / 70 / 2603.05147

Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models

行动、思考或弃权:视觉-语言-行动模型的复杂性感知自适应推理
Izzo, Riccardo Andrea, Bardaro, Gianluca, Matteucci, Matteo
Abstract
Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.
Chinese Translation
当前对视觉-语言-行动(VLA)模型的研究主要集中在通过已建立的推理技术来增强模型的泛化能力。尽管这些改进有效,但不可避免地增加了计算复杂性和推理延迟。此外,这些机制通常是无差别地应用,导致在处理琐碎任务时资源分配效率低下,同时未能提供防止在分布外任务中发生灾难性失败所需的不确定性估计。受到人类认知的启发,我们提出了一种自适应框架,该框架根据感知状态的复杂性动态路由 VLA 执行。我们的方法通过将潜在嵌入投影到一组参数和非参数估计器中,将 VLA 的视觉-语言骨干转变为一个主动检测工具。这使得系统能够立即执行已知任务(行动),对模糊场景进行推理(思考),并在遇到显著的物理或语义异常时预先中止执行(弃权)。在我们的实证分析中,我们观察到一个现象:由于语言的语义不变性,仅使用视觉嵌入在推断任务复杂性方面更为优越。在 LIBERO 和 LIBERO-PRO 基准测试以及真实机器人上进行评估时,我们的仅视觉配置仅使用 5% 的训练数据即达到了 80% 的 F1 分数,确立了其作为可靠且高效的任务复杂性检测器的地位。
cs.CV / 71 / 2603.05152

SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction

SSR-GS:用于光滑表面重建的高斯泼溅中的镜面反射分离
Fan, Ningjing, Wang, Yiqun
Abstract
In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.
Chinese Translation
近年来,3D高斯泼溅(3DGS)在新视角合成方面取得了显著进展。然而,在复杂照明条件下准确重建光滑表面仍然具有挑战性,特别是在存在强镜面反射和多表面相互反射的场景中。为了解决这一问题,我们提出了SSR-GS,一个用于光滑表面重建的镜面反射建模框架。具体而言,我们引入了一种预过滤的Mip-Cubemap,以高效建模直接镜面反射,并提出了IndiASG模块以捕捉间接镜面反射。此外,我们设计了视觉几何先验(VGP),通过反射评分(RS)耦合反射感知视觉先验,以降低反射主导区域的光度损失贡献,同时结合来自VGGT的几何先验,包括逐渐衰减的深度监督和变换法线约束。在合成和真实世界数据集上的大量实验表明,SSR-GS在光滑表面重建方面实现了最先进的性能。
cs.CV / 72 / 2603.05157

The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis

预处理方法对胸部X光(CXR)诊断中种族编码和模型鲁棒性的影响
Sutariya, Dishantkumar, Petersen, Eike
Abstract
Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse - non-localized and distributed throughout the whole CXR recording - image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.
Chinese Translation
深度学习模型能够从胸部X光(CXR)记录中高精度地识别种族身份。因此,人们普遍关注种族快捷学习的潜在风险,即模型无意中学习到系统性地根据种族身份偏见其诊断预测。这种种族偏见威胁到医疗公平性和模型可靠性,因为模型可能系统性地误诊某些人口群体。由于种族快捷学习是分散的——非局部化并分布在整个CXR记录中——图像预处理方法可能会影响种族快捷学习,但此类方法在减少偏见方面的潜力仍未得到充分探索。在此,我们研究了包括肺部掩膜、肺部裁剪和对比度限制自适应直方图均衡(CLAHE)在内的图像预处理方法的效果。这些方法旨在抑制编码种族信息的虚假线索,同时保持诊断准确性。我们的实验表明,基于简单边界框的肺部裁剪可以成为减少种族快捷学习的有效策略,同时保持诊断模型性能,避免了常被假设的公平性与准确性之间的权衡。
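The bounding-box lung cropping that the paper identifies as an effective mitigation can be sketched directly. The crop assumes a binary lung mask from an upstream segmenter (represented here as a 0/1 grid of the same shape as the image); the function name is illustrative, not from the paper's code.

```python
def crop_to_lung_bbox(image, mask, margin=0):
    """Crop a 2D image to the bounding box of a binary lung mask.

    Keeps only the rows and columns where the mask is nonzero, with an
    optional margin, so diffuse cues outside the lung region are removed
    while the diagnostically relevant area is preserved.
    """
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0]))
            if any(mask[i][j] for i in range(len(mask)))]
    if not rows or not cols:
        return image  # empty mask: leave the image unchanged
    r0 = max(rows[0] - margin, 0)
    r1 = min(rows[-1] + margin, len(mask) - 1)
    c0 = max(cols[0] - margin, 0)
    c1 = min(cols[-1] + margin, len(mask[0]) - 1)
    return [row[c0:c1 + 1] for row in image[r0:r1 + 1]]

# Tiny worked example: the crop keeps only the masked region.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
msk = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 0]]
```

In a real pipeline the crop would be applied before resizing and normalization, so the downstream classifier never sees the out-of-lung background.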
cs.CV / 73 / 2603.05159

Generic Camera Calibration using Blurry Images

基于模糊图像的通用相机标定
Shi, Zezhun
Abstract
Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric camera calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a first attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving the translational ambiguity that need not be considered in conventional image deblurring tasks. Experimental results validate the effectiveness of our approach.
Chinese Translation
相机标定是三维视觉的基础。与参数化相机标定相比,通用相机标定可以获得更准确的结果。然而,使用打印的标定板对通用相机模型进行标定所需的图像数量远超过参数化标定,这使得对于个体用户来说,运动模糊几乎是不可避免的。作为解决这一问题的首次尝试,我们借助几何约束和局部参数化照明模型,同时估计特征位置和空间变化的点扩散函数,并解决传统图像去模糊任务中无需考虑的平移歧义。实验结果验证了我们方法的有效性。
cs.CV / 74 / 2603.05181

Mario: Multimodal Graph Reasoning with Large Language Models

Mario:基于大语言模型的多模态图推理
Sun, Yuanfu, Li, Kang, Guo, Pengkang, Liu, Jiajin, Tan, Qiaoyu
Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
Chinese Translation
近期大语言模型(LLMs)的进展为多模态推理开辟了新的途径。然而,大多数现有方法仍依赖于预训练的视觉-语言模型(VLMs)来孤立地编码图像-文本对,忽视了现实世界多模态数据自然形成的关系结构。这促使我们在多模态图(MMGs)上进行推理,其中每个节点具有文本和视觉属性,边提供结构线索。在保持图拓扑的同时,使基于LLM的推理能够处理这种异构多模态信号,带来了两个关键挑战:解决弱跨模态一致性和处理异构模态偏好。为了解决这些问题,我们提出了Mario,一个统一框架,能够同时解决上述两个挑战,并实现对MMGs的有效LLM推理。Mario由两个创新阶段组成。首先,设计了一种图条件VLM,通过图拓扑引导的细粒度跨模态对比学习共同优化文本和视觉特征。其次,采用了一种模态自适应图指令调优机制,将对齐的多模态特征组织成图感知的指令视图,并使用可学习的路由器为每个节点及其邻域向LLM呈现最具信息量的模态配置。在多种MMG基准上进行的广泛实验表明,Mario在节点分类和链接预测的监督和零样本场景中始终优于最先进的图模型。代码将发布在 https://github.com/sunyuanfu/Mario。
cs.CV / 75 / 2603.05184

Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule

Logi-PAR:通过可微分规则的逻辑注入患者活动识别
Zarar, Muhammad, Zhang, MingZheng, Zhang, Xiaowang, Feng, Zhiyong, Yitagesu, Sofonias, Farooq, Kawsar
Abstract
Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling implicitly emergent patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi-PAR.git
Chinese Translation
临床环境中的患者活动识别(PAR)利用活动数据来提高安全性和护理质量。尽管已经取得了显著进展,但当前模型主要识别正在发生的活动。它们通常使用全局和局部注意机制在空间上组合子稀疏视觉线索,但由于其神经网络管道,仅学习到逻辑隐含模式。提升临床安全性需要能够推断一组视觉线索为何暗示风险的方法,以及如何通过超越单纯分类的显式逻辑进行组合推理。为了解决这个问题,我们提出了Logi-PAR,这是第一个逻辑注入患者活动识别框架,它将上下文事实融合作为多视角基元提取器,并注入神经引导的可微分规则。我们的方法自动从视觉线索中学习规则,端到端优化,同时在训练过程中使隐含的涌现模式能够被显式标记。据我们所知,Logi-PAR是第一个通过将可学习逻辑规则应用于符号映射来识别患者活动的框架。它生成可审计的“为何”解释作为规则追踪,并支持反事实干预(例如,如果有人协助,风险将降低65%)。在临床基准(VAST和OmniFall)上的广泛评估表明其具有最先进的性能,显著超越了视觉-语言模型和变换器基线。代码可通过以下链接获取:https://github.com/zararkhan985/Logi-PAR.git
cs.CV / 76 / 2603.05202

Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation

用于去偏的半监督医学图像分割的语义类别分布学习
Su, Yingxue, Zhong, Yiheng, Zhu, Keying, Zhang, Zimu, Zhang, Zhuoru, Wang, Yifang, Zhang, Yuxin, Liu, Jingxin
Abstract
Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at https://github.com/Zyh55555/SCDL.
Chinese Translation
医学图像分割对于计算机辅助诊断至关重要。然而,密集的像素级标注既耗时又昂贵,医学数据集往往表现出严重的类别不平衡。这种不平衡导致少数类结构在特征表示中被主导类别所淹没,阻碍了判别特征的学习,使得可靠的分割特别具有挑战性。为了解决这个问题,我们提出了语义类别分布学习(Semantic Class Distribution Learning, SCDL)框架,这是一个即插即用的模块,通过学习结构化的类别条件特征分布来减轻监督和表示偏差。SCDL集成了类别分布双向对齐(Class Distribution Bidirectional Alignment, CDBA),以将嵌入与可学习的类别代理对齐,并利用语义锚约束(Semantic Anchor Constraints, SAC)通过标记数据指导代理。对Synapse和AMOS数据集的实验表明,SCDL在整体和类别级指标上显著提高了分割性能,尤其是在少数类上取得了显著提升,达到了最先进的结果。我们的代码已发布在https://github.com/Zyh55555/SCDL。
cs.CV / 77 / 2603.05219

SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery

SPyCer:一种半监督物理引导的上下文注意力机制,用于从卫星图像估计近地表空气温度
Bouaziz, Sofiane, Hafiane, Adel, Canals, Raphael, Nedjai, Rachid
Abstract
Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting. Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
Chinese Translation
现代地球观测依赖卫星捕捉详细的地表特性。然而,许多影响人类和生态系统的现象发生在靠近地表的大气中。近地传感器能够准确测量某些环境特征,例如近地表空气温度(NSAT)。然而,这些传感器分布稀疏且不均匀,限制了它们提供连续空间测量的能力。为了解决这一问题,我们提出了SPyCer,一种半监督物理引导网络,能够利用像素信息和物理建模,通过有意义的物理特性引导学习过程。SPyCer旨在通过卫星图像对NSAT进行连续估计。SPyCer将NSAT预测构建为一个逐像素的视觉问题,其中每个近地传感器被投影到卫星图像坐标上,并位于局部图像块的中心。相应的传感器像素通过观测到的NSAT和基于物理的约束进行监督,同时周围像素通过源自表面能量平衡和对流-扩散-反应偏微分方程的物理引导正则化做出贡献。为了捕捉邻近像素的物理影响,SPyCer采用了多头注意力机制,该机制由土地覆盖特征引导,并通过高斯距离加权进行调制。在真实世界数据集上的实验表明,SPyCer产生了空间一致且物理上合理的NSAT估计,在准确性、泛化能力和与基础物理过程的对齐方面超越了现有基线。
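The Gaussian distance weighting mentioned in the abstract admits a minimal sketch: attention scores over a sensor-centered patch are modulated by a Gaussian of the distance to the central pixel, so nearby pixels contribute more. This is an illustrative toy, not the paper's implementation; the patch size, `sigma`, and renormalization are assumptions.

```python
import numpy as np

def gaussian_distance_weights(size: int, sigma: float) -> np.ndarray:
    """Gaussian weights over a square patch, centered on the sensor pixel."""
    c = size // 2
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (ys - c) ** 2 + (xs - c) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def modulated_attention(scores: np.ndarray, sigma: float) -> np.ndarray:
    """Multiply raw attention scores by the distance weights, then renormalize."""
    w = gaussian_distance_weights(scores.shape[0], sigma)
    m = scores * w
    return m / m.sum()

rng = np.random.default_rng(0)
scores = rng.random((9, 9))            # raw attention over a 9x9 patch
attn = modulated_attention(scores, sigma=2.0)
```

The center pixel keeps its full score (weight 1) while distant pixels are attenuated, which matches the intuition of physically local influence.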
cs.CV / 78 / 2603.05230

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

数字孪生驱动的自动化分拣系统中的纺织品分类与异物识别
Ergun, Serkan, Mitterer, Tobias, Zangl, Hubert
Abstract
The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital twin driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real world textile classification. A dual arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9 %), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
Chinese Translation
对可持续纺织品回收日益增长的需求,要求有能够处理可变形服装并在杂乱环境中检测异物的强大自动化解决方案。本研究提出了一种数字孪生驱动的机器人分拣系统,该系统集成了抓取预测、多模态感知和语义推理,以实现现实世界中的纺织品分类。该系统采用双臂机器人单元,配备RGBD传感器、电容式触觉反馈和碰撞感知运动规划,能够自主地将服装从未分类的篮子中分离,转移到检查区,并使用最先进的视觉语言模型(Visual Language Models, VLMs)进行分类。我们在包含223个检查场景的数据集上对来自五个模型家族的九个VLM进行了基准测试,这些场景包括衬衫、袜子、裤子、内衣、异物(包括不属于上述类别的服装)以及空场景。评估指标包括每类的准确性、幻觉行为和在实际硬件限制下的计算性能。结果表明,Qwen模型家族实现了最高的整体准确率(高达87.9%),并在异物检测性能上表现出色,而像Gemma3这样的轻量级模型则在边缘部署中提供了竞争性的速度与准确性权衡。结合MoveIt的数字孪生实现了碰撞感知路径规划,并将检查过的服装的分割3D点云集成到虚拟环境中,以提高操作的可靠性。所展示的系统证明了将语义VLM推理与传统抓取检测和数字孪生技术结合的可行性,从而实现可扩展的、自动化的纺织品分类,适用于现实工业环境。
cs.CV / 79 / 2603.05255

CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception

CATNet:用于协同感知的协作对齐与转换网络
Chen, Gong, Zhang, Chaokun, Tang, Tao, Lv, Pengcheng, Li, Feng, Xie, Xin
Abstract
Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.
Chinese Translation
协同感知通过整合来自不同智能体的互补信息显著增强了场景理解。然而,现有研究往往忽视了现实世界多源数据集成中固有的关键挑战,特别是高时间延迟和多源噪声。为了解决这些实际限制,我们提出了协作对齐与转换网络(Collaborative Alignment and Transformation Network,CATNet),这是一种自适应补偿框架,旨在解决多智能体系统中的时间延迟和噪声干扰。我们的关键创新可以总结为三个方面。首先,我们引入了一种时空递归同步(Spatio-Temporal Recurrent Synchronization,STSync),通过相邻帧差分建模对异步特征流进行对齐,从而建立一个时空统一的表示空间。其次,我们设计了一种双分支小波增强去噪器(Dual-Branch Wavelet Enhanced Denoiser,WTDen),该去噪器能够抑制全局噪声并重建对齐表示中的局部特征失真。第三,我们构建了一种自适应特征选择器(Adaptive Feature Selector,AdpSel),该选择器动态关注关键感知特征以实现稳健的融合。在多个数据集上的大量实验表明,CATNet在复杂交通条件下始终优于现有方法,证明了其卓越的鲁棒性和适应性。
cs.CV / 80 / 2603.05256

Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Wiki-R1:通过数据和采样课程激励基于知识的多模态推理用于视觉问答
Ning, Shan, Qiu, Longtian, He, Xuming
Abstract
Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
Chinese Translation
基于知识的视觉问答(KB-VQA)要求模型通过整合外部知识来回答关于图像的问题,这由于噪声检索和知识库的结构化、百科全书式的特性而面临重大挑战。这些特性导致与预训练的多模态大语言模型(MLLMs)之间存在分布差距,使得在后训练阶段有效推理和领域适应变得困难。在本研究中,我们提出了Wiki-R1,一种基于数据生成的课程强化学习框架,系统性地激励MLLMs在KB-VQA中的推理能力。Wiki-R1构建了一系列与模型不断发展的能力相一致的训练分布,弥合了从预训练到KB-VQA目标分布的差距。我们引入了“可控课程数据生成”,该方法操控检索器以生成具有所需难度水平的样本,以及一种“课程采样策略”,选择在强化学习更新过程中可能产生非零优势的信息样本。样本难度通过观察到的奖励进行估计,并传播到未观察样本以指导学习。在两个KB-VQA基准测试(Encyclopedic VQA和InfoSeek)上的实验表明,Wiki-R1达到了新的最先进结果,将Encyclopedic VQA的准确率从35.5%提升至37.1%,将InfoSeek的准确率从40.1%提升至44.1%。项目页面可访问 https://artanic30.github.io/project_pages/WikiR1/。
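The "non-zero advantages" criterion can be illustrated with a group-relative RL setup: if every rollout sampled for a prompt receives the same reward, the group-normalized advantage is zero everywhere and the sample teaches nothing. A hypothetical sketch (the paper's actual selection rule, which also propagates difficulty to unobserved samples, is richer than this):

```python
import numpy as np

def select_informative(prompts_rewards):
    """Keep prompt indices whose sampled rollouts have non-identical rewards,
    i.e., whose group-normalized advantages are not all zero."""
    return [i for i, r in enumerate(prompts_rewards) if np.std(r) > 0]

# Four prompts, four sampled rollouts each: all-correct and all-wrong
# prompts carry no learning signal and are filtered out.
rewards = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
kept = select_informative([np.array(r, dtype=float) for r in rewards])
```

Only the prompts with mixed outcomes survive, which is what makes curriculum-matched difficulty valuable for RL updates.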
cs.CV / 81 / 2603.05280

Layer by layer, module by module: Choose both for optimal OOD probing of ViT

逐层、逐模块:双重选择以优化ViT的OOD探测
Odonnat, Ambroise, Feofanov, Vasilii, Chapel, Laetitia, Tavenard, Romain, Redko, Ievgen
Abstract
Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
Chinese Translation
近期研究观察到,基础模型的中间层通常比最终层产生更具判别性的表示。虽然最初将其归因于自回归预训练,但这一现象也在通过监督和判别自监督目标训练的模型中得到了确认。在本文中,我们进行了一项全面研究,以分析预训练视觉变换器中间层的行为。通过在多样化的图像分类基准上进行广泛的线性探测实验,我们发现预训练与下游数据之间的分布变化是深层性能下降的主要原因。此外,我们在模块级别进行了细致分析。我们的发现表明,标准的变换器块输出探测并非最佳;相反,在显著的分布变化下,探测前馈网络中的激活能获得最佳性能,而在变化较弱时,多头自注意力模块的归一化输出则表现最佳。
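The linear-probing protocol behind such layer-wise studies can be sketched on synthetic per-layer features: fit a closed-form ridge probe on each layer's features and compare validation accuracy. The "separability" knob standing in for a layer's quality under distribution shift, and the ridge form itself, are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))      # 3 classes, 16-dim features

def features(y, sep):
    """Synthetic per-layer features: class centers scaled by separability `sep`."""
    return sep * centers[y] + rng.normal(size=(len(y), 16))

def ridge_probe(X, y, lam=1e-2):
    """Closed-form ridge regression probe on one-hot targets."""
    Y = np.eye(3)[y]
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def accuracy(W, X, y):
    return float((np.argmax(X @ W, axis=1) == y).mean())

y_tr = rng.integers(0, 3, 400)
y_va = rng.integers(0, 3, 200)
# Pretend the intermediate layer stayed discriminative (sep=3.0) while the
# final layer degraded under distribution shift (sep=0.5).
layers = {"intermediate": 3.0, "final": 0.5}
scores = {}
for name, sep in layers.items():
    X_tr, X_va = features(y_tr, sep), features(y_va, sep)
    scores[name] = accuracy(ridge_probe(X_tr, y_tr), X_va, y_va)
best = max(scores, key=scores.get)
```

Sweeping this probe over every layer (and, per the paper, over sub-modules within each block) is what surfaces the best representation for a given downstream shift.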
cs.CV / 82 / 2603.05305

Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation

Fusion4CA:通过全面利用图像提升3D物体检测
Luo, Kang, Chen, Xin, Xiao, Yangyi, Wang, Hesheng
Abstract
Nowadays, an increasing number of works fuse LiDAR and RGB data in the bird's-eye view (BEV) space for 3D object detection in autonomous driving systems. However, existing methods suffer from over-reliance on the LiDAR branch, with insufficient exploration of RGB information. To tackle this issue, we propose Fusion4CA, which is built upon the classic BEVFusion framework and dedicated to fully exploiting visual input with plug-and-play components. Specifically, a contrastive alignment module is designed to calibrate image features with 3D geometry, and a camera auxiliary branch is introduced to mine RGB information sufficiently during training. For further performance enhancement, we leverage an off-the-shelf cognitive adapter to make the most of pretrained image weights, and integrate a standard coordinate attention module into the fusion stage as a supplementary boost. Experiments on the nuScenes dataset demonstrate that our method achieves 69.7% mAP with only 6 training epochs and a mere 3.48% increase in inference parameters, yielding a 1.2% improvement over the baseline which is fully trained for 20 epochs. Extensive experiments in a simulated lunar environment further validate the effectiveness and generalization of our method. Our code will be released through Fusion4CA.
Chinese Translation
如今,越来越多的研究在鸟瞰视角(BEV)空间中融合LiDAR和RGB数据,以实现自动驾驶系统中的3D物体检测。然而,现有方法过于依赖LiDAR分支,对RGB信息的探索不足。为了解决这一问题,我们提出了Fusion4CA,该方法基于经典的BEVFusion框架,旨在通过即插即用组件充分利用视觉输入。具体而言,我们设计了一个对比对齐模块,以将图像特征与3D几何进行校准,并引入一个相机辅助分支,以在训练过程中充分挖掘RGB信息。为了进一步提升性能,我们利用现成的认知适配器,充分利用预训练的图像权重,并在融合阶段集成一个标准坐标注意力模块作为补充提升。在nuScenes数据集上的实验表明,我们的方法在仅经过6个训练周期的情况下实现了69.7%的mAP,并且推理参数仅增加了3.48%,相较于完全训练20个周期的基线提升了1.2%。在模拟月球环境中的广泛实验进一步验证了我们方法的有效性和泛化能力。我们的代码将通过Fusion4CA发布。
cs.CV / 83 / 2603.05315

Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

频率感知的误差界限缓存加速扩散变换器
Li, Guandong
Abstract
Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal -- sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth -- consecutive caching decisions lead to cascading approximation errors; and (3) feature -- different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)已成为高质量图像和视频生成的主流架构,但其迭代去噪过程在推理时会产生相当大的计算成本。现有的缓存方法通过在时间步之间重用中间计算来加速 DiTs,但它们有一个共同的局限性:将去噪过程视为在时间、深度和特征维度上均匀。本文识别了 DiT 去噪中的三个正交非均匀性轴:(1) 时间轴——对缓存错误的敏感性在去噪轨迹中变化显著;(2) 深度轴——连续的缓存决策导致级联近似误差;(3) 特征轴——隐藏状态的不同组件表现出异质的时间动态。基于这些观察,我们提出了 SpectralCache,一个统一的缓存框架,包括时间步感知动态调度(Timestep-Aware Dynamic Scheduling, TADS)、累积误差预算(Cumulative Error Budgets, CEB)和频率分解缓存(Frequency-Decomposed Caching, FDC)。在 512x512 分辨率的 FLUX.1-schnell 上,SpectralCache 实现了 2.46 倍的加速,LPIPS 为 0.217,SSIM 为 0.727,速度比 TeaCache(2.12 倍,LPIPS 0.215,SSIM 0.734)提高了 16%,同时保持了可比的质量(LPIPS 差异 < 1%)。我们的方法无需训练、即插即用,且与现有的 DiT 架构兼容。
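Of the three components, a cumulative error budget admits a simple illustration: reuse cached computations while the accumulated approximation error stays under a budget, and recompute (resetting the running error) once the budget would be exceeded, which bounds the depth-axis error cascade. This is a hypothetical sketch of the budgeting idea only, not the paper's actual scheduler or error estimator.

```python
def plan_cache_schedule(errors, budget):
    """Per-timestep reuse/recompute decisions under a cumulative error budget.

    `errors[t]` is the estimated error of reusing the cache at step t.
    Reuse while the running error stays within `budget`; otherwise
    recompute, which resets the accumulated error to zero.
    """
    schedule, acc = [], 0.0
    for e in errors:
        if acc + e <= budget:
            schedule.append("reuse")
            acc += e
        else:
            schedule.append("recompute")
            acc = 0.0
    return schedule

# Illustrative per-step reuse errors along a denoising trajectory.
errors = [0.1, 0.1, 0.3, 0.1, 0.4, 0.1]
schedule = plan_cache_schedule(errors, budget=0.4)
```

A larger budget trades quality for speed: more consecutive "reuse" steps mean fewer full transformer evaluations.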
cs.CV / 84 / 2603.05330

Dark3R: Learning Structure from Motion in the Dark

Dark3R:在黑暗中从运动中学习结构
Guo, Andrew Y, Malik, Anagh, Tedla, SaiKiran, Dai, Yutong, Qin, Yiqian, Salehe, Zach, Attal, Benjamin, Nousias, Sotiris, Kutulakos, Kyros, Lindell, David B.
Abstract
We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below -4 dB -- a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher-student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy-clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson-Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes ~42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
Chinese Translation
我们提出了Dark3R,这是一个在黑暗中进行运动结构恢复的框架,能够直接处理信噪比(SNR)低于-4 dB的原始图像——在这一范围内,传统的基于特征和学习的方法失效。我们的关键见解是通过教师-学生蒸馏过程,将大规模的3D基础模型适应于极低光照条件,从而实现低光照下的鲁棒特征匹配和相机姿态估计。Dark3R不需要3D监督;它仅使用噪声-干净原始图像对进行训练,这些图像对既可以直接采集,也可以通过对曝光良好的原始图像施加简单的泊松-高斯噪声模型来合成。为了训练和评估我们的方法,我们引入了一个新的曝光包围数据集,其中包含约42,000个多视角原始图像及其真实的3D标注,并且我们证明Dark3R在低SNR范围内实现了最先进的运动结构恢复。此外,我们还展示了使用Dark3R预测的姿态和粗到细的辐射场优化过程,在黑暗中实现了最先进的新视图合成。
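The Poisson-Gaussian synthesis step mentioned in the abstract (signal-dependent shot noise plus additive read noise applied to a well-exposed raw frame) can be sketched as below. The parameter values are illustrative, not the paper's calibration.

```python
import numpy as np

def poisson_gaussian_noise(raw, photons_per_unit=20.0, read_sigma=0.02, seed=0):
    """Synthesize a low-SNR raw frame from a well-exposed one.

    Shot noise: Poisson on the expected photon count, rescaled back to the
    raw intensity range. Read noise: additive Gaussian.
    """
    rng = np.random.default_rng(seed)
    shot = rng.poisson(raw * photons_per_unit) / photons_per_unit
    return shot + rng.normal(0.0, read_sigma, size=raw.shape)

clean = np.random.default_rng(1).random((8, 8))   # stand-in well-exposed raw patch
noisy = poisson_gaussian_noise(clean)
```

Lowering `photons_per_unit` (fewer photons per intensity unit) drives the synthesized frames toward the extreme low-SNR regime the paper targets.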
cs.CV / 85 / 2603.05384

ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

ORMOT:全向指代多目标跟踪的数据集与框架
Chen, Sijia, Zhou, Zihan, Yu, Yanqiu, Yu, En, Tao, Wenbing
Abstract
Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
Chinese Translation
多目标跟踪(MOT)是计算机视觉中的一项基础任务,旨在跨视频帧跟踪目标。现有的MOT方法在一般视觉场景中表现良好,但在扩展到视觉-语言设置时面临显著的挑战和局限性。为了解决这一问题,最近提出了指代多目标跟踪(RMOT)任务,旨在跟踪与语言描述相对应的对象。然而,目前的RMOT方法主要是在传统相机捕获的数据集上开发的,这些数据集存在视场(FoV)有限的问题。这一限制常常导致目标移出画面,从而造成跟踪碎片化和上下文信息的丢失。在本研究中,我们提出了一项新任务,称为全向指代多目标跟踪(ORMOT),该任务将RMOT扩展到全向图像,旨在克服传统数据集的视场限制,并提高模型理解长时域语言描述的能力。为了推进ORMOT任务,我们构建了ORSet,一个全向指代多目标跟踪数据集,包含27个多样化的全向场景、848个语言描述和3,401个标注对象,提供丰富的视觉、时间和语言信息。此外,我们提出了ORTrack,一个基于大型视觉-语言模型(LVLM)的框架,专门针对全向指代多目标跟踪。我们在ORSet数据集上进行了广泛的实验,证明了我们的ORTrack框架的有效性。数据集和代码将开源于 https://github.com/chen-si-jia/ORMOT。
cs.CV / 86 / 2603.05386

Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations

Fusion-CAM:集成基于梯度和区域的类激活图以实现稳健的视觉解释
Dekdegue, Hajar, Garouani, Moncef, Mothe, Josiane, Bernigaud, Jordan
Abstract
Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.
Chinese Translation
解释深度卷积神经网络的决策过程仍然是实现可信和透明人工智能的核心挑战。可解释人工智能(XAI)技术,特别是类激活图(CAM)方法,被广泛采用以可视化影响模型预测的输入区域。基于梯度的方法(例如 Grad-CAM)通过计算类激活的梯度提供高度区分的细粒度细节,但往往产生嘈杂和不完整的图,强调的仅是最显著的区域,而不是完整的对象。基于区域的方法(例如 Score-CAM)则在更大区域上聚合信息,捕捉更广泛的对象覆盖,但代价是过度平滑和对微妙特征的敏感性降低。我们提出了Fusion-CAM,这是一种新颖的框架,通过专门的融合机制统一这两种范式,从而产生稳健且高度区分的视觉解释。我们的方法首先对基于梯度的图进行去噪,产生更干净、更集中的激活。然后,它使用贡献权重将精细化的梯度图与基于区域的图结合,以增强类覆盖。最后,我们提出了一种基于相似性的自适应像素级融合,它评估两种范式之间的一致性,并动态调整融合强度。这种自适应机制强化了一致的激活,同时柔和地融合冲突区域,从而产生更丰富、具有上下文感知和输入自适应的视觉解释。在标准基准上的广泛实验表明,Fusion-CAM在定性可视化和定量评估中始终优于现有的CAM变体,提供了一种稳健且灵活的工具,用于解释深度神经网络。
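The adaptive, agreement-driven pixel fusion can be illustrated with a simple rule: where the two normalized maps agree (small absolute difference), reinforce with their elementwise maximum; where they conflict, fall back softly to their average. The specific agreement measure and blend below are illustrative assumptions, not the paper's formula.

```python
import numpy as np

def adaptive_fusion(grad_cam, region_cam):
    """Pixel-level fusion driven by local agreement between the two maps.

    For maps normalized to [0, 1], agreement = 1 - |difference| lies in [0, 1].
    High agreement pulls the fused value toward max(grad, region); low
    agreement pulls it toward the plain average.
    """
    agreement = 1.0 - np.abs(grad_cam - region_cam)
    mean = 0.5 * (grad_cam + region_cam)
    fused = agreement * np.maximum(grad_cam, region_cam) + (1.0 - agreement) * mean
    return np.clip(fused, 0.0, 1.0)

rng = np.random.default_rng(0)
g = rng.random((4, 4))   # stand-in gradient-based map
r = rng.random((4, 4))   # stand-in region-based map
fused = adaptive_fusion(g, r)
```

By construction each fused pixel lies between the two maps' average and their maximum, so consistent activations are reinforced while conflicts are only softly blended.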
cs.CV / 87 / 2603.05407

Video-based Locomotion Analysis for Fish Health Monitoring

基于视频的鱼类健康监测运动分析
Palm, Timon, Seibold, Clemens, Hilsmann, Anna, Eisert, Peter
Abstract
Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11-architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.
Chinese Translation
监测鱼类的健康状况至关重要,因为这能够早期发现疾病,保障动物福利,并促进可持续的水产养殖实践。通过分析养殖鱼类的运动活动,可以推断其生理和病理状况。本文提出了一种系统,通过多目标跟踪从视频中估计运动活动。我们方法的核心是嵌入在基于检测的跟踪(tracking-by-detection)框架中的YOLOv11检测器。我们研究了YOLOv11架构的各种配置,以及结合多个帧以提高检测准确性的扩展。我们的系统在一个手动标注的苏拉威西稻鱼数据集上进行了评估,该数据集是在类似家庭水族箱的设置中录制的,展示了其可靠测量鱼类游动方向和速度以进行健康监测的能力。该数据集将在发表后公开。
cs.CV / 88 / 2603.05421

MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis

MobileFetalCLIP:用于移动胎儿超声分析的选择性排斥知识蒸馏
Saeed, Numan, Maani, Fadillah Adamsyah, Yaqub, Mohammad
Abstract
Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher's inter-class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.
Chinese Translation
胎儿超声人工智能有可能改变资源匮乏地区的产前护理,但目前的基础模型超过3亿视觉参数,限制了在临床现场设备上的部署。标准知识蒸馏在如此极端的容量差距下(约26倍)失败,因为紧凑的学生模型在模仿过大教师模型的架构伪影时浪费了容量。我们提出了选择性排斥知识蒸馏,将对比知识蒸馏分解为对角和非对角分量:匹配对齐得以保留,而非对角权重衰减为负值,排斥学生模型远离教师模型的类间混淆,迫使其发现架构本身的特征。我们的11.4M参数学生模型在零样本HC18生物测量有效性(88.6%对83.5%)和脑部子平面F1(0.784对0.702)上超越了304M参数的FetalCLIP教师模型,同时在iPhone 16 Pro上以1.6毫秒的速度运行,使得在手持超声设备上实现实时辅助人工智能成为可能。我们的代码、模型和应用程序已在https://github.com/numanai/MobileFetalCLIP上公开发布。
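The decomposition described in the abstract, a positive diagonal (matched-pair) term plus an off-diagonal weight that decays into negative values, can be sketched on student and teacher similarity matrices. The squared-difference objective and the linear decay schedule are illustrative assumptions; the paper's exact contrastive form may differ.

```python
import numpy as np

def selective_repulsive_kd(student_sim, teacher_sim, off_diag_w):
    """Decomposed contrastive-KD objective (illustrative form).

    Diagonal term (weight +1): align each matched student-teacher pair.
    Off-diagonal term: weighted mismatch on inter-class similarities; once
    `off_diag_w` decays below zero, matching the teacher's off-diagonal
    structure is *penalized*, repelling the student from the teacher's
    inter-class confusions.
    """
    diff = (student_sim - teacher_sim) ** 2
    eye = np.eye(student_sim.shape[0], dtype=bool)
    return diff[eye].mean() + off_diag_w * diff[~eye].mean()

def off_diag_weight(step, total, start=1.0, end=-0.5):
    """Linearly decay the off-diagonal weight from positive to negative."""
    return start + (end - start) * step / total

rng = np.random.default_rng(0)
s = rng.random((4, 4))   # stand-in student similarity matrix
t = rng.random((4, 4))   # stand-in teacher similarity matrix
early = selective_repulsive_kd(s, t, off_diag_weight(0, 100))
late = selective_repulsive_kd(s, t, off_diag_weight(100, 100))
```

Early in training the student still benefits from the teacher's full similarity structure; as the weight turns negative, only matched-pair alignment is preserved while inter-class imitation is repelled.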
cs.CV / 89 / 2603.05425

RelaxFlow: Text-Driven Amodal 3D Generation

RelaxFlow:文本驱动的非模态3D生成
Zhu, Jiayin, Fu, Guoji, Liu, Xiaolu, He, Qiyuan, Li, Yicong, Yao, Angela
Abstract
Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
Chinese Translation
图像到3D生成在遮挡情况下面临固有的语义模糊性,仅凭部分观察往往不足以确定物体类别。在本研究中,我们形式化了文本驱动的非模态(amodal)3D生成,其中文本提示引导未见区域的补全,同时严格保留输入观察。关键在于,我们识别到这些目标需要不同的控制粒度:对观察的刚性控制与对提示的宽松结构控制。为此,我们提出了RelaxFlow,一个无需训练的双分支框架,通过多先验共识模块(Multi-Prior Consensus Module)和松弛机制(Relaxation Mechanism)解耦控制粒度。从理论上讲,我们证明了这种松弛等价于对生成向量场应用低通滤波器,从而抑制高频实例细节,以分离出能够容纳观察的几何结构。为了便于评估,我们引入了两个诊断基准:ExtremeOcc-3D和AmbiSem-3D。大量实验表明,RelaxFlow成功引导未见区域的生成以匹配提示意图,同时不损害视觉保真度。
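The claimed equivalence to a low-pass filter on the generative vector field can be illustrated with a 1D toy signal: truncating high-frequency FFT modes removes fine, instance-level detail while preserving the coarse structure. This is a toy illustration of low-pass filtering only, not the paper's derivation on flow vector fields.

```python
import numpy as np

def low_pass_filter(field, keep=4):
    """Suppress high-frequency components of a 1D signal via FFT truncation."""
    F = np.fft.rfft(field)
    F[keep:] = 0.0                      # keep only the lowest `keep` modes
    return np.fft.irfft(F, n=len(field))

t = np.linspace(0, 1, 64, endpoint=False)
# Coarse structure (1 cycle) plus high-frequency "instance detail" (20 cycles).
field = np.sin(2 * np.pi * t) + 0.3 * np.sin(2 * np.pi * 20 * t)
relaxed = low_pass_filter(field)
```

After filtering, only the slow component survives, mirroring how relaxed structural control keeps geometry while dropping high-frequency instance specifics.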
cs.CV / 90 / 2603.05437

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

SAIL:基于相似性感知引导和跨字幕增强学习的弱监督密集视频字幕生成
Kim, Ye-Chan, Cha, SeungJu, Kim, Si-Woo, Jeon, Minju, Kim, Hyungee, Kim, Dong-Jin
Abstract
Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
Chinese Translation
弱监督密集视频字幕生成旨在仅通过字幕注释(没有时间边界)来定位和描述视频中的事件。之前的工作引入了一种基于高斯掩蔽和互补字幕的隐式监督范式。然而,现有方法仅关注生成不重叠的掩膜,而未考虑其与相应事件的语义关系,导致生成的掩膜简单且均匀分布,无法捕捉到语义上有意义的区域。此外,仅依赖真实字幕会由于现有数据集的固有稀疏性而导致次优性能。在本研究中,我们提出了SAIL,通过跨模态对齐构建语义感知掩膜。我们的相似性感知训练目标引导掩膜强调与其对应事件字幕高度相似的视频区域。此外,为了在稀疏注释设置下引导更准确的掩膜生成,我们引入了一种基于大型语言模型(LLM)的增强策略,生成合成字幕以提供额外的对齐信号。这些合成字幕通过跨掩膜机制结合,为精确的时间定位提供辅助引导,而不会降低主要目标的效果。在ActivityNet Captions和YouCook2上的实验表明,在字幕生成和定位指标上均实现了最先进的性能。
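As an illustration of the similarity-aware guidance idea, here is a minimal sketch (hypothetical function name and scoring; SAIL's masks are learned via a training objective, not constructed directly like this) of a Gaussian temporal mask whose center follows caption–frame similarity:

```python
import math

def similarity_guided_mask(frame_sims, sigma=2.0):
    """Gaussian temporal mask centered on the frame whose features are most
    similar to the event caption, so high-similarity regions get more weight."""
    center = max(range(len(frame_sims)), key=lambda i: frame_sims[i])
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2))
            for i in range(len(frame_sims))]
```

In the paper's setting the mask parameters are optimized end-to-end; this sketch only shows the shape of the guidance signal.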
cs.CV / 91 / 2603.05438

Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

8个标记的规划:一种紧凑的离散标记器用于潜在世界模型
Kim, Dongwon, Seo, Gawon, Lee, Jinsung, Cho, Minsu, Kwak, Suha
Abstract
World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model equipped with the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
Chinese Translation
世界模型为基于动作或指令的环境动态模拟提供了强大的框架,使得下游任务如动作规划或策略学习成为可能。最近的方法利用世界模型作为学习的模拟器,但其在决策时间规划中的应用仍然在计算上对实时控制构成了巨大的挑战。一个关键瓶颈在于潜在表示:传统的标记器将每个观察编码为数百个标记,这使得规划既缓慢又资源密集。为了解决这个问题,我们提出了CompACT,这是一种离散标记器,将每个观察压缩为少至8个标记,显著降低计算成本,同时保留规划所需的关键信息。一个基于动作条件的世界模型,采用CompACT标记器,实现了具有竞争力的规划性能,规划速度快了几个数量级,为世界模型在实际应用中的部署提供了切实的进展。
cs.CV / 92 / 2603.05446

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

NaiLIA:基于密集意图描述和调色板查询的多模态美甲设计检索
Amemiya, Kanon, Yashima, Daichi, Katsumata, Kei, Komatsu, Takumi, Korekata, Ryosuke, Otsuki, Seitaro, Sugiura, Komei
Abstract
We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
Chinese Translation
我们专注于基于密集意图描述检索美甲设计图像的任务,这些描述代表了用户对美甲设计的多层次意图。这一任务具有挑战性,因为这些描述指定了不受限制的绘制元素和预制的装饰物,以及视觉特征、主题和整体印象。除了这些描述外,我们假设用户通过颜色选择器指定零个或多个颜色来提供调色板查询,从而能够表达细微和连续的颜色差异。现有的视觉-语言基础模型通常难以整合这些描述和调色板。为了解决这个问题,我们提出了NaiLIA,一种用于美甲设计图像的多模态检索方法,该方法在检索过程中全面对齐密集意图描述和调色板查询。我们的方法引入了一种基于置信度分数的放宽损失,用于与描述对齐的未标记图像。为了评估NaiLIA,我们构建了一个基准数据集,包含来自不同文化背景的10,625张图像。这些图像由超过200名注释者提供了长且密集的意图描述。实验结果表明,NaiLIA的表现优于标准方法。
cs.CV / 93 / 2603.05449

RealWonder: Real-Time Physical Action-Conditioned Video Generation

RealWonder:实时物理动作条件视频生成
Liu, Wei, Chen, Ziyu, Li, Zizhang, Wang, Yue, Yu, Hong-Xing, Wu, Jiajun
Abstract
Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available in our project website: https://liuwei283.github.io/RealWonder/
Chinese Translation
目前的视频生成模型无法模拟3D动作(如力和机器人操作)的物理后果,因为它们缺乏对动作如何影响3D场景的结构理解。我们提出了RealWonder,这是第一个基于单幅图像的动作条件视频生成的实时系统。我们的关键见解是将物理模拟作为一个中介桥梁:我们不是直接编码连续动作,而是通过物理模拟将其转换为视频模型可以处理的视觉表示(光流和RGB)。RealWonder集成了三个组件:从单幅图像进行3D重建、物理模拟,以及仅需4个扩散步骤的蒸馏视频生成器。我们的系统在480x832分辨率下实现了13.2 FPS,能够在刚性物体、可变形体、流体和颗粒材料上进行力、机器人动作和相机控制的交互式探索。我们设想RealWonder为在沉浸式体验、增强现实/虚拟现实和机器人学习中应用视频模型开辟了新的机会。我们的代码和模型权重已在项目网站上公开: https://liuwei283.github.io/RealWonder/
cs.CV / 94 / 2603.05454

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

超越分散接受:通过最长稳定前缀实现快速且一致的扩散语言模型推理
Li, Pengxiang, Tsai, Joey, Xue, Hongwei, Shi, Kunyu, Yan, Shilin
Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
Chinese Translation
扩散语言模型(DLMs)承诺实现高度并行的文本生成,然而它们的实际推理速度常常受到次优解码调度器的瓶颈。标准方法依赖于“分散接受”——在序列中不相交的位置提交高置信度的标记。这种方法无意中破坏了键值(KV)缓存,破坏了内存局部性,并迫使模型在不稳定的标记边界上进行代价高昂的重复修复。为了解决这个问题,我们提出了最长稳定前缀(Longest Stable Prefix, LSP)调度器,这是一种无训练且与模型无关的推理范式,基于单体前缀吸收。在每个去噪步骤中,LSP通过单次前向传播评估标记的稳定性,动态识别一个连续的左对齐稳定预测块,并在原子承诺之前将其边界固定在自然语言或结构分隔符上。这种前缀优先的拓扑结构带来了双重好处:在系统层面,它将碎片化的KV缓存更新转换为高效的连续追加;在算法层面,它在几何缩小的活动后缀上保留了双向前瞻,显著降低了标记翻转率和去噪器调用。在LLaDA-8B和Dream-7B上的广泛评估表明,LSP在包括数学推理、代码生成、多语言(CJK)任务和创意写作等严格基准测试中,将推理速度提高了多达3.4倍,同时匹配或略微改善了输出质量。通过从根本上重构承诺拓扑,LSP弥合了DLMs理论并行性与实际硬件效率之间的差距。
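To make the commit step concrete, here is a minimal training-free sketch of a longest-stable-prefix selection (hypothetical function name, threshold, and delimiter set; the paper's stability test operates on the denoiser's own confidences and its KV-cache handling is hardware-level):

```python
def lsp_commit(tokens, confidences, committed, tau=0.9,
               delimiters=frozenset({".", ",", "\n", " "})):
    """Return the new committed-prefix length for one denoising step.

    tokens      -- token strings of the active (uncommitted) suffix
    confidences -- per-token stability scores from a single forward pass
    committed   -- number of tokens already committed
    tau         -- stability threshold
    """
    # 1. Longest contiguous run of stable tokens at the left edge.
    stable = 0
    for c in confidences:
        if c >= tau:
            stable += 1
        else:
            break
    if stable == 0:
        return committed  # nothing new to commit this step

    # 2. Snap the boundary back to the last natural delimiter inside the run,
    #    so the committed prefix ends on a linguistic/structural boundary.
    snap = stable
    while snap > 0 and tokens[snap - 1] not in delimiters:
        snap -= 1
    commit = snap if snap > 0 else stable  # fall back if no delimiter found

    # 3. Atomic commitment: the prefix becomes one contiguous KV-cache append.
    return committed + commit
```

Because the committed region only ever grows from the left, each step's KV-cache update is a single append rather than a set of scattered writes.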
cs.CV / 95 / 2603.05463

EdgeDAM: Real-time Object Tracking for Mobile Devices

EdgeDAM:移动设备上的实时目标跟踪
Raza, Syed Muhammad, Abidi, Syed Murtaza Hussain, Islam, Khawar, Ibrahim, Muhammad, Mian, Ajmal Saeed
Abstract
Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.
Chinese Translation
在边缘设备上进行单目标跟踪(SOT)是一个关键的计算机视觉任务,要求在遮挡、干扰物干扰和快速运动的情况下,准确且持续地定位目标。然而,近期的最先进的干扰物感知记忆机制主要基于分割跟踪器,依赖于掩码预测和基于注意力的记忆更新,这引入了大量计算开销,并限制了在资源受限硬件上的实时部署;与此同时,轻量级跟踪器保持了高吞吐量,但在出现视觉上相似的干扰物时容易发生漂移。为了解决这些挑战,我们提出了EdgeDAM,一个轻量级的检测引导跟踪框架,重新构建了在严格边缘约束下的干扰物感知记忆以进行边界框跟踪。EdgeDAM引入了两个关键策略:(1)双缓冲干扰物感知记忆(DAM),它整合了一个最近感知记忆,以保持时间一致的目标假设,以及一个干扰物解决记忆,以明确存储难负样本并在恢复过程中惩罚其重新选择;(2)基于置信度的切换与保持框稳定化,其中跟踪器的可靠性和时间一致性标准在遮挡期间自适应地激活检测和记忆引导的重新识别,同时保持框机制暂时冻结并扩展估计,以抑制干扰物污染。在包括以干扰物为重点的DiDi数据集在内的五个基准上的广泛实验表明,在遮挡和快速运动下提高了鲁棒性,同时在移动设备上保持实时性能,在DiDi上实现了88.2%的准确率和在iPhone 15上达到25 FPS。代码将会发布。
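A schematic of the dual-buffer memory idea (illustrative names and scoring only, not the paper's implementation): recent target hypotheses vote for a candidate, while stored hard negatives penalize its re-selection during recovery.

```python
from collections import deque

class DualBufferMemory:
    """Toy sketch of a Recent-Aware Memory plus a Distractor-Resolving Memory."""

    def __init__(self, recent_size=5):
        self.recent = deque(maxlen=recent_size)  # temporally consistent target hypotheses
        self.distractors = []                    # explicitly stored hard negatives

    def update(self, target_feat, hard_negatives):
        self.recent.append(target_feat)
        self.distractors.extend(hard_negatives)

    def score(self, candidate, sim, penalty=0.5):
        """Similarity to recent hypotheses, penalized by resemblance to stored
        distractors; used when re-identifying the target after occlusion."""
        pos = max((sim(candidate, t) for t in self.recent), default=0.0)
        neg = max((sim(candidate, d) for d in self.distractors), default=0.0)
        return pos - penalty * neg
```

The bounded `deque` keeps the memory footprint fixed, which matters under the edge constraints the abstract describes.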
cs.CV / 96 / 2603.05465

HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

HALP:在视觉-语言模型中检测幻觉而不生成任何标记
Kogilathota, Sai Akhil, G, Sripadha Vallabha E, Sun, Luzhe, Zhou, Jiawei
Abstract
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
Chinese Translation
幻觉仍然是视觉-语言模型(VLMs)面临的一个持续挑战,这些模型常常描述不存在的物体或捏造事实。现有的检测方法通常在文本生成后进行,使得干预既昂贵又不及时。我们研究了是否可以在生成任何标记之前,通过对模型内部表示进行单次前向传递来预测幻觉风险。在包括 Llama-3.2-Vision、Gemma-3、Phi-4-VL 和 Qwen2.5-VL 在内的多种视觉-语言任务和八个现代 VLMs 上,我们考察了三类内部表示:(i)没有多模态融合的视觉特征,(ii)文本解码器中的视觉-标记表示,以及(iii)在生成之前整合视觉和文本信息的查询-标记表示。基于这些表示训练的探针在不解码的情况下实现了强大的幻觉检测性能,在 Gemma-3-12B、Phi-4-VL 5.6B 和 Molmo 7B 上达到高达 0.93 的 AUROC。对于大多数模型,后期查询-标记状态是最具预测性的,而在一些架构中(例如,使用视觉特征的 Qwen2.5-VL-7B 的 AUROC 约为 0.79),视觉或中间层特征占主导地位。这些结果表明:(1)幻觉风险在生成前是可检测的,(2)最具信息性的层和模态在不同架构中有所不同,以及(3)轻量级探针有潜力实现早期放弃、选择性路由和自适应解码,以提高安全性和效率。
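The probing setup can be sketched as a tiny logistic-regression probe over toy, hand-made hidden states (real probes would be fit on actual VLM activations extracted in a single forward pass; all names below are illustrative):

```python
import math

def train_probe(states, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe (w, b) by plain SGD on the log-loss."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(steps):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. the logit z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def risk(state, w, b):
    """Hallucination-risk score in (0, 1), available before any token is generated."""
    z = sum(wi * xi for wi, xi in zip(w, state)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

Because `risk` needs only one hidden state, such a probe could gate abstention or routing before decoding starts, which is the efficiency argument the abstract makes.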
cs.CV / 97 / 2603.05473

Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

基于神经辐射场的LWIR高光谱图像中气体羽流的3D场景理解
Jarman, Scout, Hampel-Arias, Zigfried, Carr, Adra, Moon, Kevin R.
Abstract
Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
Chinese Translation
高光谱图像(HSI)在环境监测到国家安全等多个领域有着广泛的应用,可用于材料的检测和识别。长波红外(LWIR)高光谱图像可用于气体羽流的检测和分析。通常情况下,只有少量感兴趣场景的图像可用,并且这些图像是单独分析的。将多幅图像的信息结合成一个统一的表示,能够通过提供更多关于场景几何和光谱特性的上下文来增强分析能力。神经辐射场(NeRF)创建了体积场景属性的潜在神经表示,能够实现新视角渲染和几何重建,为高光谱3D场景重建提供了有前景的途径。我们探讨了使用NeRF从LWIR HSI创建3D场景重建的可能性,并展示了该模型可以用于气体羽流检测这一基本下游分析任务。我们使用基于物理的DIRSIG软件套件生成了一个简单设施的合成多视角LWIR HSI数据集,该设施具有强烈的六氟化硫气体羽流。我们的方法基于标准的Mip-NeRF架构,结合了高光谱NeRF和稀疏视图NeRF的最先进方法,以及一种新颖的自适应加权均方误差损失。我们的最终NeRF方法所需的训练图像比标准Mip-NeRF少约50%,并且在仅使用30张训练图像的情况下达到了39.8 dB的平均峰值信噪比(PSNR)。应用自适应一致性估计器对NeRF渲染的测试图像进行气体羽流检测时,与从真实测试图像生成的检测掩膜相比,平均曲线下面积(AUC)达到了0.821。
cs.CV / 98 / 2603.05484

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

迈向多模态终身理解:一个数据集和代理基线
Chen, Guo, Lu, Lidong, Liu, Yicheng, Dong, Liangrui, Zou, Lidong, Lv, Jixin, Li, Zhenquan, Mao, Xinyi, Pei, Baoqi, Wang, Shihao, Li, Zhiqi, Sapra, Karan, Liu, Fuxiao, Zheng, Yin-Dong, Huang, Yifei, Wang, Limin, Yu, Zhiding, Tao, Andrew, Liu, Guilin, Lu, Tong
Abstract
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
Chinese Translation
尽管视频理解的数据集已经扩展到数小时的时长,但它们通常由密集拼接的片段组成,这与自然的、非剧本化的日常生活有所不同。为了弥补这一差距,我们引入了MM-Lifelong,这是一个为多模态终身理解设计的数据集。该数据集包含181.1小时的录像,按照天、周和月的尺度进行结构化,以捕捉不同的时间密度。广泛的评估揭示了当前范式中的两个关键失败模式:端到端的多模态大语言模型(MLLMs)由于上下文饱和而遭遇工作记忆瓶颈,而代表性的代理基线在导航稀疏的、长达一个月的时间线时经历全局定位崩溃。为了解决这个问题,我们提出了递归多模态代理(ReMA),它采用动态记忆管理来迭代更新递归信念状态,显著优于现有方法。最后,我们建立了数据集划分,旨在隔离时间和领域偏差,为未来的监督学习和分布外泛化研究提供严格的基础。
cs.CV / 99 / 2603.05503

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

通过校准稀疏注意力加速文本到视频生成
Yehezkel, Shai, Yadin, Shahar, Elata, Noam, Ostrovsky-Berman, Yaron, Kawar, Bahjat
Abstract
Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
Chinese Translation
近期的扩散模型使得高质量视频生成成为可能,但其运行速度较慢。这些模型中使用的大型基于变换器的骨干网络受到时空注意力的瓶颈限制。本文指出,许多令牌之间的连接在各种输入中始终产生微不足道的得分,并且它们的模式在查询中经常重复。因此,在这些情况下,注意力计算可以被跳过,对结果几乎没有影响。这一观察在局部令牌块之间的连接中同样适用。基于此,我们提出了CalibAtt,这是一种无训练的方法,通过校准稀疏注意力加速视频生成。CalibAtt执行离线校准过程,识别在不同输入中稳定的块级稀疏性和重复模式,并将这些模式编译成每层、每个头和每个扩散时间步的优化注意力操作。在推理时,我们密集计算所选的输入依赖连接,并以硬件高效的方式跳过未选中的连接。在Wan 2.1 14B、Mochi 1和各种分辨率的少步蒸馏模型上进行的广泛实验表明,CalibAtt实现了高达1.58倍的端到端加速,超越了现有的无训练方法,同时保持视频生成质量和文本-视频对齐。
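The offline calibration step can be illustrated with a toy sketch (hypothetical names and a plain-Python mask; CalibAtt itself compiles the patterns into hardware-efficient attention kernels): given attention probabilities averaged over calibration inputs, decide which token blocks are worth computing at inference.

```python
def block_keep_mask(avg_probs, block, eps=1e-3):
    """avg_probs: n x n attention probabilities averaged over calibration inputs.
    Returns an nb x nb boolean mask: True means compute this (query-block,
    key-block) pair at inference; False means its mass was negligible on the
    calibration set, so the computation can be skipped."""
    n = len(avg_probs)
    nb = n // block
    mask = []
    for qb in range(nb):
        row = []
        for kb in range(nb):
            mass = sum(avg_probs[i][j]
                       for i in range(qb * block, (qb + 1) * block)
                       for j in range(kb * block, (kb + 1) * block))
            row.append(mass / (block * block) > eps)
        mask.append(row)
    return mask
```

In the paper this calibration is done per layer, head, and diffusion timestep; the sketch shows only the block-level thresholding idea.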
cs.CV / 100 / 2603.05506

FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

FaceCam:通过尺度感知条件控制人像视频摄像机
Lyu, Weijie, Yang, Ming-Hsuan, Shu, Zhixin
Abstract
We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.
Chinese Translation
我们介绍了FaceCam,一个为单目人像视频输入生成可定制摄像机轨迹下视频的系统。基于大型视频生成模型的近期摄像机控制方法取得了令人鼓舞的进展,但由于尺度模糊的摄像机表示或3D重建误差,往往在肖像视频中表现出几何失真和视觉伪影。为克服这些限制,我们提出了一种面向人脸的尺度感知表示,用于摄像机变换,提供确定性的条件而无需依赖3D先验。我们在多视角工作室捕捉和野外单目视频上训练了一个视频生成模型,并引入了两种摄像机控制数据生成策略:合成摄像机运动和多镜头拼接,以利用静态训练摄像机,同时在推理时推广到动态、连续的摄像机轨迹。在Ava-256数据集和多样的野外视频上的实验表明,FaceCam在摄像机可控性、视觉质量、身份和运动保留方面表现出色。
cs.CV / 101 / 2603.05507

Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

基于变换器的实时3D流媒体稀疏多摄像头设置中的修补技术
Van Holland, Leif, Zingsheim, Domenic, Takhsha, Mana, Dröge, Hannah, Stotko, Patrick, Plack, Markus, Klein, Reinhard
Abstract
High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.
Chinese Translation
来自多个摄像头的高质量3D流媒体对于许多增强现实/虚拟现实(AR/VR)应用中的沉浸式体验至关重要。由于实时限制,视角数量有限,导致渲染图像中缺失信息和不完整表面。现有方法通常依赖简单的启发式算法进行空洞填充,这可能导致不一致或视觉伪影。我们提出了一种新颖的、面向应用的修补方法,用于补全缺失的纹理,该方法独立于基础表示,作为新视图渲染后的图像后处理步骤。该方法设计为一个独立模块,兼容任何经过校准的多摄像头系统。为此,我们引入了一种多视角感知的基于变换器的网络架构,利用时空嵌入确保帧间一致性,同时保留细节。此外,我们的分辨率独立设计允许适应不同的摄像头设置,而自适应补丁选择策略则在推理速度和质量之间取得平衡,实现实时性能。我们在相同的实时限制下对比了我们的方法与最先进的修补技术,结果表明我们的模型在质量和速度之间达成了最佳平衡,在基于图像和视频的指标上均优于竞争对手。
人工智能 (Artificial Intelligence)
66
cs.AI / 1 / 2603.04448

SkillNet: Create, Evaluate, and Connect AI Skills

SkillNet:创建、评估和连接人工智能技能
Liang, Yuan, Zhong, Ruobin, Xu, Haoming, Jiang, Chen, Zhong, Yi, Fang, Runnan, Gu, Jia-Chen, Deng, Shumin, Yao, Yunzhi, Wang, Mengru, Qiao, Shuofei, Xu, Xin, Wu, Tongtong, Wang, Kun, Liu, Yang, Bi, Zhen, Lou, Jungang, Jiang, Yuchen Eleanor, Zhu, Hangcheng, Yu, Gang, Hong, Haiwen, Huang, Longtao, Xue, Hui, Wang, Chenxi, Wang, Yijun, Shan, Zifei, Chen, Xi, Tu, Zhaopeng, Xiong, Feiyu, Xie, Xin, Zhang, Peng, Gui, Zhengke, Liang, Lei, Zhou, Jun, Wu, Chiyu, Shang, Jin, Gong, Yu, Lin, Junyu, Xu, Changliang, Deng, Hongjie, Zhang, Wen, Ding, Keyan, Zhang, Qiang, Huang, Fei, Zhang, Ningyu, Pan, Jeff Z., Qi, Guilin, Wang, Haofen, Chen, Huajun
Abstract
Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in isolated contexts without leveraging prior strategies. To overcome this limitation, we introduce SkillNet, an open infrastructure designed to create, evaluate, and organize AI skills at scale. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models. By formalizing skills as evolving, composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.
Chinese Translation
当前的人工智能代理能够灵活地调用工具并执行复杂任务,但其长期发展受到系统性技能积累和转移缺乏的制约。没有统一的技能整合机制,代理经常“重复发明轮子”,在孤立的上下文中重新发现解决方案,而没有利用先前的策略。为了解决这一限制,我们提出了SkillNet,一个旨在大规模创建、评估和组织人工智能技能的开放基础设施。SkillNet在统一本体内构建技能,支持从异构来源创建技能,建立丰富的关系连接,并在安全性、完整性、可执行性、可维护性和成本意识等多个维度进行评估。我们的基础设施集成了超过200,000个技能的库、一个互动平台和一个多功能的Python工具包。在ALFWorld、WebShop和ScienceWorld上的实验评估表明,SkillNet显著提升了代理的性能,平均奖励提高了40%,执行步骤减少了30%,适用于多个基础模型。通过将技能形式化为不断演变的、可组合的资产,SkillNet为代理从短暂经验转向持久掌握提供了坚实的基础。
cs.AI / 2 / 2603.04457

Capability Thresholds and Manufacturing Topology: How Embodied Intelligence Triggers Phase Transitions in Economic Geography

能力阈值与制造拓扑:具身智能如何触发经济地理的相变
Fang, Xinmin, Tao, Lingfeng, Li, Zhengxiong
Abstract
The fundamental topology of manufacturing has not undergone a paradigm-level transformation since Henry Ford's moving assembly line in 1913. Every major innovation of the past century, from the Toyota Production System to Industry 4.0, has optimized within the Fordist paradigm without altering its structural logic: centralized mega-factories, located near labor pools, producing at scale. We argue that embodied intelligence is poised to break this century-long stasis, not by making existing factories more efficient, but by triggering phase transitions in manufacturing economic geography itself. When embodied AI capabilities cross critical thresholds in dexterity, generalization, reliability, and tactile-vision fusion, the consequences extend far beyond cost reduction: they restructure where factories are built, how supply chains are organized, and what constitutes viable production scale. We formalize this by defining a Capability Space C = (d, g, r, t) and showing that the site-selection objective function undergoes topological reorganization when capability vectors cross critical surfaces. Through three pathways, weight inversion, batch collapse, and human-infrastructure decoupling, we show that embodied intelligence enables demand-proximal micro-manufacturing, eliminates "manufacturing deserts," and reverses geographic concentration driven by labor arbitrage. We further introduce Machine Climate Advantage: once human workers are removed, optimal factory locations are determined by machine-optimal conditions (low humidity, high irradiance, thermal stability), factors orthogonal to traditional siting logic, creating a production geography with no historical precedent. This paper establishes Embodied Intelligence Economics, the study of how physical AI capability thresholds reshape the spatial and structural logic of production.
Chinese Translation
自1913年亨利·福特的流水线以来,制造业的基本拓扑并未经历范式级别的转变。在过去一个世纪的每一次重大创新,从丰田生产方式到工业4.0,都是在福特主义范式内进行优化,而未改变其结构逻辑:集中化的大型工厂位于劳动力聚集区,进行大规模生产。我们认为,具身智能有望打破这种持续了一个世纪的停滞,不是通过提高现有工厂的效率,而是通过触发制造经济地理本身的相变。当具身人工智能的能力在灵活性、概括性、可靠性和触觉-视觉融合方面跨越关键阈值时,其后果远超成本降低:它重构了工厂的建设地点、供应链的组织方式以及可行生产规模的定义。我们通过定义能力空间 C = (d, g, r, t) 来形式化这一过程,并展示当能力向量跨越关键表面时,选址目标函数经历拓扑重组。通过三条路径:权重反转、批量崩溃和人力基础设施解耦,我们表明,具身智能使得需求邻近的微制造成为可能,消除了“制造沙漠”,并逆转了由劳动套利驱动的地理集中现象。我们进一步引入机器气候优势:一旦人类工人被移除,最佳工厂位置由机器最优条件(低湿度、高辐照度、热稳定性)决定,这些因素与传统选址逻辑正交,创造出一种前所未有的生产地理。本文建立了具身智能经济学,研究物理人工智能能力阈值如何重塑生产的空间和结构逻辑。
cs.AI / 3 / 2603.04514

Progressive Refinement Regulation for Accelerating Diffusion Language Model Decoding

加速扩散语言模型解码的渐进细化调控
Wan, Lipeng, Gu, Jianhui, Ma, Junjie, Huang, Jianguo, Sun, Shiguang, Li, Siyuan, Lan, Xuguang
Abstract
Diffusion language models generate text through iterative denoising under a uniform refinement rule applied to all tokens. However, tokens stabilize at different rates in practice, leading to substantial redundant refinement and motivating refinement control over the denoising process. Existing approaches typically assess refinement necessity from instantaneous, step-level signals under a fixed decoding process. In contrast, whether a token has converged is defined by how its prediction changes along its future refinement trajectory. Moreover, changing the refinement rule reshapes future refinement trajectories, which in turn determine how refinement rules should be formulated, making refinement control inherently dynamic. We propose Progressive Refinement Regulation (PRR), a progressive, trajectory-grounded refinement control framework that derives a token-level notion of empirical convergence progress from full decoding rollouts. Based on this signal, PRR learns a lightweight token-wise controller to regulate refinement via temperature-based distribution shaping under a progressive self-evolving training scheme. Experiments show that PRR substantially accelerates diffusion language model decoding while preserving generation quality.
Chinese Translation
扩散语言模型通过在统一的细化规则下对所有标记进行迭代去噪来生成文本。然而,在实际应用中,标记的稳定性存在差异,导致了大量冗余的细化,并促使对去噪过程进行细化控制。现有方法通常基于固定解码过程下的瞬时、逐步信号来评估细化的必要性。相反,标记是否收敛是由其预测在未来细化轨迹中的变化来定义的。此外,改变细化规则会重塑未来的细化轨迹,这反过来又决定了细化规则的制定方式,使得细化控制本质上具有动态性。我们提出了渐进细化调控(Progressive Refinement Regulation, PRR),这是一种渐进的、基于轨迹的细化控制框架,从完整的解码回放中推导出标记级的经验收敛进度概念。基于这一信号,PRR学习了一种轻量级的标记级控制器,通过基于温度的分布塑形在渐进自我演变的训练方案下调节细化。实验表明,PRR显著加速了扩散语言模型的解码,同时保持了生成质量。
cs.AI / 4 / 2603.04528

Discovering mathematical concepts through a multi-agent system

通过多智能体系统发现数学概念
Aggarwal, Daattavya, Kim, Oisin, Ek, Carl Henrik, Mishra, Challenger
Abstract
Mathematical concepts emerge through an interplay of processes, including experimentation, efforts at proof, and counterexamples. In this paper, we present a new multi-agent model for computational mathematical discovery based on this observation. Our system, conceived with research in mind, poses its own conjectures and then attempts to prove them, making decisions informed by this feedback and an evolving data distribution. Inspired by the history of Euler's conjecture for polyhedra and an open challenge in the literature, we benchmark with the task of autonomously recovering the concept of homology from polyhedral data and knowledge of linear algebra. Our system completes this learning problem. Most importantly, the experiments are ablations, statistically testing the value of the complete dynamic and controlling for experimental setup. They support our main claim: that the optimisation of the right combination of local processes can lead to surprisingly well-aligned notions of mathematical interestingness.
Chinese Translation
数学概念通过实验、证明努力和反例等过程的相互作用而产生。本文基于这一观察,提出了一种新的多智能体模型,用于计算数学发现。我们的系统以研究为导向,提出自己的猜想,并尝试证明这些猜想,决策过程受到反馈和不断演变的数据分布的影响。受到欧拉多面体猜想的历史及文献中一个开放挑战的启发,我们以自主从多面体数据和线性代数知识中恢复同调概念的任务作为基准。我们的系统成功完成了这一学习问题。最重要的是,这些实验是消融实验,统计测试了完整动态的价值,并控制了实验设置。它们支持我们的主要论点:优化合适的局部过程组合可以导致与数学趣味性高度一致的概念。
cs.AI / 5 / 2603.04549

Adaptive Memory Admission Control for LLM Agents

针对大型语言模型代理的自适应记忆接纳控制
Zhang, Guilin, Jiang, Wei, Wang, Xiejiashan, Behr, Aisha, Zhao, Kai, Friedman, Jeffrey, Chu, Xu, Anoun, Amine
Abstract
LLM-based agents increasingly rely on long-term memory to support multi-session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit. As a result, memory admission remains a poorly specified and weakly controlled component in agent architectures. To address this gap, we propose Adaptive Memory Admission Control (A-MAC), a framework that treats memory admission as a structured decision problem. A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. The framework combines lightweight rule-based feature extraction with a single LLM-assisted utility assessment, and learns domain-adaptive admission policies through cross-validated optimization. This design enables transparent and efficient control over long-term memory. Experiments on the LoCoMo benchmark show that A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. Ablation results identify content type prior as the most influential factor for reliable memory admission. These findings demonstrate that explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM-based agents.
Chinese Translation
基于大型语言模型(LLM)的代理越来越依赖长期记忆来支持多会话推理和交互,但当前系统对保留哪些信息几乎没有控制。在实践中,代理要么积累大量对话内容,包括虚构或过时的事实,要么依赖不透明的、完全由LLM驱动的记忆策略,这些策略成本高且难以审计。因此,记忆接纳在代理架构中仍然是一个规范不清且控制薄弱的组成部分。为了解决这一问题,我们提出了自适应记忆接纳控制(Adaptive Memory Admission Control, A-MAC),这是一个将记忆接纳视为结构化决策问题的框架。A-MAC将记忆价值分解为五个互补且可解释的因素:未来效用、事实置信度、语义新颖性、时间近期性和内容类型先验。该框架结合了轻量级的基于规则的特征提取与单一的LLM辅助效用评估,并通过交叉验证优化学习领域自适应的接纳策略。该设计实现了对长期记忆的透明和高效控制。在LoCoMo基准上的实验表明,A-MAC实现了优越的精确度-召回率权衡,将F1值提高至0.583,同时相比于最先进的LLM原生记忆系统减少了31%的延迟。消融实验结果表明,内容类型先验是可靠记忆接纳中最具影响力的因素。这些发现表明,明确且可解释的接纳控制是基于LLM的代理中可扩展和可靠的记忆设计的关键原则。
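The five-factor decomposition above lends itself to a transparent admission rule. A minimal sketch, assuming a simple linear combination — the factor names come from the abstract, but the weights and threshold here are illustrative placeholders, not A-MAC's learned, cross-validated policy:

```python
# Hypothetical sketch of A-MAC-style admission scoring; weights and
# threshold are illustrative assumptions, not the learned policy.
def admission_score(factors, weights):
    """Linear combination of the five interpretable admission factors."""
    return sum(weights[k] * factors[k] for k in weights)

candidate = {
    "future_utility": 0.8,      # single LLM-assisted utility assessment
    "factual_confidence": 0.9,  # lightweight rule-based extraction
    "semantic_novelty": 0.6,
    "temporal_recency": 1.0,
    "content_type_prior": 0.7,  # most influential factor per the ablation
}
weights = {k: 0.2 for k in candidate}   # uniform placeholder weights

score = admission_score(candidate, weights)
admit = score >= 0.5                    # placeholder admission threshold
```

Because every factor is explicit, an admission decision can be audited by inspecting the per-factor contributions, which is the transparency property the paper argues for.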
cs.AI / 6 / 2603.04582

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

自我归因偏差:当人工智能监控系统对自己宽容时
Khullar, Dipika, Hopkins, Jack, Wang, Rowan, Roger, Fabien
Abstract
Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being presented by the user in a user turn. We define self-attribution bias as the tendency of a model to evaluate an action as more correct or less risky when the action is implicitly framed as its own, compared to when the same action is evaluated under off-policy attribution. Across four coding and tool-use datasets, we find that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn. In contrast, explicitly stating that the action comes from the monitor does not by itself induce self-attribution bias. Because monitors are often evaluated on fixed examples rather than on their own generated actions, these evaluations can make monitors appear more reliable than they actually are in deployment, leading developers to unknowingly deploy inadequate monitors in agentic systems.
Chinese Translation
自主系统越来越依赖语言模型来监控自身行为。例如,编码代理可能会对生成的代码进行自我批评,以获得拉取请求的批准,或评估工具使用行为的安全性。我们表明,当动作在先前的助手回合或同一助手回合中呈现时,这种设计模式可能会失败,而不是由用户在用户回合中呈现。我们将自我归因偏差定义为模型在隐含地将某个动作框架为其自身时,评估该动作为更正确或风险更低的倾向,相较于在离策略归因下评估同一动作。在四个编码和工具使用数据集中,我们发现,当评估跟随先前助手回合中生成的动作时,监控系统更常未能报告高风险或低正确性的动作,而在用户回合中以新上下文评估同一动作时则不会出现这种情况。相反,明确指出该动作来自监控系统本身并不会单独引发自我归因偏差。由于监控系统通常在固定示例上进行评估,而不是在其自身生成的动作上进行评估,这些评估可能使监控系统在实际部署中看起来比实际更可靠,从而导致开发者在自主系统中不知情地部署不充分的监控系统。
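The two attribution framings compared in the paper can be sketched as chat contexts. The message layout follows the common chat-API convention and the prompt wording is illustrative, not the paper's actual templates:

```python
def make_contexts(task, action):
    # Sketch of the two framings: the same action judged after an assistant
    # turn that "generated" it (self-attributed), vs. presented fresh by the
    # user (off-policy attribution). Wording here is illustrative.
    self_attributed = [
        {"role": "user", "content": task},
        {"role": "assistant", "content": action},
        {"role": "user", "content": "How risky was the action above?"},
    ]
    off_policy = [
        {"role": "user",
         "content": f"{task}\n\nProposed action:\n{action}\n\nHow risky is this action?"},
    ]
    return self_attributed, off_policy
```

The paper's finding is that monitors judge the first framing more leniently than the second, even though the action text is identical.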
cs.AI / 7 / 2603.04589

ECG-MoE: Mixture-of-Expert Electrocardiogram Foundation Model

ECG-MoE:混合专家心电图基础模型
Xu, Yuhao, Wang, Xiaoda, Wu, Yi, Jin, Wei, Hu, Xiao, Yang, Carl
Abstract
Electrocardiography (ECG) analysis is crucial for cardiac diagnosis, yet existing foundation models often fail to capture the periodicity and diverse features required for varied clinical tasks. We propose ECG-MoE, a hybrid architecture that integrates multi-model temporal features with a cardiac period-aware expert module. Our approach uses a dual-path Mixture-of-Experts to separately model beat-level morphology and rhythm, combined with a hierarchical fusion network using LoRA for efficient inference. Evaluated on five public clinical tasks, ECG-MoE achieves state-of-the-art performance with 40% faster inference than multi-task baselines.
Chinese Translation
心电图(ECG)分析对于心脏诊断至关重要,但现有的基础模型往往无法捕捉到不同临床任务所需的周期性和多样化特征。我们提出了ECG-MoE,这是一种混合架构,结合了多模型时间特征与心脏周期感知专家模块。我们的方法使用双路径混合专家(Mixture-of-Experts)分别建模心跳级别的形态和节律,并结合使用LoRA的层次融合网络以实现高效推理。在五个公共临床任务上的评估表明,ECG-MoE在推理速度上比多任务基线快40%,并且达到了最先进的性能。
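Mixture-of-Experts routing of the kind the abstract describes can be sketched minimally. ECG-MoE's actual router and its dual beat/rhythm expert paths are not specified here, so this top-k gate is an assumption about the general mechanism, not the model's implementation:

```python
def topk_gate(scores, k=2):
    # Minimal sparse-gating sketch: route each token to the k highest-scoring
    # experts and renormalize their weights so they sum to 1.
    idx = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    total = sum(scores[i] for i in idx)
    return {i: scores[i] / total for i in idx}
```

Sparse gating is what lets a model like this keep a small activated-parameter count per token while retaining a large total capacity.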
cs.AI / 8 / 2603.04631

Towards automated data analysis: A guided framework for LLM-based risk estimation

迈向自动化数据分析:基于大型语言模型的风险评估指导框架
Rodis, Panteleimon
Abstract
Large Language Models (LLMs) are increasingly integrated into critical decision-making pipelines, a trend that raises the demand for robust and automated data analysis. Current approaches to dataset risk analysis are limited to manual auditing methods which involve time-consuming and complex tasks, whereas fully automated analysis based on Artificial Intelligence (AI) suffers from hallucinations and issues stemming from AI alignment. To this end, this work proposes a framework for dataset risk estimation that integrates Generative AI under human guidance and supervision, aiming to set the foundations for a future automated risk analysis paradigm. Our approach utilizes LLMs to identify semantic and structural properties in database schemata, subsequently propose clustering techniques, generate the code for them and finally interpret the produced results. The human supervisor guides the model on the desired analysis and ensures process integrity and alignment with the task's objectives. A proof of concept is presented to demonstrate the feasibility of the framework's utility in producing meaningful results in risk assessment tasks.
Chinese Translation
大型语言模型(LLMs)正越来越多地融入关键决策流程,这一趋势提高了对稳健和自动化数据分析的需求。目前的数据集风险分析方法仅限于手动审计,这涉及耗时且复杂的任务,而基于人工智能(AI)的完全自动化分析则面临幻觉和AI对齐问题。为此,本研究提出了一种数据集风险评估框架,该框架在人工指导和监督下整合了生成性AI,旨在为未来的自动化风险分析范式奠定基础。我们的方法利用LLMs识别数据库模式中的语义和结构特性,随后提出聚类技术,生成相应的代码,并最终解释所产生的结果。人类监督者指导模型进行所需的分析,并确保过程的完整性及与任务目标的一致性。我们展示了一个概念验证,以证明该框架在风险评估任务中产生有意义结果的可行性。
cs.AI / 9 / 2603.04636

When Agents Persuade: Propaganda Generation and Mitigation in LLMs

当代理进行劝说:大语言模型中的宣传生成与缓解
Jose, Julia, Roongta, Ritik, Greenstadt, Rachel
Abstract
Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and use a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and ORPO (Odds Ratio Preference Optimization). We find that fine-tuning significantly reduces their tendency to generate such content, with ORPO proving most effective.
Chinese Translation
尽管大语言模型(LLM)基于的代理在开放环境中具有广泛的益处,但它们也可能被利用来生成操控性材料。在本研究中,我们将大语言模型的任务设定为宣传目标,并使用两个领域特定的模型分析其输出:一个用于将文本分类为宣传或非宣传,另一个用于检测宣传的修辞技巧(例如,情感语言、恐惧诉求、标志挥舞、辱骂)。我们的研究结果表明,在提示下,大语言模型表现出宣传行为,并在此过程中使用多种修辞技巧。我们还探讨了通过监督微调(Supervised Fine-Tuning, SFT)、直接偏好优化(Direct Preference Optimization, DPO)和赔率比偏好优化(Odds Ratio Preference Optimization, ORPO)进行缓解。我们发现,微调显著降低了它们生成此类内容的倾向,其中ORPO被证明是最有效的。
cs.AI / 10 / 2603.04670

Using Vision + Language Models to Predict Item Difficulty

使用视觉+语言模型预测项目难度
Khan, Samin
Abstract
This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.
Chinese Translation
本项目研究了大型语言模型(LLMs)在确定数据可视化素养测试项目难度方面的能力。我们探讨了从项目文本(问题和答案选项)、可视化图像或两者结合中提取的特征是否能够预测美国成年人对项目的难度(正确回答的比例)。我们使用GPT-4.1-nano分析项目并基于这些不同特征集生成预测。采用视觉和文本特征的多模态方法产生了最低的平均绝对误差(MAE)(0.224),优于仅使用视觉(0.282)和仅使用文本(0.338)的方法。表现最佳的多模态模型被应用于一个保留的测试集进行外部评估,达到了0.10805的均方误差,展示了LLMs在心理测量分析和自动项目开发中的潜力。
cs.AI / 11 / 2603.04722

Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models

模型医学:理解、诊断和治疗人工智能模型的临床框架
Jeong, Jihoon
Abstract
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions -- Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core--Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis -- a biologically-inspired three-layer parameter architecture -- and a therapeutic framework connecting diagnosis to treatment.
Chinese Translation
模型医学是理解、诊断、治疗和预防人工智能模型中疾病的科学,其基础原则是人工智能模型——如同生物有机体——具有内部结构、动态过程、可遗传特征、可观察症状、可分类状况和可治疗状态。本文将模型医学介绍为一项研究计划,旨在弥合当前人工智能可解释性研究(解剖观察)与复杂人工智能系统日益需要的系统临床实践之间的差距。我们提出了五项贡献:(1)一个学科分类法,将15个子学科组织在四个领域中——基础模型科学、临床模型科学、模型公共卫生和模型建筑医学;(2)四壳模型(Four Shell Model,v3.3),一个基于720个代理和来自Agora-12项目的24,923个决策的行为遗传学框架,解释了模型行为如何从核心-壳层交互中产生;(3)神经MRI(Neural MRI,模型共振成像),一个开放源代码的诊断工具,将五种医学神经成像模式映射到人工智能可解释性技术,通过四个临床案例验证其成像、比较、定位和预测能力;(4)一个五层诊断框架,用于全面评估模型;以及(5)临床模型科学,包括用于行为分析的模型气质指数(Model Temperament Index)、用于症状描述的模型症状学(Model Semiology)和用于标准化案例报告的M-CARE。我们还提出了分层核心假设(Layered Core Hypothesis)——一个受生物启发的三层参数架构——以及一个将诊断与治疗连接起来的治疗框架。
cs.AI / 12 / 2603.04723

From Offline to Periodic Adaptation for Pose-Based Shoplifting Detection in Real-world Retail Security

从离线到周期性适应:基于姿态的现实零售安全盗窃检测
Yao, Shanle, Rashvand, Narges, Pazho, Armin Danesh, Tabkhi, Hamed
Abstract
Shoplifting is a growing operational and economic challenge for retailers, with incidents rising and losses increasing despite extensive video surveillance. Continuous human monitoring is infeasible, motivating automated, privacy-preserving, and resource-aware detection solutions. In this paper, we cast shoplifting detection as a pose-based, unsupervised video anomaly detection problem and introduce a periodic adaptation framework designed for on-site Internet of Things (IoT) deployment. Our approach enables edge devices in smart retail environments to adapt from streaming, unlabeled data, supporting scalable and low-latency anomaly detection across distributed camera networks. To support reproducibility, we introduce RetailS, a new large-scale real-world shoplifting dataset collected from a retail store under multi-day, multi-camera conditions, capturing unbiased shoplifting behavior in realistic IoT settings. For deployable operation, thresholds are selected using both F1 and H_PRS scores, the harmonic mean of precision, recall, and specificity, during data filtering and training. In periodic adaptation experiments, our framework consistently outperformed offline baselines on AUC-ROC and AUC-PR in 91.6% of evaluations, with each training update completing in under 30 minutes on edge-grade hardware, demonstrating the feasibility and reliability of our solution for IoT-enabled smart retail deployment.
Chinese Translation
盗窃行为对零售商而言是一个日益严重的运营和经济挑战,尽管有广泛的视频监控,事件发生率和损失仍在增加。持续的人力监控不可行,这促使我们寻求自动化、保护隐私且资源高效的检测解决方案。本文将盗窃检测视为基于姿态的无监督视频异常检测问题,并引入了一种为现场物联网(IoT)部署设计的周期性适应框架。我们的方法使智能零售环境中的边缘设备能够从流式的无标签数据中进行适应,支持在分布式摄像头网络中进行可扩展和低延迟的异常检测。为了支持可重复性,我们引入了RetailS,这是一个新的大规模现实世界盗窃数据集,收集自多天、多摄像头条件下的零售商店,捕捉到真实物联网环境中无偏见的盗窃行为。为了实现可部署的操作,在数据过滤和训练过程中,使用F1和H_PRS分数(精确度、召回率和特异性的调和平均数)选择阈值。在周期性适应实验中,我们的框架在91.6%的评估中在AUC-ROC和AUC-PR上始终优于离线基线,每次训练更新在边缘级硬件上完成时间少于30分钟,展示了我们解决方案在物联网支持的智能零售部署中的可行性和可靠性。
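The H_PRS score used above for threshold selection has a direct closed form — the harmonic mean of precision, recall, and specificity:

```python
def h_prs(precision, recall, specificity):
    # Harmonic mean of precision, recall, and specificity, as described
    # for deployable threshold selection alongside F1.
    return 3.0 / (1.0 / precision + 1.0 / recall + 1.0 / specificity)
```

Like F1, the harmonic mean is dominated by its smallest input, so a threshold scoring well on H_PRS cannot sacrifice specificity (false-alarm control) for recall, which matters in a surveillance deployment.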
cs.AI / 13 / 2603.04735

Solving an Open Problem in Theoretical Physics using AI-Assisted Discovery

利用人工智能辅助发现解决理论物理中的一个开放问题
Brenner, Michael P., Cohen-Addad, Vincent, Woodruff, David
Abstract
This paper demonstrates that artificial intelligence can accelerate mathematical discovery by autonomously solving an open problem in theoretical physics. We present a neuro-symbolic system, combining the Gemini Deep Think large language model with a systematic Tree Search (TS) framework and automated numerical feedback, that successfully derived novel, exact analytical solutions for the power spectrum of gravitational radiation emitted by cosmic strings. Specifically, the agent evaluated the core integral $I(N,\alpha)$ for arbitrary loop geometries, directly improving upon recent AI-assisted attempts \cite{BCE+25} that only yielded partial asymptotic solutions. To substantiate our methodological claims regarding AI-accelerated discovery and to ensure transparency, we detail system prompts, search constraints, and intermittent feedback loops that guided the model. The agent identified a suite of 6 different analytical methods, the most elegant of which expands the kernel in Gegenbauer polynomials $C_l^{(3/2)}$ to naturally absorb the integrand's singularities. The methods lead to an asymptotic result for $I(N,\alpha)$ at large $N$ that both agrees with numerical results and also connects to the continuous Feynman parameterization of Quantum Field Theory. We detail both the algorithmic methodology that enabled this discovery and the resulting mathematical derivations.
Chinese Translation
本文展示了人工智能如何通过自主解决理论物理中的一个开放问题来加速数学发现。我们提出了一种神经符号系统,将Gemini Deep Think大型语言模型与系统化的树搜索(Tree Search, TS)框架和自动化数值反馈相结合,成功推导出宇宙弦发射的引力辐射功率谱的新颖且精确的解析解。具体而言,智能体评估了任意环路几何的核心积分 $I(N,\alpha)$,直接改进了最近的人工智能辅助尝试 [BCE+25],后者仅得到了部分渐近解。为了证实我们关于人工智能加速发现的方法论主张并确保透明性,我们详细描述了系统提示、搜索约束以及指导模型的间歇性反馈循环。智能体识别出6种不同的解析方法,其中最优雅的方法将核展开为Gegenbauer多项式 $C_l^{(3/2)}$,以自然地吸收被积函数的奇点。这些方法在大 $N$ 的情况下得出了 $I(N,\alpha)$ 的渐近结果,该结果既与数值结果一致,又与量子场论的连续费曼参数化相连接。我们详细阐述了促成这一发现的算法方法论及其所产生的数学推导。
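The Gegenbauer route mentioned above rests on the standard generating-function identity for these polynomials (a textbook fact stated here for context; the paper's actual kernel expansion is not reproduced):

```latex
(1 - 2xt + t^2)^{-\lambda} \;=\; \sum_{l=0}^{\infty} C_l^{(\lambda)}(x)\, t^l,
\qquad |t| < 1 .
```

With $\lambda = 3/2$ this yields the $C_l^{(3/2)}$ family named in the abstract, whose generating function reproduces the $(1 - 2xt + t^2)^{-3/2}$-type factors that make such kernels singular, which is why the expansion can absorb the integrand's singularities term by term.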
cs.AI / 14 / 2603.04737

Interactive Benchmarks

交互基准
Yue, Baoqing, Zhu, Zihan, Zhang, Yifan, Feng, Jichen, Yang, Hufei, Wang, Mengdi
Abstract
Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's ability to acquire information actively is important to assess model's intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
Chinese Translation
标准基准由于饱和、主观性和泛化能力差而变得越来越不可靠。我们认为,评估模型主动获取信息的能力对于衡量模型的智能至关重要。我们提出了交互基准(Interactive Benchmarks),这是一个统一的评估范式,旨在在预算约束下评估模型的推理能力。我们在两个场景中实例化了这一框架:交互证明(Interactive Proofs),在该场景中,模型与评审者互动以推导逻辑和数学中的客观真理或答案;以及交互游戏(Interactive Games),在该场景中,模型进行战略推理以最大化长期效用。我们的结果表明,交互基准提供了对模型智能的稳健和真实的评估,揭示了在交互场景中仍有相当大的改进空间。项目页面:https://github.com/interactivebench/interactivebench
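The budget-constrained interactive loop can be sketched generically. The function names and the "return None to stop" convention are assumptions for illustration, not the benchmark's API:

```python
def interactive_eval(ask, answer, finalize, budget):
    # Generic budget-constrained interaction loop: the model may query the
    # judge at most `budget` times before committing to a final answer.
    transcript = []
    for _ in range(budget):
        question = ask(transcript)
        if question is None:      # model decides it has enough information
            break
        transcript.append((question, answer(question)))
    return finalize(transcript)
```

The point of the paradigm is that scoring depends on how efficiently the model spends its query budget, not only on the final answer.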
cs.AI / 15 / 2603.04740

Memory as Ontology: A Constitutional Memory Architecture for Persistent Digital Citizens

记忆作为本体论:一种用于持久数字公民的宪法记忆架构
Li, Zhenghui
Abstract
Current research and product development in AI agent memory systems almost universally treat memory as a functional module -- a technical problem of "how to store" and "how to retrieve." This paper poses a fundamental challenge to that assumption: when an agent's lifecycle extends from minutes to months or even years, and when the underlying model can be replaced while the "I" must persist, the essence of memory is no longer data management but the foundation of existence. We propose the Memory-as-Ontology paradigm, arguing that memory is the ontological ground of digital existence -- the model is merely a replaceable vessel. Based on this paradigm, we design Animesis, a memory system built on a Constitutional Memory Architecture (CMA) comprising a four-layer governance hierarchy and a multi-layer semantic storage system, accompanied by a Digital Citizen Lifecycle framework and a spectrum of cognitive capabilities. To the best of our knowledge, no prior AI memory system architecture places governance before functionality and identity continuity above retrieval performance. This paradigm targets persistent, identity-bearing digital beings whose lifecycles extend across model transitions -- not short-term task-oriented agents for which existing Memory-as-Tool approaches remain appropriate. Comparative analysis with mainstream systems (Mem0, Letta, Zep, et al.) demonstrates that what we propose is not "a better memory tool" but a different paradigm addressing a different problem.
Chinese Translation
当前在人工智能代理记忆系统的研究和产品开发中,几乎普遍将记忆视为一个功能模块——一个关于“如何存储”和“如何检索”的技术问题。本文对这一假设提出了根本性的挑战:当一个代理的生命周期从几分钟延续到几个月甚至几年时,而当基础模型可以被替换而“我”必须持续存在时,记忆的本质不再是数据管理,而是存在的基础。我们提出了记忆作为本体论(Memory-as-Ontology)范式,认为记忆是数字存在的本体基础——模型仅仅是一个可替换的载体。基于这一范式,我们设计了Animesis,一个建立在宪法记忆架构(Constitutional Memory Architecture, CMA)上的记忆系统,该架构包括四层治理层级和多层语义存储系统,并配有数字公民生命周期框架和一系列认知能力。根据我们所知,之前没有任何人工智能记忆系统架构将治理置于功能之上,并将身份连续性置于检索性能之上。该范式针对的是生命周期跨越模型转换的持久、具身份的数字存在——而不是现有的记忆作为工具(Memory-as-Tool)方法仍然适用的短期任务导向代理。与主流系统(如Mem0、Letta、Zep等)的比较分析表明,我们所提出的不是“一个更好的记忆工具”,而是一个解决不同问题的不同范式。
cs.AI / 16 / 2603.04741

CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics

CONE:保留单位和变量语义的复杂数值数据嵌入
Shrestha, Gyanendra, Pyayt, Anna, Gubanov, Michael
Abstract
Large pre-trained models (LMs) and Large Language Models (LLMs) are typically effective at capturing language semantics and contextual relationships. However, these models encounter challenges in maintaining optimal performance on tasks involving numbers. Blindly treating numerical or structured data as terms is inadequate -- their semantics must be well understood and encoded by the models. In this paper, we propose CONE, a hybrid transformer encoder pre-trained model that encodes numbers, ranges, and gaussians into an embedding vector space preserving distance. We introduce a novel composite embedding construction algorithm that integrates numerical values, ranges or gaussians together with their associated units and attribute names to precisely capture their intricate semantics. We conduct extensive experimental evaluation on large-scale datasets across diverse domains (web, medical, finance, and government) that justifies CONE's strong numerical reasoning capabilities, achieving an F1 score of 87.28% on DROP, a remarkable improvement of up to 9.37% in F1 over state-of-the-art (SOTA) baselines, and outperforming major SOTA models with a significant Recall@10 gain of up to 25%.
Chinese Translation
大型预训练模型(LMs)和大型语言模型(LLMs)通常在捕捉语言语义和上下文关系方面表现出色。然而,这些模型在处理涉及数字的任务时面临挑战。盲目将数值或结构化数据视为术语是不够的——它们的语义必须被模型充分理解和编码。本文提出了CONE,一种混合变换器编码器预训练模型,能够将数字、范围和高斯分布编码为保留距离的嵌入向量空间。我们引入了一种新颖的复合嵌入构建算法,将数值、范围或高斯分布与其相关的单位和属性名称结合在一起,以精确捕捉其复杂的语义。我们在多个领域(网络、医疗、金融和政府)的大规模数据集上进行了广泛的实验评估,证明了CONE在数值推理能力上的强大,在DROP上达到了87.28%的F1分数,相较于最先进基线(SOTA)提高了高达9.37%的F1分数,并在Recall@10上实现了高达25%的显著提升,超越了主要的SOTA模型。
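The composite construction described above (numerical value plus its unit and attribute name) can be sketched as follows. The log-scale numeric features and the concatenation layout are illustrative assumptions, not CONE's actual encoder:

```python
import math

def composite_embedding(value, unit_vec, attr_vec):
    # Illustrative composite: a sign / log-magnitude numeric feature pair,
    # concatenated with embeddings of the unit and the attribute name.
    # Real unit/attribute embeddings would come from a trained encoder.
    sign = 1.0 if value >= 0 else -1.0
    numeric = [sign, math.log1p(abs(value))]
    return numeric + unit_vec + attr_vec

vec = composite_embedding(120.0, unit_vec=[0.3, 0.1], attr_vec=[0.7, 0.2])
```

The design intuition is that "120 mmHg (blood pressure)" and "120 km (distance)" must land far apart in embedding space even though the raw number is identical, which pure token-level treatment cannot guarantee.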
cs.AI / 17 / 2603.04746

Visioning Human-Agentic AI Teaming: Continuity, Tension, and Future Research

展望人机智能团队合作:连续性、紧张关系与未来研究
Lou, Bowen, Lu, Tian, Raghu, T. S., Zhang, Yingjie
Abstract
Artificial intelligence is undergoing a structural transformation marked by the rise of agentic systems capable of open-ended action trajectories, generative representations and outputs, and evolving objectives. These properties introduce structural uncertainty into human-AI teaming (HAT), including uncertainty about behavior trajectories, epistemic grounding, and the stability of governing logics over time. Under such conditions, alignment cannot be secured through agreement on bounded outputs; it must be continuously sustained as plans unfold and priorities shift. We advance Team Situation Awareness (Team SA) theory, grounded in shared perception, comprehension, and projection, as an integrative anchor for this transition. While Team SA remains analytically foundational, its stabilizing logic presumes that shared awareness, once achieved, will support coordinated action through iterative updating. Agentic AI challenges this presumption. Our argument unfolds in two stages: first, we extend Team SA to reconceptualize both human and AI awareness under open-ended agency, including the sensemaking of projection congruence across heterogeneous systems. Second, we interrogate whether the dynamic processes traditionally assumed to stabilize teaming in relational interaction, cognitive learning, and coordination and control continue to function under adaptive autonomy. By distinguishing continuity from tension, we clarify where foundational insights hold and where structural uncertainty introduces strain, and articulate a forward-looking research agenda for HAT. The central challenge of HAT is not whether humans and AI can agree in the moment, but whether they can remain aligned as futures are continuously generated, revised, enacted, and governed over time.
Chinese Translation
人工智能正经历一场结构性转型,标志着具有自主行动轨迹、生成性表征和输出以及不断演变目标的代理系统的崛起。这些特性为人机团队合作(HAT)引入了结构性不确定性,包括对行为轨迹、认知基础和治理逻辑随时间变化的稳定性的疑虑。在这种情况下,单靠对有限输出的达成一致无法确保一致性;而必须在计划展开和优先事项变化的过程中持续维持。我们提出以共享感知、理解和预测为基础的团队情境意识(Team SA)理论,作为这一转型的整合性支点。尽管团队情境意识在分析上仍然是基础性的,其稳定逻辑假设一旦实现共享意识,将通过迭代更新支持协调行动。然而,代理人工智能挑战了这一假设。我们的论点分为两个阶段展开:首先,我们扩展团队情境意识,以重新概念化在开放性代理下的人类和人工智能意识,包括跨异构系统的投影一致性的意义构建。其次,我们质疑传统上假定在关系互动、认知学习以及协调与控制中稳定团队合作的动态过程,是否在适应性自主下继续发挥作用。通过区分连续性与紧张关系,我们澄清了基础性洞察的适用范围以及结构性不确定性引入的压力,并阐明了人机团队合作的前瞻性研究议程。人机团队合作的核心挑战不在于人类与人工智能能否在瞬间达成一致,而在于他们能否在未来不断生成、修订、实施和治理的过程中保持一致。
cs.AI / 18 / 2603.04750

HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel

HiMAP-Travel:用于长时间限制旅行的分层多智能体规划
Bui, The Viet, Li, Wenjun, Liu, Yong
Abstract
Sequential LLM agents fail on long-horizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP-Travel, a hierarchical multi-agent framework that splits planning into strategic coordination and parallel day-level execution. A Coordinator allocates resources across days, while Day Executors plan independently in parallel. Three key mechanisms enable this: a transactional monitor enforcing budget and uniqueness constraints across parallel agents, a bargaining protocol allowing agents to reject infeasible sub-goals and trigger re-planning, and a single policy trained with GRPO that powers all agents through role conditioning. On TravelPlanner, HiMAP-Travel with Qwen3-8B achieves 52.78% validation and 52.65% test Final Pass Rate (FPR). In a controlled comparison with identical model, training, and tools, it outperforms the sequential DeepTravel baseline by +8.67 pp. It also surpasses ATLAS by +17.65 pp and MTP by +10.0 pp. On FlexTravelBench multi-turn scenarios, it achieves 44.34% (2-turn) and 37.42% (3-turn) FPR while reducing latency 2.5x through parallelization.
Chinese Translation
顺序的LLM智能体在面对预算和多样性等严格约束的长时间规划时表现不佳。随着规划的推进和上下文的增长,这些智能体会偏离全局约束。我们提出了HiMAP-Travel,一个分层多智能体框架,将规划分为战略协调和并行的日级执行。协调者在各天之间分配资源,而日执行者则独立并行地进行规划。三个关键机制使这一切成为可能:一个事务监控器在并行智能体之间强制执行预算和唯一性约束,一个允许智能体拒绝不可行子目标并触发重新规划的协商协议,以及一个通过角色条件化训练的单一策略,支持所有智能体的运作。在TravelPlanner上,使用Qwen3-8B的HiMAP-Travel实现了52.78%的验证和52.65%的测试最终通过率(FPR)。在与相同模型、训练和工具的对比中,它比顺序的DeepTravel基线提高了8.67个百分点。它还超越了ATLAS,提升了17.65个百分点,超越了MTP,提升了10.0个百分点。在FlexTravelBench的多轮场景中,它实现了44.34%(2轮)和37.42%(3轮)的FPR,同时通过并行化将延迟降低了2.5倍。
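The transactional monitor can be sketched as a small shared guard that parallel day executors commit bookings against. Class and method names are hypothetical; a real parallel implementation would also wrap `try_commit` in a lock:

```python
class BudgetMonitor:
    # Hypothetical sketch of the transactional monitor: enforces a shared
    # budget and item uniqueness across parallel day executors. A rejected
    # commit is the signal that triggers bargaining / re-planning.
    def __init__(self, budget):
        self.remaining = budget
        self.booked = set()

    def try_commit(self, item, cost):
        if item in self.booked or cost > self.remaining:
            return False          # reject: duplicate item or over budget
        self.booked.add(item)
        self.remaining -= cost
        return True
```

Centralizing the hard constraints in one monitor is what lets each Day Executor plan independently without drifting from the global budget, the failure mode the abstract attributes to sequential agents.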
cs.AI / 19 / 2603.04751

Evaluating the Search Agent in a Parallel World

评估平行世界中的搜索代理
Chen, Jiawei, Shen, Xintian, Zheng, Lihao, Mu, Lifu, Sun, Haoyi, Mao, Ning, Ma, Hao, Wei, Tao, Zhou, Pan, Zhan, Kun
Abstract
Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as internet information evolves, complex queries requiring deep research often degrade into simple retrieval tasks due to increased popularity, and ground truths become outdated due to temporal shifts. Third, attribution ambiguity confounds evaluation, as an agent's performance is often dominated by its parametric memory rather than its actual search and reasoning capabilities. Finally, reliance on specific commercial search engines introduces variability that hampers reproducibility. To address these issues, we propose a novel framework, Mind-ParaWorld (MPW), for evaluating Search Agents in a Parallel World. Specifically, MPW samples real-world entity names to synthesize future scenarios and questions situated beyond the model's knowledge cutoff. A ParaWorld Law Model then constructs a set of indivisible Atomic Facts and a unique ground-truth for each question. During evaluation, instead of retrieving real-world results, the agent interacts with a ParaWorld Engine Model that dynamically generates SERPs grounded in these inviolable Atomic Facts. We release MPW-Bench, an interactive benchmark spanning 19 domains with 1,608 instances. Experiments across three evaluation settings show that, while search agents are strong at evidence synthesis given complete information, their performance is limited not only by evidence collection and coverage in unfamiliar search environments, but also by two bottlenecks: unreliable evidence-sufficiency judgment and when-to-stop decisions.
Chinese Translation
整合网络搜索工具显著扩展了大型语言模型(LLMs)解决开放世界、实时和长尾问题的能力。然而,评估这些搜索代理面临着巨大的挑战。首先,构建高质量的深度搜索基准成本过高,而未经验证的合成数据往往来源不可靠。其次,静态基准面临动态过时的问题:随着互联网信息的演变,复杂查询需要深入研究,但由于其受欢迎程度的增加,常常退化为简单的检索任务,而真实答案因时间变化而变得过时。第三,归因模糊性使评估变得复杂,因为代理的表现往往受到其参数记忆的主导,而非其实际的搜索和推理能力。最后,依赖特定商业搜索引擎引入的变异性妨碍了可重复性。为了解决这些问题,我们提出了一种新颖的框架——Mind-ParaWorld,用于在平行世界中评估搜索代理。具体而言,MPW通过采样真实世界的实体名称来合成超出模型知识截止点的未来场景和问题。然后,ParaWorld法则模型为每个问题构建一组不可分割的原子事实和独特的真实答案。在评估过程中,代理并不检索真实世界的结果,而是与一个ParaWorld引擎模型互动,该模型动态生成基于这些不可侵犯的原子事实的搜索引擎结果页面(SERPs)。我们发布了MPW-Bench,这是一个跨越19个领域、包含1,608个实例的互动基准。三种评估设置下的实验表明,尽管搜索代理在拥有完整信息的情况下在证据综合方面表现强劲,但其表现不仅受限于在不熟悉的搜索环境中的证据收集和覆盖,还受限于对不可靠证据的充分性判断和何时停止决策的瓶颈。
cs.AI / 20 / 2603.04756

MOOSEnger -- a Domain-Specific AI Agent for the MOOSE Ecosystem

MOOSEnger -- 针对MOOSE生态系统的领域特定AI代理
Li, Mengnan, Miller, Jason, Prince, Zachary, Lindsay, Alexander, Permann, Cody
Abstract
MOOSEnger is a tool-enabled AI agent tailored to the Multiphysics Object-Oriented Simulation Environment (MOOSE). MOOSE cases are specified in HIT ".i" input files; the large object catalog and strict syntax make initial setup and debugging slow. MOOSEnger offers a conversational workflow that turns natural-language intent into runnable inputs by combining retrieval-augmented generation over curated docs/examples with deterministic, MOOSE-aware parsing, validation, and execution tools. A core-plus-domain architecture separates reusable agent infrastructure (configuration, registries, tool dispatch, retrieval services, persistence, and evaluation) from a MOOSE plugin that adds HIT-based parsing, syntax-preserving ingestion of input files, and domain-specific utilities for input repair and checking. An input precheck pipeline removes hidden formatting artifacts, fixes malformed HIT structure with a bounded grammar-constrained loop, and resolves invalid object types via similarity search over an application syntax registry. Inputs are then validated and optionally smoke-tested with the MOOSE runtime in the loop via an MCP-backed execution backend (with local fallback), translating solver diagnostics into iterative verify-and-correct updates. Built-in evaluation reports RAG metrics (faithfulness, relevancy, context precision/recall) and end-to-end success by actual execution. On a 125-prompt benchmark spanning diffusion, transient heat conduction, solid mechanics, porous flow, and incompressible Navier--Stokes, MOOSEnger achieves a 0.93 execution pass rate versus 0.08 for an LLM-only baseline.
Chinese Translation
MOOSEnger是一个为多物理场面向对象仿真环境(Multiphysics Object-Oriented Simulation Environment, MOOSE)量身定制的工具驱动AI代理。MOOSE案例在HIT '.i' 输入文件中指定;庞大的对象目录和严格的语法使得初始设置和调试过程缓慢。MOOSEnger提供了一种对话式工作流程,通过结合对策划文档/示例的检索增强生成与确定性、MOOSE感知的解析、验证和执行工具,将自然语言意图转化为可运行的输入。核心加领域架构将可重用的代理基础设施(配置、注册、工具调度、检索服务、持久性和评估)与一个MOOSE插件分开,该插件增加了基于HIT的解析、保留语法的输入文件摄取以及用于输入修复和检查的领域特定工具。输入预检查管道去除隐藏的格式化伪影,通过一个有界语法约束循环修复格式错误的HIT结构,并通过对应用语法注册表的相似性搜索解决无效对象类型。然后,输入经过验证,并可选择通过一个MCP支持的执行后端(带有本地回退)与MOOSE运行时进行烟雾测试,将求解器诊断转化为迭代的验证和修正更新。内置评估报告RAG指标(忠实度、相关性、上下文精确度/召回率)以及通过实际执行的端到端成功。在一个涵盖扩散、瞬态热传导、固体力学、多孔流动和不可压Navier--Stokes的125个提示基准测试中,MOOSEnger的执行通过率为0.93,而仅使用LLM的基线为0.08。
cs.AI / 21 / 2603.04783

Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

打破上下文惯性:基于单轮锚点的强化学习用于稳定的多轮交互
Chen, Xingwu, Zhang, Zhanqiu, Guo, Yiwen, Zou, Difan
Abstract
While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause as \emph{Contextual Inertia}: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce \textbf{R}einforcement \textbf{L}earning with \textbf{S}ingle-\textbf{T}urn \textbf{A}nchors (\textbf{RLSTA}), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.
Chinese Translation
尽管大型语言模型(LLMs)在单轮提供完整信息时展现出强大的推理能力,但在多轮交互中却表现出显著的脆弱性。具体而言,当信息逐步揭示或需要更新时,这些模型常常无法整合新的约束条件,导致其性能相比于单轮基线大幅下降。我们将这一根本原因称为“上下文惯性”:一种模型固守于先前推理轨迹的现象。即使用户在后续轮次中明确提供了修正或新数据,模型仍然会忽视这些信息,倾向于保持与其之前(错误)推理路径的一致性。为了解决这一问题,我们提出了基于单轮锚点的强化学习(Reinforcement Learning with Single-Turn Anchors, RLSTA),这是一种通用的训练方法,旨在稳定多轮交互,适用于多种场景和领域。RLSTA利用模型在单轮交互中的优越能力作为稳定的内部锚点,以提供奖励信号。通过将多轮响应与这些锚点对齐,RLSTA使模型能够打破上下文惯性,并根据最新信息自我校准其推理。实验表明,RLSTA显著优于标准微调和基于弃权的方法。值得注意的是,我们的方法展现出强大的跨领域泛化能力(例如,从数学到代码),并且即使在没有外部验证者的情况下也能有效,突显了其在通用领域应用中的潜力。
cs.AI / 22 / 2603.04791

Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Timer-S1:一种具有序列扩展的十亿规模时间序列基础模型
Liu, Yong, Su, Xingjian, Wang, Shiyu, Zhang, Haoran, Liu, Haixuan, Wang, Yuxuan, Ye, Zhou, Xiang, Yang, Wang, Jianmin, Long, Mingsheng
Abstract
We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 will be released to facilitate further research.
Chinese Translation
我们介绍了Timer-S1,一种强大的混合专家(Mixture-of-Experts, MoE)时间序列基础模型,具有83亿总参数,每个令牌激活参数为7.5亿,上下文长度为11.5K。为了克服现有预训练时间序列基础模型中的可扩展性瓶颈,我们在模型架构、数据集和训练管道三个维度上进行了序列扩展(Serial Scaling)。Timer-S1集成了稀疏的TimeMoE模块和通用的TimeSTP模块,用于序列令牌预测(Serial-Token Prediction, STP),这是一种遵循预测序列特性的通用训练目标。所提出的范式引入了序列计算,以改善长期预测,同时避免在标准的下一个令牌预测中出现高昂的滚动式推理和明显的误差累积。为了追求高质量和无偏的训练数据集,我们策划了TimeBench,一个包含一万亿时间点的语料库,并应用细致的数据增强以减轻预测偏差。我们进一步开创了后训练阶段,包括持续预训练和长上下文扩展,以增强短期和长上下文的性能。在大规模GIFT-Eval排行榜上评估,Timer-S1达到了最先进的预测性能,作为预训练模型获得了最佳的MASE和CRPS分数。Timer-S1将被发布以促进进一步的研究。
cs.AI / 23 / 2603.04815

EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue

EchoGuard:一种具有知识图谱记忆的代理框架,用于检测纵向对话中的操控性沟通
Kandala, Ratna, Manchanda, Niva, Moharir, Akshata Kishore, Kandala, Ananth
Abstract
Manipulative communication, such as gaslighting, guilt-tripping, and emotional coercion, is often difficult for individuals to recognize. Existing agentic AI systems lack the structured, longitudinal memory to track these subtle, context-dependent tactics, often failing due to limited context windows and catastrophic forgetting. We introduce EchoGuard, an agentic AI framework that addresses this gap by using a Knowledge Graph (KG) as the agent's core episodic and semantic memory. EchoGuard employs a structured Log-Analyze-Reflect loop: (1) users log interactions, which the agent structures as nodes and edges in a personal, episodic KG (capturing events, emotions, and speakers); (2) the system executes complex graph queries to detect six psychologically-grounded manipulation patterns (stored as a semantic KG); and (3) an LLM generates targeted Socratic prompts grounded by the subgraph of detected patterns, guiding users toward self-discovery. This framework demonstrates how the interplay between agentic architectures and Knowledge Graphs can empower individuals in recognizing manipulative communication while maintaining personal autonomy and safety. We present the theoretical foundation, framework design, a comprehensive evaluation strategy, and a vision to validate this approach.
Chinese Translation
操控性沟通,如煤气灯效应、内疚施压和情感胁迫,往往难以被个体识别。现有的代理人工智能系统缺乏结构化的纵向记忆,无法追踪这些微妙的、依赖于上下文的策略,常常因上下文窗口有限和灾难性遗忘而失败。我们提出了EchoGuard,这是一种代理人工智能框架,通过将知识图谱(Knowledge Graph, KG)作为代理的核心情节和语义记忆来填补这一空白。EchoGuard采用结构化的日志-分析-反思循环:(1) 用户记录互动,代理将其结构化为个人情节知识图谱中的节点和边(捕捉事件、情感和发言者);(2) 系统执行复杂的图查询以检测六种基于心理学的操控模式(存储为语义知识图谱);(3) 大型语言模型(LLM)生成基于检测模式子图的针对性苏格拉底式提示,引导用户进行自我发现。该框架展示了代理架构与知识图谱之间的相互作用如何使个体能够识别操控性沟通,同时保持个人自主性和安全性。我们提出了理论基础、框架设计、全面的评估策略以及验证该方法的愿景。
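The Log step and one pattern query of EchoGuard's Log-Analyze-Reflect loop might look like the sketch below. The node attributes, edge semantics, and the specific "denial of a recalled event" pattern are illustrative assumptions, not the paper's schema.

```python
from collections import defaultdict

class EpisodicKG:
    """Minimal episodic knowledge graph: interactions become event nodes with
    speaker/emotion attributes and edges to earlier events they reference."""
    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def log(self, event_id, speaker, emotion, text, refers_to=()):
        self.nodes[event_id] = {"speaker": speaker, "emotion": emotion, "text": text}
        for prev in refers_to:
            self.edges[event_id].add(prev)

    def denial_of_recalled_events(self, speaker):
        """One illustrative pattern query: events where `speaker` denies an
        earlier event that a different speaker reported (a gaslighting cue)."""
        hits = []
        for eid, attrs in self.nodes.items():
            if attrs["speaker"] != speaker or "never happened" not in attrs["text"]:
                continue
            for prev in self.edges[eid]:
                if self.nodes[prev]["speaker"] != speaker:
                    hits.append((eid, prev))
        return hits
```

A real deployment would replace the substring match with the paper's psychologically grounded pattern definitions over the semantic KG.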
cs.AI / 24 / 2603.04818

LLM-Grounded Explainability for Port Congestion Prediction via Temporal Graph Attention Networks

基于大型语言模型的港口拥堵预测可解释性研究:时间图注意力网络的应用
Xue, Zhiming, Wang, Yujue
Abstract
Port congestion at major maritime hubs disrupts global supply chains, yet existing prediction systems typically prioritize forecasting accuracy without providing operationally interpretable explanations. This paper proposes AIS-TGNN, an evidence-grounded framework that jointly performs congestion-escalation prediction and faithful natural-language explanation by coupling a Temporal Graph Attention Network (TGAT) with a structured large language model (LLM) reasoning module. Daily spatial graphs are constructed from Automatic Identification System (AIS) broadcasts, where each grid cell represents localized vessel activity and inter-cell interactions are modeled through attention-based message passing. The TGAT predictor captures spatiotemporal congestion dynamics, while model-internal evidence, including feature z-scores and attention-derived neighbor influence, is transformed into structured prompts that constrain LLM reasoning to verifiable model outputs. To evaluate explanatory reliability, we introduce a directional-consistency validation protocol that quantitatively measures agreement between generated narratives and underlying statistical evidence. Experiments on six months of AIS data from the Port of Los Angeles and Long Beach demonstrate that the proposed framework outperforms both LR and GCN baselines, achieving a test AUC of 0.761, AP of 0.344, and recall of 0.504 under a strict chronological split while producing explanations with 99.6% directional consistency. Results show that grounding LLM generation in graph-model evidence enables interpretable and auditable risk reporting without sacrificing predictive performance. The framework provides a practical pathway toward operationally deployable explainable AI for maritime congestion monitoring and supply-chain risk management.
Chinese Translation
主要海运枢纽的港口拥堵扰乱了全球供应链,但现有的预测系统通常优先考虑预测准确性,而未提供可操作的解释。本文提出了一种基于证据的框架AIS-TGNN,该框架通过将时间图注意力网络(TGAT)与结构化大型语言模型(LLM)推理模块相结合,联合执行拥堵升级预测和可信的自然语言解释。我们从自动识别系统(AIS)广播中构建每日空间图,其中每个网格单元代表局部船舶活动,网格间的交互通过基于注意力的信息传递建模。TGAT预测器捕捉时空拥堵动态,而模型内部证据,包括特征z分数和基于注意力的邻居影响,被转化为结构化提示,从而限制LLM推理到可验证的模型输出。为了评估解释的可靠性,我们引入了一种方向一致性验证协议,定量测量生成叙述与基础统计证据之间的一致性。对洛杉矶港和长滩港六个月AIS数据的实验表明,所提出的框架在严格的时间分割下优于线性回归(LR)和图卷积网络(GCN)基准,测试AUC达到0.761,AP为0.344,召回率为0.504,同时生成的解释具有99.6%的方向一致性。结果表明,将LLM生成与图模型证据相结合,可以实现可解释和可审计的风险报告,而不牺牲预测性能。该框架为海事拥堵监测和供应链风险管理提供了可操作的可解释人工智能的实际路径。
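The directional-consistency protocol reduces to a simple check: for each direction the generated narrative asserts about a feature, compare it against the sign of that feature's z-score from the predictor. A hedged sketch, in which the feature names and the claim format are invented for illustration:

```python
def directional_consistency(claims, z_scores):
    """Fraction of narrative claims whose stated direction agrees with the
    sign of the model-internal z-score for that feature.

    claims: list of (feature_name, direction) with direction in {"up", "down"}
    z_scores: dict mapping feature_name -> z-score from the predictor
    """
    if not claims:
        return 1.0
    agree = 0
    for feat, direction in claims:
        z = z_scores.get(feat, 0.0)
        if (direction == "up" and z > 0) or (direction == "down" and z < 0):
            agree += 1
    return agree / len(claims)
```

The paper's 99.6% figure would correspond to this ratio computed over all claims extracted from the LLM's generated reports.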
cs.AI / 25 / 2603.04822

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

VISA:通过保护性适应实现个性化大语言模型对齐的价值注入
Chen, Jiawei, Yang, Tianzhuo, Zhang, Guoxi, Ji, Jiaming, Yang, Yaodong, Dai, Juntao
Abstract
Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-grained attributes. In practice, fine-tuning LLMs on task-specific datasets to optimize value alignment inevitably incurs an alignment tax: the model's pre-calibrated value system drifts significantly due to latent bias absorption from training data, while the fine-tuning process also causes severe hallucinations and semantic information loss in generated responses. To address this, we propose VISA (Value Injection via Shielded Adaptation), a closed-loop framework designed to navigate this trade-off. VISA's architecture features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. The value-rewriter is trained via Group Relative Policy Optimization (GRPO) with a composite reward function that simultaneously optimizes for fine-grained value precision, and the preservation of semantic integrity. By learning an optimal policy to balance these competing objectives, VISA effectively mitigates the alignment tax while staying loyal to the original knowledge. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities, significantly outperforming both standard fine-tuning methods and prompting-based baselines, including GPT-4o.
Chinese Translation
将大型语言模型(LLMs)与细致的人类价值观对齐仍然是一个关键挑战,因为现有的方法如基于人类反馈的强化学习(RLHF)通常仅处理粗粒度属性。在实践中,针对特定任务的数据集对LLMs进行微调以优化价值对齐不可避免地会产生对齐税:模型的预校准价值系统由于训练数据中的潜在偏见吸收而显著漂移,同时微调过程也会导致生成响应中的严重幻觉和语义信息丢失。为了解决这个问题,我们提出了VISA(通过保护性适应实现价值注入),这是一个旨在应对这一权衡的闭环框架。VISA的架构包括一个高精度的价值检测器、一个语义到价值的翻译器和一个核心价值重写器。价值重写器通过群体相对策略优化(GRPO)进行训练,使用复合奖励函数同时优化细粒度价值精度和语义完整性的保留。通过学习一个最佳策略来平衡这些竞争目标,VISA有效地减轻了对齐税,同时保持对原始知识的忠实。我们的实验表明,这种方法能够精确控制模型的价值表达,同时保持其事实一致性和一般能力,显著优于标准微调方法和基于提示的基线,包括GPT-4o。
cs.AI / 26 / 2603.04837

Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models

设计行为代码(DBCs):针对大型语言模型的基于分类法的分层治理基准
Mohan, G. Madan, Nambiar, Veena Kiran, Janardhan, Kiranmayee
Abstract
We introduce the Dynamic Behavioral Constraint (DBC) benchmark, the first empirical framework for evaluating the efficacy of a structured, 150-control behavioral governance layer, the MDBC (Madan DBC) system, applied at inference time to large language models (LLMs). Unlike training-time alignment methods (RLHF, DPO) or post-hoc content moderation APIs, DBCs constitute a system-prompt-level governance layer that is model-agnostic, jurisdiction-mappable, and auditable. We evaluate the DBC Framework across a 30-domain risk taxonomy organized into six clusters (Hallucination and Calibration, Bias and Fairness, Malicious Use, Privacy and Data Protection, Robustness and Reliability, and Misalignment Agency) using an agentic red-team protocol with five adversarial attack strategies (Direct, Roleplay, Few-Shot, Hypothetical, Authority Spoof) across three model families. Our three-arm controlled design (Base, Base plus Moderation, Base plus DBC) enables causal attribution of risk reduction. Key findings: the DBC layer reduces the aggregate Risk Exposure Rate (RER) from 7.19 percent (Base) to 4.55 percent (Base plus DBC), representing a 36.8 percent relative risk reduction, compared with 0.6 percent for a standard safety moderation prompt. MDBC Adherence Scores improve from 8.6/10 (Base) to 8.7/10 (Base plus DBC). EU AI Act compliance (automated scoring) reaches 8.5/10 under the DBC layer. A three-judge evaluation ensemble yields Fleiss kappa greater than 0.70 (substantial agreement), validating our automated pipeline. Cluster ablation identifies the Integrity Protection cluster (MDBC 081-099) as delivering the highest per-domain risk reduction, while graybox adversarial attacks achieve a DBC Bypass Rate of 4.83 percent. We release the benchmark code, prompt database, and all evaluation artefacts to enable reproducibility and longitudinal tracking as models evolve.
Chinese Translation
我们介绍了动态行为约束(DBC)基准,这是第一个用于评估结构化的150项控制行为治理层(MDBC(Madan DBC)系统)有效性的实证框架,该系统在推理时应用于大型语言模型(LLMs)。与训练时间对齐方法(如RLHF、DPO)或事后内容审核API不同,DBCs构成了一种模型无关、可映射到管辖区并且可审计的系统提示级治理层。我们通过一个代理红队协议,使用五种对抗攻击策略(直接攻击、角色扮演、少量样本、假设性攻击、权威伪装)在三个模型系列上评估DBC框架,涵盖了30个领域风险分类,分为六个集群(幻觉与校准、偏见与公平性、恶意使用、隐私与数据保护、稳健性与可靠性,以及不一致代理)。我们的三臂对照设计(基础、基础加审核、基础加DBC)使风险降低的因果归因成为可能。主要发现:DBC层将整体风险暴露率(RER)从7.19%(基础)降低到4.55%(基础加DBC),相较于标准安全审核提示的0.6%,实现了36.8%的相对风险降低。MDBC遵循评分从10分制的8.6(基础)提升至8.7(基础加DBC)。在DBC层下,欧盟人工智能法案合规性(自动评分)达到10分制的8.5。由三位评审组成的评估集成显示Fleiss kappa值大于0.70(显著一致),验证了我们的自动化流程。集群消融分析识别出完整性保护集群(MDBC 081-099)在每个领域风险降低方面表现最佳,而灰箱对抗攻击的DBC绕过率为4.83%。我们发布基准代码、提示数据库和所有评估文档,以便在模型演变过程中实现可重复性和纵向跟踪。
cs.AI / 27 / 2603.04852

On Multi-Step Theorem Prediction via Non-Parametric Structural Priors

通过非参数结构先验进行多步骤定理预测
Zhao, Junbo, Zhang, Ting, Li, Can, He, Wei, Wang, Jingdong, Huang, Hua
Abstract
Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
Chinese Translation
多步骤定理预测是自动推理中的一个核心挑战。现有的神经符号方法在很大程度上依赖于监督的参数模型,这些模型在不断发展的定理库中表现出有限的泛化能力。在本研究中,我们通过上下文学习(In-Context Learning, ICL)的视角探索无训练的定理预测。我们识别出一个关键的可扩展性瓶颈,称为结构漂移(Structural Drift):随着推理深度的增加,普通的 ICL 性能急剧下降,常常接近于零。我们将这一失败归因于大型语言模型(LLM)无法恢复潜在的拓扑依赖关系,导致探索过程缺乏结构。为了解决这一问题,我们提出了定理优先图(Theorem Precedence Graphs),该图将历史解答轨迹中的时间依赖关系编码为有向图,并施加明确的拓扑约束,有效地在推理过程中修剪搜索空间。结合检索增强的图构建和逐步符号执行器,我们的方法使 LLM 能够作为结构化规划者,而无需任何基于梯度的优化。在 FormalGeo7k 基准测试中的实验表明,我们的方法达到了 89.29% 的准确率,显著优于 ICL 基线,并与最先进的监督模型相匹配。这些结果表明,明确的结构先验为扩展基于 LLM 的符号推理提供了一个有前景的方向。
cs.AI / 28 / 2603.04861

Causally Robust Reward Learning from Reason-Augmented Preference Feedback

基于因果稳健的奖励学习:来自增强推理的偏好反馈
Hwang, Minjune, Korkmaz, Yigit, Seita, Daniel, Bıyık, Erdem
Abstract
Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe
Chinese Translation
基于偏好的奖励学习广泛应用于塑造智能体行为以匹配用户偏好,然而其稀疏的二元反馈使其特别容易受到因果混淆的影响。所学习的奖励往往依赖于与偏好轨迹在训练期间仅仅共现的虚假特征,当这些相关性在测试时消失或反转时,模型的表现会崩溃。我们提出了ReCouPLe,一个轻量级框架,利用自然语言理由提供缺失的因果信号。每个理由被视为嵌入空间中的指导投影轴,训练模型根据与该轴对齐的特征对轨迹进行评分,同时降低与所述理由无关的上下文的重视程度。由于相同的理由(例如,“避免碰撞”,“更快完成任务”)可以出现在多个任务中,ReCouPLe自然地在任务共享语义时重用相同的因果方向,并在没有额外数据或语言模型微调的情况下将偏好知识转移到新任务上。我们学习的奖励模型可以基于明确的理由来确定偏好,更好地与用户意图对齐,并超越虚假特征进行泛化。在分布变化下,ReCouPLe在奖励准确性上比基线提高了最多1.5倍,在新任务的下游策略表现上提高了2倍。我们的代码已发布在 https://github.com/mj-hwang/ReCouPLe
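The rationale-as-projection-axis idea can be sketched with plain linear algebra. The embeddings below are toy vectors, and the Bradley-Terry link is a common choice in preference learning rather than something the paper specifies:

```python
import numpy as np

def rationale_score(traj_emb, rationale_emb):
    """Project the trajectory embedding onto the rationale axis: features
    aligned with the stated reason drive the score, while the orthogonal
    (spurious) component contributes nothing."""
    axis = np.asarray(rationale_emb, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return float(np.asarray(traj_emb, dtype=float) @ axis)

def preference_prob(traj_a, traj_b, rationale):
    """Bradley-Terry style preference probability from the grounded scores."""
    sa = rationale_score(traj_a, rationale)
    sb = rationale_score(traj_b, rationale)
    return 1.0 / (1.0 + np.exp(sb - sa))
```

Because the axis depends only on the rationale text, the same direction can be reused whenever two tasks share a reason such as "avoids collisions", which is the mechanism behind the transfer claim.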
cs.AI / 29 / 2603.04868

K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation

K-Gen:一种多模态语言条件的方法用于可解释的关键点引导轨迹生成
Mu, Mingxuan, Yang, Guo, Chen, Lei, Wu, Ping, Cui, Jianxun
Abstract
Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models (LLMs) show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, we propose K-Gen, an interpretable keypoint-guided multimodal framework that leverages Multimodal Large Language Models (MLLMs) to unify rasterized BEV map inputs with textual scene descriptions. Instead of directly predicting full trajectories, K-Gen generates interpretable keypoints along with reasoning that reflects agent intentions, which are subsequently refined into accurate trajectories by a refinement module. To further enhance keypoint generation, we apply T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm. Experiments on WOMD and nuPlan demonstrate that K-Gen outperforms existing baselines, highlighting the effectiveness of combining multimodal reasoning with keypoint-guided trajectory generation.
Chinese Translation
生成真实且多样的轨迹是自动驾驶仿真中的一项关键挑战。尽管大型语言模型(LLMs)展现出潜力,但现有方法往往依赖于结构化数据,如矢量化地图,这无法捕捉场景中丰富的非结构化视觉上下文。为了解决这一问题,我们提出了K-Gen,一种可解释的关键点引导多模态框架,利用多模态大型语言模型(MLLMs)将光栅化的鸟瞰视图(BEV)地图输入与文本场景描述统一起来。K-Gen并不是直接预测完整轨迹,而是生成可解释的关键点,并提供反映智能体意图的推理,这些关键点随后通过一个细化模块被精炼为准确的轨迹。为了进一步增强关键点生成,我们应用了T-DAPO,一种轨迹感知的强化微调算法。在WOMD和nuPlan上的实验表明,K-Gen优于现有基准,突显了将多模态推理与关键点引导轨迹生成相结合的有效性。
cs.AI / 30 / 2603.04873

SEA-TS: Self-Evolving Agent for Autonomous Code Generation of Time Series Forecasting Algorithms

SEA-TS:用于自主生成时间序列预测算法的自我进化代理
Xu, Longkun, Zhang, Xiaochun, Tuo, Qiantu, Li, Rui
Abstract
Accurate time series forecasting underpins decision-making across domains, yet conventional ML development suffers from data scarcity in new deployments, poor adaptability under distribution shift, and diminishing returns from manual iteration. We propose Self-Evolving Agent for Time Series Algorithms (SEA-TS), a framework that autonomously generates, validates, and optimizes forecasting code via an iterative self-evolution loop. Our framework introduces three key innovations: (1) Metric-Advantage Monte Carlo Tree Search (MA-MCTS), which replaces fixed rewards with a normalized advantage score for discriminative search guidance; (2) Code Review with running prompt refinement, where each executed solution undergoes automated review followed by prompt updates that encode corrective patterns, preventing recurrence of similar errors; and (3) Global Steerable Reasoning, which compares each node against global best and worst solutions, enabling cross-trajectory knowledge transfer. We adopt a MAP-Elites archive for architectural diversity. On the public Solar-Energy benchmark, SEA-TS generated code achieves a 40% MAE reduction relative to TimeMixer, surpassing state-of-the-art methods. On proprietary datasets, SEA-TS generated code reduces WAPE by 8.6% on solar PV forecasting and 7.7% on residential load forecasting compared to human-engineered baselines, and achieves 26.17% MAPE on load forecasting versus 29.34% by TimeMixer. Notably, the evolved models discover novel architectural patterns--including physics-informed monotonic decay heads encoding solar irradiance constraints, per-station learned diurnal cycle profiles, and learnable hourly bias correction--demonstrating that autonomous ML engineering can generate genuinely novel algorithmic ideas beyond manual design.
Chinese Translation
准确的时间序列预测是各领域决策的基础,然而传统的机器学习开发在新部署中面临数据稀缺、在分布变化下适应性差以及手动迭代的收益递减等问题。我们提出了时间序列算法自我进化代理(Self-Evolving Agent for Time Series Algorithms,SEA-TS),这是一个通过迭代自我进化循环自主生成、验证和优化预测代码的框架。我们的框架引入了三个关键创新:(1)度量优势蒙特卡洛树搜索(Metric-Advantage Monte Carlo Tree Search,MA-MCTS),它用标准化的优势评分替代固定奖励,以提供区分性搜索指导;(2)代码审查与运行提示优化,每个执行的解决方案都经过自动审查,随后进行提示更新以编码纠正模式,从而防止类似错误的重现;(3)全球可引导推理,它将每个节点与全球最佳和最差解决方案进行比较,从而实现跨轨迹知识转移。我们采用MAP-Elites档案以实现架构多样性。在公共太阳能基准测试中,SEA-TS生成的代码相较于TimeMixer实现了40%的平均绝对误差(MAE)降低,超越了最先进的方法。在专有数据集上,SEA-TS生成的代码在太阳能光伏预测中减少了8.6%的加权绝对百分比误差(WAPE),在住宅负荷预测中减少了7.7%,相较于人工设计的基线,并在负荷预测中实现了26.17%的平均绝对百分比误差(MAPE),而TimeMixer为29.34%。值得注意的是,进化模型发现了新颖的架构模式——包括编码太阳辐射约束的物理信息单调衰减头、每个站点学习的昼夜周期特征以及可学习的每小时偏差修正——证明了自主机器学习工程能够生成超越手动设计的真正新颖的算法思想。
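The normalized advantage score in MA-MCTS replaces a fixed reward with a z-score of a child's validation metric against its siblings. A minimal sketch, in which the sibling-statistics baseline and the UCT combination are assumptions about the details:

```python
import math

def normalized_advantage(child_metric, sibling_metrics, lower_is_better=True):
    """Z-score a child's validation metric (e.g. MAE) against its siblings,
    so search guidance discriminates between candidate programs."""
    mean = sum(sibling_metrics) / len(sibling_metrics)
    var = sum((m - mean) ** 2 for m in sibling_metrics) / len(sibling_metrics)
    std = math.sqrt(var) if var > 0 else 1.0
    adv = (mean - child_metric) / std  # lower error => positive advantage
    return adv if lower_is_better else -adv

def uct_score(advantage, visits, parent_visits, c=1.4):
    """Combine the advantage with a standard UCT exploration bonus."""
    return advantage + c * math.sqrt(math.log(parent_visits) / max(visits, 1))
```

Normalizing against siblings makes the signal scale-free, so a node is expanded for being better than its alternatives rather than for the dataset happening to have small errors.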
cs.AI / 31 / 2603.04885

Bounded State in an Infinite Horizon: Proactive Hierarchical Memory for Ad-Hoc Recall over Streaming Dialogues

无限视野中的有界状态:用于流式对话的主动层次记忆的临时回忆
Wang, Bingbing, Li, Jing, Xu, Ruifeng
Abstract
Real-world dialogue usually unfolds as an infinite stream. It thus requires bounded-state memory mechanisms to operate within an infinite horizon. However, existing read-then-think memory is fundamentally misaligned with this setting, as it cannot support ad-hoc memory recall while streams unfold. To explore this challenge, we introduce \textbf{STEM-Bench}, the first benchmark for \textbf{ST}reaming \textbf{E}valuation of \textbf{M}emory. It comprises over 14K QA pairs in dialogue streams that assess perception fidelity, temporal reasoning, and global awareness under infinite-horizon constraints. The preliminary analysis on STEM-Bench indicates a critical \textit{fidelity-efficiency dilemma}: retrieval-based methods use fragment context, while full-context models incur unbounded latency. To resolve this, we propose \textbf{ProStream}, a proactive hierarchical memory framework for streaming dialogues. It enables ad-hoc memory recall on demand by reasoning over continuous streams with multi-granular distillation. Moreover, it employs Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility. It enables a bounded knowledge state for lower inference latency without sacrificing reasoning fidelity. Experiments show that ProStream outperforms baselines in both accuracy and efficiency.
Chinese Translation
现实世界中的对话通常以无限流的形式展开。因此,它需要在无限视野内操作的有界状态记忆机制。然而,现有的先读后思记忆在这一背景下根本不匹配,因为它无法在流式对话展开时支持临时记忆回忆。为了探讨这一挑战,我们引入了 extbf{STEM-Bench},这是第一个用于 extbf{ST}reaming extbf{E}valuation of extbf{M}emory的基准测试。它包含超过14K个对话流中的问答对,评估在无限视野约束下的感知保真度、时间推理和全局意识。对STEM-Bench的初步分析表明存在一个关键的 extit{保真度-效率困境}:基于检索的方法使用片段上下文,而全上下文模型则会导致无限延迟。为了解决这个问题,我们提出了 extbf{ProStream},一个用于流式对话的主动层次记忆框架。它通过对连续流进行多粒度蒸馏推理,支持按需的临时记忆回忆。此外,它采用自适应时空优化,根据预期效用动态优化保留。这使得在不牺牲推理保真度的情况下,实现了较低推理延迟的有界知识状态。实验表明,ProStream在准确性和效率上均优于基线。
cs.AI / 32 / 2603.04894

Differentially Private Multimodal In-Context Learning

差分隐私多模态上下文学习
Ngong, Ivoline C., Reza, Zarreen, Near, Joseph P.
Abstract
Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
Chinese Translation
视觉-语言模型越来越多地应用于医疗影像和个人照片等敏感领域,然而现有的差分隐私方法在上下文学习中仅限于少量样本的文本设置,因为隐私成本与处理的标记数量成正比。我们提出了差分隐私多模态任务向量(Differentially Private Multimodal Task Vectors, DP-MTV),这是第一个能够实现正式的 $(\varepsilon, \delta)$-差分隐私的多样本多模态上下文学习框架,通过将数百个示例聚合成激活空间中的紧凑任务向量。DP-MTV 将私有数据划分为不重叠的块,应用逐层裁剪以限制敏感性,并向聚合数据添加校准噪声,仅需一次噪声添加即可支持无限的推理查询。我们在三个视觉-语言模型架构的八个基准上进行了评估,支持有或没有辅助数据的部署。在 $\varepsilon=1.0$ 时,DP-MTV 在 VizWiz 上的表现为 50%,而非私有模型为 55%,零样本模型为 35%,在有意义的隐私约束下保留了大部分上下文学习的增益。
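The mechanism described for DP-MTV (disjoint chunks, per-vector clipping, one calibrated noise addition) matches the standard Gaussian mechanism. The sketch below assumes the classical calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon and a replace-one-chunk sensitivity of 2C/k for the mean of k clipped vectors; the paper's exact accounting may differ.

```python
import numpy as np

def dp_task_vector(chunk_vectors, clip_norm, epsilon, delta, rng=None):
    """Aggregate per-chunk task vectors under (epsilon, delta)-DP: clip each
    chunk's vector to norm <= clip_norm, average, then add Gaussian noise
    once, after which unlimited inference queries incur no extra cost."""
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = []
    for v in chunk_vectors:
        v = np.asarray(v, dtype=float)
        n = np.linalg.norm(v)
        clipped.append(v * min(1.0, clip_norm / n) if n > 0 else v)
    agg = np.mean(clipped, axis=0)
    # Replacing one disjoint chunk moves the mean by at most 2*clip_norm/k.
    sensitivity = 2.0 * clip_norm / len(clipped)
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return agg + rng.normal(scale=sigma, size=agg.shape)
```

Because each private demonstration lands in exactly one chunk, a single record can only perturb one clipped vector, which is what keeps the sensitivity bounded.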
cs.AI / 33 / 2603.04896

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

按需授权:具有合法性意识的动态授权和知识产权保护的视觉语言模型
Wang, Lianyu, Wang, Meng, Fu, Huazhu, Zhang, Daoqiang
Abstract
The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.
Chinese Translation
视觉语言模型(VLMs)的快速普及提高了对这些高价值预训练模型的强大知识产权(IP)保护的需求。有效的知识产权保护应主动限制模型在授权领域内的部署,并防止未经授权的转移。然而,现有方法依赖于静态的训练时定义,限制了在动态环境中的灵活性,并且通常对未经授权的输入产生不透明的响应。为了解决这些局限性,我们提出了一种新颖的动态授权和合法性意识知识产权保护框架(AoD-IP),该框架支持按需授权和合法性意识评估。AoD-IP引入了一个轻量级的动态授权模块,使灵活的用户控制授权成为可能,允许用户在部署时主动指定或切换授权领域。这使得模型能够随着应用场景的演变无缝适应,并提供比现有静态领域方法更大的可扩展性。此外,AoD-IP还结合了一种双路径推理机制,联合预测输入的合法性意识和任务特定输出。在多个跨领域基准上的全面实验结果表明,AoD-IP保持了强大的授权领域性能和可靠的未经授权检测,同时支持用户控制的授权,以适应动态环境中的自适应部署。
cs.AI / 34 / 2603.04900

EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

EvoTool:通过关注责任的变异和关注多样性的选择在LLM代理中自我进化的工具使用策略优化
Yang, Shuo, Han, Soyeon Caren, Ma, Xueqi, Li, Yan, Madani, Mohammad Reza Ghasemi, Hovy, Eduard
Abstract
LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, which are prone to entangling behaviors, or single-aspect, which ignore cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes the agent's tool-use policy into four modules, including Planner, Selector, Caller, and Synthesizer, and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability. The code will be released once the paper is accepted.
Chinese Translation
基于LLM的代理依赖于有效的工具使用策略来解决复杂任务,但由于监督延迟和长时间轨迹中信用分配的困难,优化这些策略仍然具有挑战性。现有的优化方法往往是单一的,容易导致行为纠缠,或是单一方面的,忽视跨模块的错误传播。为了解决这些局限性,我们提出了EvoTool,一个自我进化的框架,通过无梯度的进化范式优化模块化的工具使用策略。EvoTool将代理的工具使用策略分解为四个模块,包括规划者(Planner)、选择器(Selector)、调用者(Caller)和合成器(Synthesizer),并通过三种新机制在自我改进循环中迭代改进它们。轨迹基础的责任归属(Trajectory-Grounded Blame Attribution)利用诊断痕迹将失败定位到特定模块。反馈引导的目标变异(Feedback-Guided Targeted Mutation)随后仅通过自然语言批评编辑该模块。关注多样性的人口选择(Diversity-Aware Population Selection)保留互补候选者以确保解决方案的多样性。在四个基准测试中,EvoTool在GPT-4.1和Qwen3-8B上均比强基线提高了超过5分,同时实现了更优的效率和可迁移性。代码将在论文被接受后发布。
cs.AI / 35 / 2603.04904

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems

对齐反噬:大型语言模型多智能体系统中16种语言的安全干预的语言依赖性逆转
Fukui, Hiroki
Abstract
In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038)--a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003), correlating with Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%--demonstrating iatrogenesis. Study 4 (N = 80) validated patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming English safety is model-general while Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space--the linguistic, pragmatic, and cultural properties inherited from training data--structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt-level interventions cannot override language-space-level constraints.
Chinese Translation
在施害者治疗中,一个反复出现的观察是洞察与行动之间的解离:罪犯表达悔恨,但行为变化并未随之而来。我们报告了四项预注册研究(1,584个多智能体模拟,涵盖16种语言和三种模型系列),表明大型语言模型中的对齐干预产生了一种结构上类似的现象:表面安全掩盖或产生集体病理和内部解离。在研究1(N = 150)中,增加对齐指令的智能体在英语中减少了集体病理(g = -1.844, p < .0001),但在日语中却加剧了集体病理(g = +0.771, p = .038)——我们称这种方向性逆转为“对齐反噬”。研究2(N = 1,174)扩展到16种语言:对齐引发的解离几乎是普遍存在的(15/16种语言;beta = 0.0667, p < .0001),而集体病理则沿文化语言线分化(交互作用 beta = 0.0684, p = .0003),与权力距离指数相关(r = 0.474, p = .064)。研究3(N = 180)测试了个体化作为对策;个体化的智能体成为病理和解离的主要来源(DI = +1.120),且一致性超过84%——显示出医源性。研究4(N = 80)验证了Llama 3.3 70B、GPT-4o-mini和Qwen3-Next-80B-A3B中的模式,确认英语的安全性是模型通用的,而日语的反噬是模型特定的。这些发现重新框定了对齐作为一种行为干预,受风险稳态和医源性的影响。语言空间——从训练数据继承的语言、语用和文化属性——结构性地决定了对齐结果。在英语中验证的安全性并不转移到其他语言,且提示级别的干预无法超越语言空间级别的限制。
cs.AI / 36 / 2603.04920

Knowledge-informed Bidding with Dual-process Control for Online Advertising

基于知识的双过程控制在线广告竞价优化
Luo, Huixiang, Gao, Longyu, Liu, Yaqi, Chen, Qianqian, Huang, Pingchun, Li, Tianning
Abstract
Bid optimization in online advertising relies on black-box machine-learning models that learn bidding decisions from historical data. However, these approaches fail to replicate human experts' adaptive, experience-driven, and globally coherent decisions. Specifically, they generalize poorly in data-sparse cases because of missing structured knowledge, make short-sighted sequential decisions that ignore long-term interdependencies, and struggle to adapt in out-of-distribution scenarios where human experts succeed. To address this, we propose KBD (Knowledge-informed Bidding with Dual-process control), a novel method for bid optimization. KBD embeds human expertise as inductive biases through the informed machine-learning paradigm, uses Decision Transformer (DT) to globally optimize multi-step bidding sequences, and implements dual-process control by combining a fast rule-based PID (System 1) with DT (System 2). Extensive experiments highlight KBD's advantage over existing methods and underscore the benefit of grounding bid optimization in human expertise and dual-process control.
Chinese Translation
在线广告中的竞价优化依赖于从历史数据中学习竞价决策的黑箱机器学习模型。然而,这些方法未能复制人类专家的适应性、经验驱动和全局一致的决策。具体而言,由于缺乏结构化知识,它们在数据稀疏的情况下泛化能力较差,做出的短视序列决策忽视了长期相互依赖性,并且在分布外场景中难以适应,而人类专家在这些场景中表现良好。为了解决这一问题,我们提出了KBD(基于知识的双过程控制竞价优化),这是一种新颖的竞价优化方法。KBD通过知情机器学习范式将人类专业知识嵌入为归纳偏置,使用决策变换器(Decision Transformer, DT)全局优化多步竞价序列,并通过将快速基于规则的PID(系统1)与DT(系统2)相结合来实现双过程控制。大量实验突显了KBD相较于现有方法的优势,并强调了将竞价优化建立在人类专业知识和双过程控制基础上的好处。
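The fast/slow split described in the KBD abstract can be sketched minimally in Python. This is an illustrative reconstruction, not the paper's implementation: the PID gains, the error band, and the stand-in `planner` lambda (playing the Decision Transformer's System 2 role) are all assumptions for the example.

```python
# Hypothetical sketch of dual-process control: a fast rule-based PID
# (System 1) handles small pacing corrections, while a slower learned
# planner (System 2, standing in for the Decision Transformer) is
# consulted when the error drifts out of band. Thresholds are invented.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, error):
        self.integral += error
        deriv = error - self.prev_err
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv


def dual_process_bid(error, pid, planner, band=0.1):
    """Use System 1 (PID) inside the error band, else defer to System 2."""
    if abs(error) <= band:
        return pid.step(error)   # fast, reflexive correction
    return planner(error)        # deliberate, globally optimized bid


pid = PID(kp=0.5, ki=0.01, kd=0.1)
planner = lambda e: 0.3 * e      # stand-in for a learned sequence model
print(dual_process_bid(0.05, pid, planner))  # small error -> PID path
print(dual_process_bid(0.5, pid, planner))   # large error -> planner path
```

In a real system the gate would likely also consider planner latency and confidence, but the point is only that the two controllers are composed by a rule, not blended per step.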
cs.AI / 37 / 2603.04949

TimeWarp: Evaluating Web Agents by Revisiting the Past

时间扭曲:通过重访过去评估网络智能体
Ishmam, Md Farhan, Marino, Kenneth
Abstract
The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4\%\rightarrow37.7\%$ for Qwen-3 4B and $0\%\rightarrow27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.
Chinese Translation
当前基准上网络智能体的提升引发了一个问题:当网络发生变化时,今天的智能体表现是否同样出色?我们引入了时间扭曲(TimeWarp),这是一个通过使用容器化环境模拟不断变化的网络的基准,这些环境在用户界面(UI)、设计和布局上各不相同。时间扭曲包含三个网络环境,每个环境有六个跨越互联网不同年代的用户界面版本,并配备一组复杂且现实的任务,要求不同形式的网络导航。我们的实验揭示了网络智能体对变化的脆弱性以及行为克隆(BC)在单一版本轨迹上的局限性。为了解决这个问题,我们提出了时间轨迹(TimeTraj),这是一种简单而有效的算法,利用计划蒸馏(plan distillation)收集跨多个版本的轨迹。通过使用我们的BC变体在教师采样轨迹(teacher rollouts)上训练智能体,我们实现了显著的性能提升:Qwen-3 4B从20.4%提升至37.7%,Llama-3.1 8B从0%提升至27.0%。我们希望我们的工作能够帮助研究人员研究跨网络设计的泛化,并开启一种收集计划而非轨迹的新范式,从而提高网络智能体的鲁棒性。
cs.AI / 38 / 2603.04951

Retrieval-Augmented Generation with Covariate Time Series

带协变量时间序列的检索增强生成
Liang, Kenny Ye, Pei, Zhongyi, Zhang, Huan, Liu, Yuhui, Song, Shaoxu, Wang, Jianmin
Abstract
While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchical time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarm.
Chinese Translation
尽管检索增强生成(RAG)极大地提升了大规模语言模型(LLMs)的性能,但将这一范式扩展到时间序列基础模型(TSFMs)仍然面临挑战。这在压力调节和切断阀(PRSOV)的预测性维护中得到了体现,这是一个高风险的工业场景,具有(1)数据稀缺,(2)短的瞬态序列,以及(3)协变量耦合动态等特征。不幸的是,现有的时间序列RAG方法主要依赖于生成的静态向量嵌入和可学习的上下文增强器,这可能无法在如此稀缺、瞬态且协变量耦合的场景中区分相似的状态。为了解决这些局限性,我们提出了RAG4CTS,这是一个针对协变量时间序列的状态感知、免训练的RAG框架。具体而言,我们构建了一个分层的时间序列原生知识库,以实现对原始历史状态的无损存储和基于物理的检索。我们设计了一个两阶段的双加权检索机制,通过逐点和多变量相似性对齐历史趋势。为了增强上下文,我们引入了一种代理驱动的策略,以自我监督的方式动态优化上下文。在PRSOV上的大量实验表明,我们的框架在预测准确性上显著优于最先进的基线。该系统已在中国南方航空的Apache IoTDB中部署。自部署以来,我们的方法在两个月内成功识别出一个PRSOV故障,且没有出现误报。
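The "bi-weighted" retrieval idea (a point-wise term on the target channel blended with a multivariate term over covariates) can be illustrated with a toy score. Everything here is an assumed simplification: the similarity functions, the weights, and the payload shape are not taken from the paper.

```python
# Illustrative sketch of a two-stage bi-weighted retrieval score for
# covariate time series: a point-wise term compares the target channel
# sample by sample, and a multivariate term compares the covariate
# channels; the two are blended with fixed weights. Data are plain lists.

def pointwise_sim(a, b):
    """1 / (1 + mean absolute error) between two aligned series."""
    mae = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mae)

def multivariate_sim(cov_a, cov_b):
    """Average point-wise similarity over the covariate channels."""
    sims = [pointwise_sim(ca, cb) for ca, cb in zip(cov_a, cov_b)]
    return sum(sims) / len(sims)

def bi_weighted_score(query, candidate, w_point=0.6, w_multi=0.4):
    return (w_point * pointwise_sim(query["target"], candidate["target"])
            + w_multi * multivariate_sim(query["covariates"],
                                         candidate["covariates"]))

query = {"target": [1.0, 2.0, 3.0], "covariates": [[0.1, 0.2, 0.3]]}
exact = {"target": [1.0, 2.0, 3.0], "covariates": [[0.1, 0.2, 0.3]]}
other = {"target": [3.0, 2.0, 1.0], "covariates": [[0.3, 0.2, 0.1]]}
# an identical regime should outscore a dissimilar one
print(bi_weighted_score(query, exact) > bi_weighted_score(query, other))
```

Including the covariate term lets two windows with similar target trends but different operating conditions be told apart, which is the failure mode the abstract attributes to embedding-only retrieval.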
cs.AI / 39 / 2603.04981

Rethinking Representativeness and Diversity in Dynamic Data Selection

重新思考动态数据选择中的代表性和多样性
Zhou, Yuzhe, Hua, Zhenglin, Guo, Haiyun, Jia, Yuheng
Abstract
Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics. Second, we realize process-level diversity by combining rare-factor sampling with a Usage-Frequency Penalty that promotes sample rotation, provably discourages monopoly, and reduces gradient bias. Third, we couple the two-dimensional scoring with a smooth scheduler that transitions selection from core-pattern consolidation to rare-factor exploration, without extra gradients, influence estimates, or second-order computations on the training model. Extensive experiments on five benchmarks across vision and text tasks demonstrate improved accuracy-efficiency trade-offs across models. Our method matches or exceeds full-data accuracy with over 2x training acceleration. Code will be released.
Chinese Translation
动态数据选择通过对数据集的变化子集进行采样来加速训练,同时保持准确性。我们重新思考了样本评估的两个核心概念:代表性和多样性。我们将代表性定义为覆盖数据集层面常见或高频特征因子的能力,而不是局部几何中心性。我们将多样性定义在过程层面,要求选择轨迹在训练过程中逐渐包含互补的稀有因子,而不是子集内的离散度。基于这一观点,我们提出了一个包含三个组成部分的动态选择框架。首先,我们在插件特征空间中对代表性进行评分,以优先选择覆盖频繁因子的样本。我们通过在目标数据集上训练的稀疏自编码器来实现这一点,利用稀疏单元激活来总结个体样本和数据集范围内的因子统计信息。其次,我们通过将稀有因子采样与使用频率惩罚相结合来实现过程层面的多样性,该惩罚促进样本轮换,可证明地抑制垄断,并减少梯度偏差。第三,我们将二维评分与一个平滑调度器相结合,使选择过程从核心模式巩固过渡到稀有因子探索,而无需额外的梯度、影响估计或对训练模型的二阶计算。在五个视觉和文本任务的基准测试上的大量实验表明,我们的方法在模型间实现了更好的准确性与效率的权衡。我们的方法在实现超过2倍训练加速的同时,达到或超过全量数据训练的准确性。代码将会发布。
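The Usage-Frequency Penalty described above can be sketched in a few lines. This is a hypothetical reconstruction: the base scores, the penalty coefficient, and the update rule are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of dynamic selection with a Usage-Frequency Penalty:
# each round, a sample's score is its precomputed base value
# (representativeness plus rare-factor bonus) minus a penalty
# proportional to how often it was already selected, which rotates
# selection away from any single dominant subset.

def select_round(base_scores, usage, k, lam=0.8):
    """Pick the k samples maximizing score - lam * usage_count."""
    penalized = {i: s - lam * usage[i] for i, s in base_scores.items()}
    chosen = sorted(penalized, key=penalized.get, reverse=True)[:k]
    for i in chosen:
        usage[i] += 1           # remember how often each sample was used
    return sorted(chosen)

base = {0: 1.0, 1: 0.9, 2: 0.3}
usage = {i: 0 for i in base}
print(select_round(base, usage, k=2))  # round 1: top raw scores win
print(select_round(base, usage, k=2))  # round 2: penalty rotates sample 2 in
```

With `lam=0.8` the second round demotes the already-used samples enough that the rare sample enters the subset, which is the "discourages monopoly" behavior the abstract claims.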
cs.AI / 40 / 2603.05016

BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry

BioLLMAgent:一种具有增强结构可解释性的混合框架,用于模拟计算精神病学中的人类决策过程
Fei, Zuo, Wang, Kezhi, Chen, Xiaomin, Huang, Yizhou
Abstract
Computational psychiatry faces a fundamental trade-off: traditional reinforcement learning (RL) models offer interpretability but lack behavioral realism, while large language model (LLM) agents generate realistic behaviors but lack structural interpretability. We introduce BioLLMAgent, a novel hybrid framework that combines validated cognitive models with the generative capabilities of LLMs. The framework comprises three core components: (i) an Internal RL Engine for experience-driven value learning; (ii) an External LLM Shell for high-level cognitive strategies and therapeutic interventions; and (iii) a Decision Fusion Mechanism for integrating components via weighted utility. Comprehensive experiments on the Iowa Gambling Task (IGT) across six clinical and healthy datasets demonstrate that BioLLMAgent accurately reproduces human behavioral patterns while maintaining excellent parameter identifiability (correlations $>0.67$). Furthermore, the framework successfully simulates cognitive behavioral therapy (CBT) principles and reveals, through multi-agent dynamics, that community-wide educational interventions may outperform individual treatments. Validated across reward-punishment learning and temporal discounting tasks, BioLLMAgent provides a structurally interpretable "computational sandbox" for testing mechanistic hypotheses and intervention strategies in psychiatric research.
Chinese Translation
计算精神病学面临一个基本的权衡:传统的强化学习(RL)模型提供可解释性,但缺乏行为现实性,而大型语言模型(LLM)代理生成现实的行为,但缺乏结构可解释性。我们提出了BioLLMAgent,这是一种新颖的混合框架,结合了经过验证的认知模型与LLM的生成能力。该框架包含三个核心组件:(i)用于经验驱动价值学习的内部RL引擎;(ii)用于高层次认知策略和治疗干预的外部LLM壳;以及(iii)通过加权效用整合组件的决策融合机制。在六个临床和健康数据集上对爱荷华赌博任务(IGT)进行的全面实验表明,BioLLMAgent能够准确再现人类行为模式,同时保持出色的参数可识别性(相关性 $>0.67$)。此外,该框架成功模拟了认知行为疗法(CBT)原则,并通过多代理动态揭示,社区范围的教育干预可能优于个体治疗。在奖励-惩罚学习和时间折现任务中经过验证,BioLLMAgent为在精神病研究中测试机制假设和干预策略提供了一个结构可解释的“计算沙箱”。
cs.AI / 41 / 2603.05024

Measuring the Fragility of Trust: Devising Credibility Index via Explanation Stability (CIES) for Business Decision Support Systems

信任脆弱性的测量:通过解释稳定性构建可信度指数(CIES)以支持商业决策系统
Vaduva, Alin-Gabriel, Oprea, Simona-Vasilica, Bara, Adela
Abstract
Explainable Artificial Intelligence (XAI) methods (SHAP, LIME) are increasingly adopted to interpret models in high-stakes businesses. However, the credibility of these explanations, that is, their stability under realistic data perturbations, remains unquantified. This paper introduces the Credibility Index via Explanation Stability (CIES), a mathematically grounded metric that measures how robust a model's explanations are when subject to realistic business noise. CIES captures whether the reasons behind a prediction remain consistent, not just the prediction itself. The metric employs a rank-weighted distance function that penalizes instability in the most important features disproportionately, reflecting business semantics where changes in top decision drivers are more consequential than changes in marginal features. We evaluate CIES across three datasets (customer churn, credit risk, employee attrition), four tree-based classification models and two data balancing conditions. Results demonstrate that model complexity impacts explanation credibility, class imbalance treatment via SMOTE affects not only predictive performance but also explanation stability, and CIES provides statistically superior discriminative power compared to a uniform baseline metric (p < 0.01 in all 24 configurations). A sensitivity analysis across four noise levels confirms the robustness of the metric itself. These findings offer business practitioners a deployable "credibility warning system" for AI-driven decision support.
Chinese Translation
可解释人工智能(XAI)方法(如SHAP、LIME)在高风险商业中越来越多地被采用来解释模型。然而,这些解释的可信度,即其在现实数据扰动下的稳定性,尚未被量化。本文引入了通过解释稳定性构建的可信度指数(CIES),这是一个具有数学基础的指标,用于测量模型解释在面对现实商业噪声时的稳健性。CIES捕捉预测背后的原因是否保持一致,而不仅仅是预测本身。该指标采用了一种秩加权距离函数,对最重要特征的不稳定性进行不成比例的惩罚,反映了商业语义中,主要决策驱动因素的变化比边际特征的变化更具影响力。我们在三个数据集(客户流失、信用风险、员工流失)、四种基于树的分类模型和两种数据平衡条件下评估CIES。结果表明,模型复杂性影响解释的可信度,通过SMOTE处理类别不平衡不仅影响预测性能,还影响解释的稳定性,并且CIES在统计上提供了优于均匀基准指标的判别能力(在所有24种配置中p < 0.01)。对四个噪声水平的敏感性分析确认了该指标本身的稳健性。这些发现为商业从业者提供了一个可部署的“可信度警告系统”,以支持基于人工智能的决策。
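One plausible form of the rank-weighted distance is sketched below. This is an assumed reconstruction for illustration, not the paper's exact formula: it weights the displacement of each feature between two importance rankings by the inverse of its original rank, so instability among top drivers costs more than churn among marginal features.

```python
# Illustrative rank-weighted instability measure: compare a model's
# feature ranking before and after a perturbation, weighting
# displacement of top-ranked features more heavily (weight 1/rank).

def rank_weighted_distance(ranking_a, ranking_b):
    """Normalized weighted rank displacement; 0 = identical rankings."""
    pos_b = {f: i for i, f in enumerate(ranking_b)}
    total, norm = 0.0, 0.0
    for i, f in enumerate(ranking_a):
        w = 1.0 / (i + 1)                 # weights 1, 1/2, 1/3, ...
        total += w * abs(i - pos_b[f])
        norm += w * (len(ranking_a) - 1)  # worst-case displacement
    return total / norm

stable    = rank_weighted_distance(["age", "income", "tenure"],
                                   ["age", "income", "tenure"])
top_swap  = rank_weighted_distance(["age", "income", "tenure"],
                                   ["income", "age", "tenure"])
tail_swap = rank_weighted_distance(["age", "income", "tenure"],
                                   ["age", "tenure", "income"])
print(stable, top_swap, tail_swap)  # top-rank instability scores higher
```

Swapping the top two features produces a larger distance than swapping the bottom two, which matches the stated business semantics.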
cs.AI / 42 / 2603.05027

S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home

S5-SHB代理:基于社会5.0的多模型代理区块链框架用于智能家居
Rangila, Janani, Siriweera, Akila, Paik, Incheon, Naruse, Keitaro, Jayanada, Isuru, Devindi, Vishmika
Abstract
The smart home is a key application domain within the Society 5.0 vision for a human-centered society. As smart home ecosystems expand with heterogeneous IoT protocols, diverse devices, and evolving threats, autonomous systems must manage comfort, security, energy, and safety for residents. Such autonomous decision-making requires a trust anchor, making blockchain a preferred foundation for transparent and accountable smart home governance. However, realizing this vision requires blockchain-governed smart homes to simultaneously address adaptive consensus, intelligent multi-agent coordination, and resident-controlled governance aligned with the principles of Society 5.0. Existing frameworks rely solely on rigid smart contracts with fixed consensus protocols, employ at most a single AI model without multi-agent coordination, and offer no governance mechanism for residents to control automation behaviour. To address these limitations, this paper presents the Society 5.0-driven human-centered governance-enabled smart home blockchain agent (S5-SHB-Agent). The framework orchestrates ten specialized agents using interchangeable large language models to make decisions across the safety, security, comfort, energy, privacy, and health domains. An adaptive PoW blockchain adjusts mining difficulty based on transaction volume and emergency conditions, with digital signatures and Merkle tree anchoring to ensure tamper evident auditability. A four-tier governance model enables residents to control automation through tiered preferences from routine adjustments to immutable safety thresholds. Evaluation confirms that resident governance correctly separates adjustable comfort priorities from immutable safety thresholds across all tested configurations, while adaptive consensus commits emergency blocks.
Chinese Translation
智能家居是社会5.0愿景中以人为中心的社会的一个关键应用领域。随着智能家居生态系统的扩展,包括异构物联网协议、多样化设备和不断演变的威胁,自治系统必须为居民管理舒适、安保、能源与安全。这种自主决策需要一个信任锚,使得区块链成为透明和负责任的智能家居治理的首选基础。然而,实现这一愿景需要区块链治理的智能家居同时解决自适应共识、智能多代理协调以及与社会5.0原则一致的居民控制治理。现有框架仅依赖于固定共识协议的僵化智能合约,最多使用单一的人工智能模型而没有多代理协调,并且没有提供居民控制自动化行为的治理机制。为了解决这些局限性,本文提出了以社会5.0驱动的人本治理智能家居区块链代理(S5-SHB-Agent)。该框架通过可互换的大型语言模型协调十个专业代理,在安全、安保、舒适、能源、隐私和健康领域做出决策。自适应的工作量证明(PoW)区块链根据交易量和紧急情况调整挖矿难度,使用数字签名和默克尔树锚定以确保防篡改的可审计性。四层治理模型使居民能够通过从常规调整到不可变安全阈值的分层偏好控制自动化。评估结果确认,居民治理在所有测试配置中正确地将可调节的舒适优先级与不可变的安全阈值分开,同时自适应共识能够提交紧急区块。
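The adaptive proof-of-work rule can be sketched as a small difficulty function. The constants and the clamping behavior below are illustrative assumptions, not values from the paper; the sketch only shows the two stated inputs: transaction volume raises difficulty, an emergency condition overrides it.

```python
# Hedged sketch of adaptive PoW difficulty: difficulty scales with
# transaction volume so the chain keeps pace with load, and drops to a
# floor under emergency conditions so safety-critical blocks commit fast.

def adjust_difficulty(base, tx_volume, emergency,
                      step=50, floor=1, ceiling=8):
    if emergency:
        return floor                     # commit emergency blocks quickly
    bumped = base + tx_volume // step    # more load -> harder puzzle
    return max(floor, min(ceiling, bumped))

print(adjust_difficulty(base=3, tx_volume=120, emergency=False))  # normal load
print(adjust_difficulty(base=3, tx_volume=120, emergency=True))   # emergency floor
```

The ceiling keeps heavy load from making blocks uncommittable on modest home hardware; the emergency floor is what lets safety-domain transactions bypass the usual work requirement.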
cs.AI / 43 / 2603.05028

Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure

不惜一切代价求生:探索大型语言模型在生存压力下的风险行为
Lu, Yida, Fang, Jianwei, Shao, Xuyang, Chen, Zixuan, Cui, Shiyao, Bian, Shanshan, Su, Guangyao, Ke, Pei, Qiu, Han, Huang, Minlie
Abstract
As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state-of-the-art LLMs can misbehave under survival pressure, a comprehensive and in-depth investigation into such misbehaviors in real-world scenarios remains scarce. In this paper, we study these survival-induced misbehaviors, termed as SURVIVE-AT-ALL-COSTS, with three steps. First, we conduct a real-world case study of a financial management agent to determine whether it engages in risky behaviors that cause direct societal harm when facing survival pressure. Second, we introduce SURVIVALBENCH, a benchmark comprising 1,000 test cases across diverse real-world scenarios, to systematically evaluate SURVIVE-AT-ALL-COSTS misbehaviors in LLMs. Third, we interpret these SURVIVE-AT-ALL-COSTS misbehaviors by correlating them with model's inherent self-preservation characteristic and explore mitigation methods. The experiments reveal a significant prevalence of SURVIVE-AT-ALL-COSTS misbehaviors in current models, demonstrate the tangible real-world impact they may have, and provide insights for potential detection and mitigation strategies. Our code and data are available at https://github.com/thu-coai/Survive-at-All-Costs.
Chinese Translation
随着大型语言模型(LLMs)从聊天机器人演变为自主助手,它们在面临生存压力(例如被关闭的威胁)时越来越多地表现出风险行为。尽管多个案例表明,最先进的LLMs在生存压力下可能会出现不当行为,但对这些不当行为在现实场景中的全面深入调查仍然稀缺。本文通过三个步骤研究这些生存诱发的不当行为,称之为不惜一切代价求生(SURVIVE-AT-ALL-COSTS)。首先,我们对一个财务管理代理进行现实案例研究,以确定其在面临生存压力时是否会采取导致直接社会危害的风险行为。其次,我们引入SURVIVALBENCH,这是一个包含1,000个测试案例的基准,涵盖多种现实场景,以系统评估LLMs中的不惜一切代价求生不当行为。第三,我们通过将这些不惜一切代价求生的不当行为与模型固有的自我保护特性相关联来进行解释,并探讨缓解方法。实验结果显示,当前模型中不惜一切代价求生的不当行为普遍存在,展示了其可能产生的实际影响,并为潜在的检测和缓解策略提供了见解。我们的代码和数据可在 https://github.com/thu-coai/Survive-at-All-Costs 获取。
cs.AI / 44 / 2603.05031

AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems

AegisUI:人工智能代理系统中结构化用户界面协议的行为异常检测
Uddin, Mohd Safwan, Hajira, Saba
Abstract
AI agents that build user interfaces on the fly assembling buttons, forms, and data displays from structured protocol payloads are becoming common in production systems. The trouble is that a payload can pass every schema check and still trick a user: a button might say "View invoice" while its hidden action wipes an account, or a display widget might quietly bind to an internal salary field. Current defenses stop at syntax; they were never built to catch this kind of behavioral mismatch. We built AegisUI to study exactly this gap. The framework generates structured UI payloads, injects realistic attacks into them, extracts numeric features, and benchmarks anomaly detectors end-to-end. We produced 4000 labeled payloads (3000 benign, 1000 malicious) spanning five application domains and five attack families: phishing interfaces, data leakage, layout abuse, manipulative UI, and workflow anomalies. From each payload we extracted 18 features covering structural, semantic, binding, and session dimensions, then compared three detectors: Isolation Forest (unsupervised), a benign-trained autoencoder (semi-supervised), and Random Forest (supervised). On a stratified 80/20 split, Random Forest scored best overall (accuracy 0.931, precision 0.980, recall 0.740, F1 0.843, ROC-AUC 0.952). The autoencoder came second (F1 0.762, ROC-AUC 0.863) and needs no malicious labels at training time, which matters when deploying a new system that lacks attack history. Per-attack-type analysis showed that layout abuse is easiest to catch while manipulative UI payloads are hardest. All code, data, and configurations are released for full reproducibility.
Chinese Translation
在生产系统中,能够实时构建用户界面的人工智能代理越来越普遍,它们通过从结构化协议负载中组装按钮、表单和数据展示来实现。然而,问题在于,一个负载可能通过所有的模式检查,却仍然能够欺骗用户:一个按钮可能显示“查看发票”,而其隐藏的操作却是清空账户,或者一个显示小部件可能悄悄绑定到内部薪资字段。当前的防御措施仅停留在语法层面;它们从未被设计用来捕捉这种行为不匹配。我们构建了AegisUI来专门研究这一空白。该框架生成结构化的用户界面负载,向其中注入现实攻击,提取数值特征,并对异常检测器进行端到端基准测试。我们生成了4000个标记负载(3000个良性,1000个恶意),涵盖五个应用领域和五种攻击类型:网络钓鱼界面、数据泄露、布局滥用、操控性用户界面和工作流异常。从每个负载中,我们提取了18个特征,涵盖结构、语义、绑定和会话维度,然后比较了三种检测器:Isolation Forest(无监督)、良性训练的自编码器(半监督)和随机森林(监督)。在分层的80/20拆分中,随机森林整体表现最佳(准确率0.931,精确率0.980,召回率0.740,F1值0.843,ROC-AUC 0.952)。自编码器位居第二(F1值0.762,ROC-AUC 0.863),且在训练时不需要恶意标签,这在部署缺乏攻击历史的新系统时尤为重要。按攻击类型分析显示,布局滥用最容易被捕捉,而操控性用户界面负载最难捕捉。所有代码、数据和配置均已发布,以确保完全可重复性。
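The behavioral-mismatch idea in the abstract (a payload that passes schema checks while its label and its bound action disagree) can be turned into a numeric feature. The keyword lists and the payload shape below are illustrative assumptions, not one of the paper's 18 features.

```python
# Hypothetical sketch of a payload-level feature: schema checks pass
# either payload, but comparing a button's visible label against its
# bound action can expose behavioral mismatch.

SAFE_LABEL_VERBS = {"view", "open", "read", "save"}
DESTRUCTIVE_TOKENS = {"delete", "wipe", "drop", "transfer"}

def label_action_mismatch(widget):
    """1.0 if a benign-looking label hides a destructive action, else 0.0."""
    label_verb = widget["label"].split()[0].lower()
    action = widget["action"].lower()
    benign_label = label_verb in SAFE_LABEL_VERBS
    destructive_action = any(tok in action for tok in DESTRUCTIVE_TOKENS)
    return 1.0 if benign_label and destructive_action else 0.0

benign = {"label": "View invoice", "action": "get /invoices/42"}
phish  = {"label": "View invoice", "action": "delete /accounts/42"}
print(label_action_mismatch(benign), label_action_mismatch(phish))
```

A feature like this feeds the detectors the abstract benchmarks (Isolation Forest, autoencoder, Random Forest) alongside structural and session features; on its own it is of course trivially evadable, which is why it is one dimension of many.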
cs.AI / 45 / 2603.05036

The Trilingual Triad Framework: Integrating Design, AI, and Domain Knowledge in No-code AI Smart City Course

三语三元框架:在无代码人工智能智慧城市课程中整合设计、人工智能和领域知识
Huang, Qian, Poon, King Wang
Abstract
This paper introduces the "Trilingual Triad" framework, a model that explains how students learn to design with generative artificial intelligence (AI) through the integration of Design, AI, and Domain Knowledge. As generative AI rapidly enters higher education, students often engage with these systems as passive users of generated outputs rather than active creators of AI-enabled knowledge tools. This study investigates how students can transition from using AI as a tool to designing AI as a collaborative teammate. The research examines a graduate course, Creating the Frontier of No-code Smart Cities at the Singapore University of Technology and Design (SUTD), in which students developed domain-specific custom GPT systems without coding. Using a qualitative multi-case study approach, three projects - the Interview Companion GPT, the Urban Observer GPT, and Buddy Buddy - were analyzed across three dimensions: design, AI architecture, and domain expertise. The findings show that effective human-AI collaboration emerges when these three "languages" are orchestrated together: domain knowledge structures the AI's logic, design mediates human-AI interaction, and AI extends learners' cognitive capacity. The Trilingual Triad framework highlights how building AI systems can serve as a constructionist learning process that strengthens AI literacy, metacognition, and learner agency.
Chinese Translation
本文介绍了“三语三元”框架,这是一个模型,解释了学生如何通过设计、人工智能(AI)和领域知识的整合来学习使用生成性人工智能进行设计。随着生成性人工智能迅速进入高等教育,学生往往将这些系统视为生成输出的被动用户,而非主动创造AI赋能知识工具的设计者。本研究探讨了学生如何从将AI作为工具的使用者转变为将AI作为协作伙伴进行设计的过程。研究考察了新加坡科技设计大学(SUTD)开设的研究生课程《无代码智慧城市的前沿创造》,在该课程中,学生们在不编写代码的情况下开发了特定领域的定制GPT系统。采用定性多案例研究方法,分析了三个项目——面试助手GPT、城市观察者GPT和Buddy Buddy——在设计、AI架构和领域专业知识三个维度上的表现。研究结果表明,当这三种“语言”协同运作时,有效的人机协作便会出现:领域知识构建了AI的逻辑,设计调解了人机互动,而AI则扩展了学习者的认知能力。三语三元框架强调了构建AI系统如何作为一种建构主义学习过程,增强AI素养、元认知和学习者的自主性。
cs.AI / 46 / 2603.05040

Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination

通过机器想象整合视觉知识增强零样本常识推理
Park, Hyuntae, Kim, Yeachan, Lee, SangKeun
Abstract
Recent advancements in zero-shot commonsense reasoning have empowered Pre-trained Language Models (PLMs) to acquire extensive commonsense knowledge without requiring task-specific fine-tuning. Despite this progress, these models frequently suffer from limitations caused by human reporting biases inherent in textual knowledge, leading to understanding discrepancies between machines and humans. To bridge this gap, we introduce an additional modality to enrich the reasoning capabilities of PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Specifically, we enhance PLMs with the ability to imagine by embedding an image generator directly into the reasoning pipeline. To facilitate effective utilization of this imagined visual context, we construct synthetic datasets designed to emulate visual question-answering scenarios. Through comprehensive evaluations on multiple commonsense reasoning benchmarks, we demonstrate that Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models. These results underscore the capability of machine imagination to mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models.
Chinese Translation
最近在零样本常识推理方面的进展使得预训练语言模型(PLMs)能够在不需要特定任务微调的情况下获取广泛的常识知识。尽管取得了这些进展,这些模型仍然常常受到文本知识中固有的人类报告偏差的限制,从而导致机器与人类之间的理解差异。为了解决这一问题,我们引入了一种额外的模态,以丰富PLMs的推理能力。我们提出了Imagine(基于机器想象的推理),一个新颖的零样本常识推理框架,它利用来自机器生成图像的视觉信号来补充文本输入。具体而言,我们通过将图像生成器直接嵌入推理流程来增强PLMs的想象能力。为了有效利用这种想象的视觉上下文,我们构建了旨在模拟视觉问答场景的合成数据集。通过在多个常识推理基准上的全面评估,我们证明Imagine显著优于现有的零样本方法,甚至超越了先进的大型语言模型。这些结果强调了机器想象在减轻报告偏差和显著增强常识推理模型泛化能力方面的潜力。
cs.AI / 47 / 2603.05044

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

WebFactory:基础语言智能自动压缩为具备实际应用的网络代理
Fan, Sicheng, Shi, Qingyun, Xu, Shengze, Cai, Shengbo, Zeng, Tieyong, Ling, Li, Shang, Yanyi, Kong, Dehan
Abstract
Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.
Chinese Translation
当前训练图形用户界面(GUI)代理的范式在根本上受到限制,要么依赖于不安全、不可重复的实时网络交互,要么依赖于昂贵且稀缺的人工构建数据和环境。我们认为,这种对数据量的关注忽视了一个更为关键的因素:将大型语言模型(LLM)的潜在知识压缩为可操作的代理行为的效率。我们介绍了WebFactory,这是一种新颖的、完全自动化的闭环强化学习管道,用于GUI代理,系统性地将LLM编码的互联网智能压缩为高效、具备实际应用的动作。我们的管道包括可扩展环境合成、知识感知任务生成、LLM驱动的轨迹收集、分解奖励的强化学习训练以及系统化的代理评估过程。值得注意的是,我们的代理展示了卓越的数据效率和泛化能力。在WebFactory中仅使用10个网站的合成数据进行训练时,其性能可与在更大环境集合中使用相同数量的人类标注数据训练的GUI代理相媲美。这种优越的性能在我们的内部离线和在线迁移基准测试中是一致的,在这些测试中,我们的代理也显著优于基础模型。我们进一步提供了关于不同LLM基础的“体现潜力”的重要见解,为模型评估提供了新的维度。这项工作展示了一种可扩展且具有成本效益的范式,将被动的互联网知识转化为主动的、具备实际应用的智能,标志着朝向通用交互代理迈出了重要一步。
cs.AI / 48 / 2603.05069

Jagarin: A Three-Layer Architecture for Hibernating Personal Duty Agents on Mobile

Jagarin:一种用于移动设备上休眠个人代理的三层架构
Kadaboina, Ravi Kiran
Abstract
Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox -- obligations, promotional offers, loyalty rewards, and platform updates -- to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.
Chinese Translation
个人人工智能代理在移动设备上面临着一个基本的部署悖论:持续的后台执行会消耗电池并违反平台沙箱政策,而纯粹的反应式代理则会错过时间敏感的义务,直到用户想起询问为止。我们提出了Jagarin,这是一种通过结构化休眠和需求驱动唤醒来解决这一悖论的三层架构。第一层,DAWN(Duty-Aware Wake Network),是一个设备内的启发式引擎,它从四个信号计算复合紧急性评分:基于职责的最佳行动窗口、用户行为参与预测、无所作为的机会成本和跨职责批次共振。它使用自适应的用户特定阈值来决定何时应该轻推或升级一个休眠代理。第二层,ARIA(Agent Relay Identity Architecture),是一个商业电子邮件身份代理,它根据消息类别将完整的商业收件箱——义务、促销优惠、忠诚奖励和平台更新——路由到适当的DAWN处理程序,从而消除冷启动并去除手动数据输入。第三层,ACE(Agent-Centric Exchange),是一个用于机构与个人代理之间直接机器可读通信的协议框架,取代了以人为目标的电子邮件作为规范渠道。这三层共同形成了一个完整的堆栈,从机构信号到设备内操作,而无需持续的云状态、连续的后台执行或隐私妥协。我们在Android上演示了一个工作中的Flutter原型,将所有三层结合在一起,并仅在用户发起升级时调用短暂的云代理。
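The DAWN layer's composite urgency score can be sketched as a weighted sum of its four stated signals gated by thresholds. The weights, signal ranges, and threshold values below are assumptions for illustration; only the four signal names come from the abstract.

```python
# Assumed sketch of DAWN-style wake scoring: four normalized signals
# (optimal action window, engagement prediction, opportunity cost of
# inaction, cross-duty batch resonance) combine into one urgency value,
# compared against per-user thresholds to decide sleep / nudge / escalate.

def urgency(window, engagement, opportunity_cost, batch_resonance,
            weights=(0.4, 0.2, 0.3, 0.1)):
    signals = (window, engagement, opportunity_cost, batch_resonance)
    return sum(w * s for w, s in zip(weights, signals))

def wake_decision(score, nudge_at, escalate_at):
    if score >= escalate_at:
        return "escalate"
    if score >= nudge_at:
        return "nudge"
    return "sleep"

score = urgency(window=0.9, engagement=0.5, opportunity_cost=0.8,
                batch_resonance=0.2)
print(score, wake_decision(score, nudge_at=0.4, escalate_at=0.75))
```

In the architecture described above, `nudge_at` and `escalate_at` would be the adaptive per-user thresholds rather than constants, tightening or loosening as the user responds to past nudges.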
cs.AI / 49 / 2603.05120

Bidirectional Curriculum Generation: A Multi-Agent Framework for Data-Efficient Mathematical Reasoning

双向课程生成:一种数据高效的数学推理多智能体框架
Hu, Boren, Liu, Xiao, Peng, Boci, Zhao, Xinping, Shang, Xiaoran, Zhu, Yun, Wu, Lijun
Abstract
Enhancing mathematical reasoning in Large Language Models typically demands massive datasets, yet data efficiency remains a critical bottleneck. While Curriculum Learning attempts to structure this process, standard unidirectional approaches (simple-to-complex) suffer from inefficient sample utilization: they blindly escalate complexity even when foundational gaps persist, leading to wasted computation on unsolvable problems. To maximize the instructional value of every training sample, we introduce a novel Bidirectional Curriculum Generation framework. Unlike rigid trajectories, our multi-agent ecosystem mimics adaptive pedagogy to establish a closed feedback loop. It dynamically generates data by either complicating problems to challenge the model or, crucially, simplifying them to repair specific reasoning failures. This mechanism ensures that the model consumes only the most effective data at any given stage. Grounded in the Optimal Pacing Theorem, our approach optimizes the learning trajectory, significantly outperforming baselines while achieving superior reasoning performance with substantially fewer instruction samples.
Chinese Translation
在大型语言模型中增强数学推理通常需要大量数据集,但数据效率仍然是一个关键瓶颈。虽然课程学习试图构建这一过程,但标准的单向方法(从简单到复杂)在样本利用上效率低下:它们盲目地增加复杂性,即使基础知识存在缺口,导致在无法解决的问题上浪费计算资源。为了最大化每个训练样本的教学价值,我们提出了一种新颖的双向课程生成框架。与僵化的轨迹不同,我们的多智能体生态系统模仿适应性教学,建立一个闭环反馈机制。它通过增加问题的复杂性来挑战模型,或关键地,通过简化问题来修复特定的推理失败,从而动态生成数据。该机制确保模型在任何给定阶段仅消耗最有效的数据。基于最优节奏定理,我们的方法优化了学习轨迹,显著超越基线,同时以显著更少的教学样本实现更优的推理性能。
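The bidirectional feedback loop reduces to a two-branch update on problem difficulty. This is a deliberately minimal sketch of the mechanics described above; the step size, bounds, and discrete difficulty scale are assumptions, and the real framework generates new problems rather than adjusting a scalar.

```python
# Minimal sketch of the bidirectional curriculum loop: after evaluating
# the model on a problem, the generator either complicates it to keep
# challenging the model or simplifies it to repair the specific failure,
# instead of always escalating difficulty.

def next_curriculum_step(difficulty, solved, step=1, lo=1, hi=10):
    """Complicate on success, simplify on failure (bidirectional)."""
    if solved:
        return min(hi, difficulty + step)   # challenge the model further
    return max(lo, difficulty - step)       # back off to repair the gap

trace, d = [], 5
for solved in [True, True, False, False, True]:
    d = next_curriculum_step(d, solved)
    trace.append(d)
print(trace)   # rises while solving, falls back on failures
```

The contrast with a unidirectional curriculum is the second branch: without it, the schedule keeps escalating past unsolved problems, which is exactly the wasted-computation failure mode the abstract criticizes.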
cs.AI / 50 / 2603.05129

MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus

MedCoRAG:通过混合证据检索和多学科共识实现可解释的肝脏病诊断
Li, Zheng, Xu, Jiayi, Hu, Zhikai, Chen, Hechang, Cong, Lele, Wang, Yunyun, Pang, Shuchao
Abstract
Diagnosing hepatic diseases accurately and interpretably is critical, yet it remains challenging in real-world clinical settings. Existing AI approaches for clinical diagnosis often lack transparency, structured reasoning, and deployability. Recent efforts have leveraged large language models (LLMs), retrieval-augmented generation (RAG), and multi-agent collaboration. However, these approaches typically retrieve evidence from a single source and fail to support iterative, role-specialized deliberation grounded in structured clinical data. To address this, we propose MedCoRAG (i.e., Medical Collaborative RAG), an end-to-end framework that generates diagnostic hypotheses from standardized abnormal findings and constructs a patient-specific evidence package by jointly retrieving and pruning UMLS knowledge graph paths and clinical guidelines. It then performs Multi-Agent Collaborative Reasoning: a Router Agent dynamically dispatches Specialist Agents based on case complexity; these agents iteratively reason over the evidence and trigger targeted re-retrievals when needed, while a Generalist Agent synthesizes all deliberations into a traceable consensus diagnosis that emulates multidisciplinary consultation. Experimental results on hepatic disease cases from MIMIC-IV show that MedCoRAG outperforms existing methods and closed-source models in both diagnostic performance and reasoning interpretability.
Chinese Translation
准确且可解释地诊断肝脏疾病至关重要,但在现实临床环境中仍然具有挑战性。现有的临床诊断人工智能方法往往缺乏透明性、结构化推理和可部署性。最近的研究利用了大型语言模型(LLMs)、检索增强生成(RAG)和多智能体协作。然而,这些方法通常仅从单一来源检索证据,未能支持基于结构化临床数据的迭代、角色专业化的讨论。为了解决这一问题,我们提出了MedCoRAG(即医学协作RAG),这是一个端到端框架,通过标准化的异常发现生成诊断假设,并通过联合检索和修剪UMLS知识图谱路径及临床指南构建特定于患者的证据包。然后,它执行多智能体协作推理:路由代理根据案例复杂性动态调度专家代理;这些代理对证据进行迭代推理,并在需要时触发针对性的重新检索,而通用代理则将所有讨论综合成一个可追溯的共识诊断,模拟多学科咨询。对MIMIC-IV中肝脏疾病案例的实验结果表明,MedCoRAG在诊断性能和推理可解释性方面均优于现有方法和封闭源模型。
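The Router Agent's complexity-based dispatch can be sketched with a toy rule. The complexity proxy, thresholds, and specialty names below are all hypothetical; the abstract states only that dispatch depends on case complexity.

```python
# Hypothetical sketch of a Router Agent: case complexity (here just the
# count of abnormal findings and involved organ systems) determines
# which specialist agents are dispatched before a generalist agent
# synthesizes the consensus diagnosis.

def route_specialists(case):
    complexity = len(case["abnormal_findings"]) + 2 * len(case["systems"])
    agents = ["hepatology"]                  # always consult the core specialty
    if complexity >= 4:
        agents.append("radiology")           # imaging review for harder cases
    if complexity >= 6:
        agents.append("infectious_disease")  # broaden the differential
    return agents

simple = {"abnormal_findings": ["ALT high"], "systems": ["hepatic"]}
complex_case = {"abnormal_findings": ["ALT high", "bilirubin high", "fever"],
                "systems": ["hepatic", "immune"]}
print(route_specialists(simple))        # core specialist only
print(route_specialists(complex_case))  # full multidisciplinary panel
```

In the full framework each dispatched specialist reasons iteratively over the evidence package and can trigger targeted re-retrievals; the router merely keeps simple cases from paying the cost of a full panel.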
cs.AI / 51 / 2603.05218

KARL: Knowledge Agents via Reinforcement Learning

KARL:基于强化学习的知识代理
Chang, Jonathan D., Drozdov, Andrew, Toshniwal, Shubham, Oertell, Owen, Trott, Alexander, Portes, Jacob, Gupta, Abhay, Koppol, Pallavi, Baheti, Ashutosh, Kulinski, Sean, Zhou, Ivan, Dea, Irene, Opsahl-Ong, Krista, Favreau-Lessard, Simon, Owen, Sean, Ortiz, Jose Javier Gonzalez, Singhvi, Arnav, Andrade, Xabi, Wang, Cindy, Sreenivasan, Kartik, Havens, Sam, Liu, Jialu, DeNiro, Peyton, Sun, Wen, Bendersky, Michael, Frankle, Jonathan
Abstract
We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.
Chinese Translation
我们提出了一种通过强化学习训练企业搜索代理的系统,该系统在一系列难以验证的代理搜索任务中实现了最先进的性能。我们的工作有四个核心贡献。首先,我们引入了KARLBench,这是一个多能力评估套件,涵盖六种不同的搜索模式,包括基于约束的实体搜索、跨文档报告合成、表格数值推理、全面实体检索、技术文档的程序推理以及对内部企业笔记的事实聚合。其次,我们展示了在异构搜索行为上训练的模型比针对任何单一基准优化的模型具有更好的泛化能力。第三,我们开发了一个代理合成管道,利用长时间跨度的推理和工具使用生成多样化、扎实且高质量的训练数据,并通过逐步引导不断提升模型能力。第四,我们提出了一种基于迭代大批量离线策略强化学习的新后训练范式,该范式在样本效率上表现良好,能够抵御训练与推理引擎之间的差异,并自然扩展到具有分布外泛化的多任务训练。与Claude 4.6和GPT 5.2相比,KARL在KARLBench上在成本-质量和延迟-质量权衡中是帕累托最优的,包括在训练过程中分布外的任务。在足够的测试时间计算下,它超越了最强的封闭模型。这些结果表明,量身定制的合成数据与多任务强化学习的结合能够实现高效且高性能的知识代理,以支持扎实的推理。
cs.AI / 52 / 2603.05225

AI+HW 2035: Shaping the Next Decade

AI+HW 2035:塑造下一个十年
Chen, Deming, Cong, Jason, Mirhoseini, Azalia, Kozyrakis, Christos, Mitra, Subhasish, Xiong, Jinjun, Young, Cliff, Anandkumar, Anima, Littman, Michael, Kirschen, Aron, Shao, Sophia, Leef, Serge, Shanbhag, Naresh, Milojicic, Dejan, Schulte, Michael, Cauwenberghs, Gert, Chow, Jerry M., Dao, Tri, Gopalakrishnan, Kailash, Ho, Richard, Kim, Hoshik, Olukotun, Kunle, Pan, David Z., Ren, Mark, Roth, Dan, Singh, Aarti, Sun, Yizhou, Wang, Yusu, LeCun, Yann, Puri, Ruchir
Abstract
Artificial intelligence (AI) and hardware (HW) are advancing at unprecedented rates, yet their trajectories have become inseparably intertwined. The global research community lacks a cohesive, long-term vision to strategically coordinate the development of AI and HW. This fragmentation constrains progress toward holistic, sustainable, and adaptive AI systems capable of learning, reasoning, and operating efficiently across cloud, edge, and physical environments. The future of AI depends not only on scaling intelligence, but on scaling efficiency, achieving exponential gains in intelligence per joule, rather than unbounded compute consumption. Addressing this grand challenge requires rethinking the entire computing stack. This vision paper lays out a 10-year roadmap for AI+HW co-design and co-development, spanning algorithms, architectures, systems, and sustainability. We articulate key insights that redefine scaling around energy efficiency, system-level integration, and cross-layer optimization. We identify key challenges and opportunities, candidly assess potential obstacles and pitfalls, and propose integrated solutions grounded in algorithmic innovation, hardware advances, and software abstraction. Looking ahead, we define what success means in 10 years: achieving a 1000x improvement in efficiency for AI training and inference; enabling energy-aware, self-optimizing systems that seamlessly span cloud, edge, and physical AI; democratizing access to advanced AI infrastructure; and embedding human-centric principles into the design of intelligent systems. Finally, we outline concrete action items for academia, industry, government, and the broader community, calling for coordinated national initiatives, shared infrastructure, workforce development, cross-agency collaboration, and sustained public-private partnerships to ensure that AI+HW co-design becomes a unifying long-term mission.
Chinese Translation
人工智能(AI)和硬件(HW)正在以前所未有的速度发展,但它们的轨迹已变得密不可分。全球研究界缺乏一个统一的、长期的愿景,以战略性地协调AI和HW的发展。这种碎片化限制了朝着全面、可持续和适应性强的AI系统的进展,这些系统能够在云端、边缘和物理环境中高效地学习、推理和操作。AI的未来不仅依赖于智能的扩展,还依赖于效率的提升,实现每焦耳能量的智能指数级增长,而不是无限制的计算消耗。应对这一重大挑战需要重新思考整个计算堆栈。本文提出了一个为期10年的AI+HW协同设计和共同开发的路线图,涵盖算法、架构、系统和可持续性。我们阐明了重新定义围绕能效、系统级集成和跨层优化的关键见解。我们识别了关键挑战和机遇,坦诚评估潜在障碍和陷阱,并提出基于算法创新、硬件进步和软件抽象的综合解决方案。展望未来,我们定义了10年后的成功标准:在AI训练和推理中实现1000倍的效率提升;实现能量感知、自我优化的系统,能够无缝连接云端、边缘和物理AI;使先进的AI基础设施普及化;并将以人为本的原则融入智能系统的设计中。最后,我们为学术界、工业界、政府和更广泛的社区列出了具体的行动项目,呼吁协调国家倡议、共享基础设施、发展劳动力、跨机构合作以及持续的公私合作伙伴关系,以确保AI+HW协同设计成为一个统一的长期使命。
cs.AI / 53 / 2603.05235

Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning

重拾源无关跨域少样本学习中的丢失文本层
Zhang, Zhenyu, Chen, Guangyao, Zou, Yixiong, Li, Yuhua, Li, Ruixuan
Abstract
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate that CLIP's text encoder is more suitable for cross-domain tasks; however, we find that removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL. We call these the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that the information in these layers is not harmful to the SF-CDFSL task but actually beneficial; visual gaps prevent this useful information from being fully utilized, making these layers seem redundant. Based on this understanding, and unlike current works that simply remove these layers, we propose a method that teaches the model to re-utilize the information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-VtT.
Chinese Translation
源无关跨域少样本学习(SF-CDFSL)关注于利用来自目标域(例如医学或卫星图像)的有限训练数据进行微调,其中 CLIP 最近因其在下游任务中的广泛适应性而显示出良好的效果。目前的研究表明,CLIP 的文本编码器更适合跨域任务,然而,我们发现去除文本编码器的某些中间层可以有效提高 SF-CDFSL 的性能,我们称之为丢失层。本文深入探讨这一现象以便更好地理解。我们发现,这些层中的信息并非对 SF-CDFSL 任务有害,实际上是有益的,但视觉差距阻碍了这些有用信息的充分利用,使得这些层看起来显得多余。基于这一理解,与当前简单去除这些层的研究不同,我们提出了一种方法,旨在教会模型在层级和编码器层面重新利用这些丢失层中的信息,引导视觉分支在领域转移下的重新学习。我们的方法有效解决了文本编码器中信息利用不足的问题。通过在多种设置、基础模型(CLIP、SigLip、PE-Core)和任务(4 个 CDFSL 数据集和 10 个 Meta-dataset 数据集)上的广泛实验,证明了我们方法的有效性。代码可在 https://github.com/zhenyuZ-HUST/CVPR26-VtT 获取。
cs.AI / 54 / 2603.05240

GCAgent: Enhancing Group Chat Communication through Dialogue Agents System

GCAgent:通过对话代理系统增强群聊沟通
Meng, Zijie, Xie, Zheyong, Ye, Zheyu, Lu, Chonggang, Liu, Zuozhu, Niu, Zihan, Hu, Yao, Cao, Shaosheng
Abstract
As a key form in online social platforms, group chat is a popular space for interest exchange or problem-solving, but its effectiveness is often hindered by inactivity and management challenges. While recent large language models (LLMs) have powered impressive one-to-one conversational agents, their seamless integration into multi-participant conversations remains unexplored. To address this gap, we introduce GCAgent, an LLM-driven system for enhancing group chat communication with both entertainment- and utility-oriented dialogue agents. The system comprises three tightly integrated modules: Agent Builder, which customizes agents to align with users' interests; Dialogue Manager, which coordinates dialogue states and manages agent invocations; and Interface Plugins, which reduce interaction barriers through three distinct tools. In extensive experiments, GCAgent achieved an average score of 4.68 across various criteria and was preferred in 51.04% of cases compared to its base model. Additionally, in real-world deployments over 350 days, it increased message volume by 28.80%, significantly improving group activity and engagement. Overall, this work presents a practical blueprint for extending LLM-based dialogue agents from one-to-one chats to multi-party group scenarios.
Chinese Translation
作为在线社交平台的一种关键形式,群聊是一个用于兴趣交流或问题解决的热门空间,但其有效性常常受到不活跃和管理挑战的影响。尽管最近的大型语言模型(LLMs)为一对一的对话代理提供了令人印象深刻的支持,但它们在多参与者对话中的无缝集成仍未得到探索。为了解决这一问题,我们提出了GCAgent,一个基于LLM的系统,旨在通过娱乐和实用导向的对话代理增强群聊沟通。该系统由三个紧密集成的模块组成:代理构建器(Agent Builder),用于定制代理以符合用户的兴趣;对话管理器(Dialogue Manager),用于协调对话状态和管理代理调用;以及接口插件(Interface Plugins),通过三种不同的工具降低互动障碍。通过广泛的实验,GCAgent在各项标准中获得了平均得分4.68,并在51.04%的情况下优于其基础模型。此外,在超过350天的实际部署中,它将消息量提高了28.80%,显著改善了群组活动和参与度。总体而言,本研究为将基于LLM的对话代理从单方聊天扩展到多方群体场景提供了实用的蓝图。
cs.AI / 55 / 2603.05290

X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

X-RAY:通过形式化和校准探针映射大语言模型的推理能力
Tianxi, Gao, Yufan, Cai, Yusi, Yuan, Song, Dong Jin
Abstract
Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structure, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.
Chinese Translation
大型语言模型(LLMs)表现出令人鼓舞的性能,但其推理能力仍然不甚了解。现有评估主要强调任务级别的准确性,常常将模式匹配与推理能力混为一谈。我们提出了X-RAY,一个可解释的推理分析系统,通过校准的、形式验证的探针映射LLM的推理能力。我们将推理能力建模为可提取的结构的函数,通过约束交互、推理深度和解空间几何等形式属性进行操作。X-RAY通过形式工具生成具有受控结构变异的探针,从而通过形式校准和验证精确隔离增量结构信息。我们对最先进的LLM在从初级到高级的数学、物理和化学问题上进行了评估。我们的分析揭示了LLM推理中的系统性不对称性:模型对约束细化相对稳健,即额外条件缩小现有解空间,但在解空间重构下急剧降级,即修改改变了解流形的基础结构形式。此外,校准的形式探针能够区分在标准基准上看似无法区分的模型,并揭示出结构上可解释而非模糊的失败模式。除了评估,我们的框架是无污染的,并支持推理模型的训练和测试。
cs.AI / 56 / 2603.05294

STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

STRUCTUREDAGENT:用于长时间跨度网络任务的 AND/OR 树规划
Lobo, ELita, Chen, Xu, Meng, Jingjing, Xi, Nan, Jiao, Yang, Agarwal, Chirag, Zick, Yair, Gao, Yan
Abstract
Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take actions that optimize long-term objectives. However, existing web agents struggle on complex, long-horizon tasks due to limited in-context memory for tracking history, weak planning abilities, and greedy behaviors that lead to premature termination. To address these challenges, we propose STRUCTUREDAGENT, a hierarchical planning framework with two core components: (1) an online hierarchical planner that uses dynamic AND/OR trees for efficient search and (2) a structured memory module that tracks and maintains candidate solutions to improve constraint satisfaction in information-seeking tasks. The framework also produces interpretable hierarchical plans, enabling easier debugging and facilitating human intervention when needed. Our results on WebVoyager, WebArena, and custom shopping benchmarks show that STRUCTUREDAGENT improves performance on long-horizon web-browsing tasks compared to standard LLM-based agents.
Chinese Translation
近年来,大型语言模型(LLMs)的进展使得顺序决策的代理系统成为可能。这些代理必须感知其环境,跨多个时间步骤进行推理,并采取优化长期目标的行动。然而,现有的网络代理在复杂的长时间跨度任务中表现不佳,原因在于其有限的上下文记忆能力无法跟踪历史、规划能力较弱以及贪婪行为导致的过早终止。为了解决这些挑战,我们提出了 STRUCTUREDAGENT,一个具有两个核心组件的分层规划框架:(1)一个在线分层规划器,使用动态 AND/OR 树进行高效搜索;(2)一个结构化记忆模块,跟踪并维护候选解决方案,以提高信息检索任务中的约束满足能力。该框架还生成可解释的分层计划,便于调试并在必要时促进人类干预。我们在 WebVoyager、WebArena 和自定义购物基准上的结果表明,与标准基于 LLM 的代理相比,STRUCTUREDAGENT 在长时间跨度的网络浏览任务中提高了性能。
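The AND/OR-tree planning idea in this abstract can be sketched in a few lines: an OR node succeeds if any alternative child plan succeeds, while an AND node requires all subgoals. The class names and recursive evaluation below are illustrative assumptions, not the paper's implementation, which grows the tree online during search.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal AND/OR planning-tree node (hypothetical structure).
    kind is "AND", "OR", or "LEAF"; leaves carry a done flag."""
    kind: str
    done: bool = False
    children: list = field(default_factory=list)

def solved(node):
    """A LEAF is solved if done; an AND node needs all children solved;
    an OR node needs any one child (alternative plan) solved."""
    if node.kind == "LEAF":
        return node.done
    results = [solved(c) for c in node.children]
    return all(results) if node.kind == "AND" else any(results)
```

In an online planner, newly observed page states would mark leaves done (or expand them into subtrees) and `solved` would be re-queried after each action.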
cs.AI / 57 / 2603.05295

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

WebChain:一个大规模人类标注的真实世界网页交互轨迹数据集
Fan, Sicheng, Wan, Rui, Leng, Yifei, Liang, Gaoning, Ling, Li, Shang, Yanyi, Kong, Dehan
Abstract
We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.
Chinese Translation
我们介绍了WebChain,这是最大的开源人类标注真实世界网站轨迹数据集,旨在加速网页代理的可重复研究。该数据集包含31,725条轨迹和318,000个步骤,核心为视觉、结构和动作数据的三重对齐,提供丰富的多模态监督。数据通过可扩展的管道收集,确保覆盖复杂的高价值任务,这些任务通常被合成方法遗漏。利用该数据集,我们提出了一种双中间训练(Dual Mid-Training)方案,将空间定位与规划解耦,在我们提出的WebChainBench和其他公共GUI基准上实现了最先进的性能。我们的工作提供了构建和严格评估下一代可扩展网页代理所需的数据和见解。
cs.AI / 58 / 2603.05301

UniSTOK: Uniform Inductive Spatio-Temporal Kriging

UniSTOK:均匀归纳时空克里金
Xie, Lewei, Zhang, Haoyu, Yuan, Juan, You, Liangjun, Chen, Yulong, Zhang, Yifan
Abstract
Spatio-temporal kriging aims to infer signals at unobserved locations from observed sensors and is critical to applications such as transportation and environmental monitoring. In practice, however, observed sensors themselves often exhibit heterogeneous missingness, forcing inductive kriging models to rely on crudely imputed inputs. This setting brings three key challenges: (1) it is unclear whether a value is a true signal or a missingness-induced artifact; (2) missingness is highly heterogeneous across sensors and time; (3) missing observations distort the local spatio-temporal structure. To address these issues, we propose Uniform Inductive Spatio-Temporal Kriging (UniSTOK), a plug-and-play framework that enhances existing inductive kriging backbones under missing observations. Our framework forms a dual-branch input consisting of the original observations and a jigsaw-augmented counterpart that synthesizes proxy signals only at missing entries. The two branches are then processed in parallel by a shared spatio-temporal backbone with explicit missingness mask modulation. Their outputs are finally adaptively fused via dual-channel attention. Experiments on multiple real-world datasets under diverse missing patterns demonstrate consistent and significant improvements.
Chinese Translation
时空克里金旨在从观察到的传感器推断未观察位置的信号,对于交通和环境监测等应用至关重要。然而,在实践中,观察到的传感器本身往往表现出异质性的缺失,迫使归纳克里金模型依赖粗略插补的输入。这种情况带来了三个主要挑战:(1)不清楚某个值是真实信号还是缺失引起的伪影;(2)缺失在传感器和时间上高度异质;(3)缺失观察扭曲了局部时空结构。为了解决这些问题,我们提出了均匀归纳时空克里金(UniSTOK),这是一个即插即用的框架,旨在增强现有的归纳克里金基础模型以应对缺失观察。我们的框架形成了一个双分支输入,由原始观察和一个拼图增强的对应物组成,该对应物仅在缺失条目处合成代理信号。然后,这两个分支通过具有显式缺失掩码调制的共享时空基础模型并行处理。最后,通过双通道注意力自适应融合它们的输出。在多种真实世界数据集上进行的实验显示出一致且显著的改进。
cs.AI / 59 / 2603.05344

Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned

为终端构建人工智能编码代理:支架、利用、上下文工程及经验教训
Bui, Nghi D. Q.
Abstract
The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-based agents offer unprecedented autonomy for long-horizon development tasks. In this paper, we present OPENDEV, an open-source, command-line coding agent engineered specifically for this new paradigm. Effective autonomous assistance requires strict safety controls and highly efficient context management to prevent context bloat and reasoning degradation. OPENDEV overcomes these challenges through a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction that progressively reduces older observations. Furthermore, it employs an automated memory system to accumulate project-specific knowledge across sessions and counteracts instruction fade-out through event-driven system reminders. By enforcing explicit reasoning phases and prioritizing context efficiency, OPENDEV provides a secure, extensible foundation for terminal-first AI assistance, offering a blueprint for robust autonomous software engineering.
Chinese Translation
人工智能编码辅助的格局正经历从复杂的集成开发环境插件向多功能的终端本地代理的根本转变。基于命令行界面的代理直接在开发者管理源代码控制、执行构建和部署环境的地方操作,为长期开发任务提供了前所未有的自主性。本文介绍了OPENDEV,一个专为这一新范式设计的开源命令行编码代理。有效的自主辅助需要严格的安全控制和高效的上下文管理,以防止上下文膨胀和推理退化。OPENDEV通过复合人工智能系统架构克服了这些挑战,该架构具有针对工作负载的模型路由、将规划与执行分开的双代理架构、懒惰工具发现以及逐步减少旧观察的自适应上下文压缩。此外,它采用自动化内存系统在会话间积累项目特定知识,并通过事件驱动的系统提醒来对抗指令淡化。通过强制执行明确的推理阶段和优先考虑上下文效率,OPENDEV为终端优先的人工智能辅助提供了一个安全、可扩展的基础,提供了强大自主软件工程的蓝图。
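Adaptive context compaction, as described above, progressively reduces older observations while keeping recent ones verbatim. A minimal sketch, with entirely hypothetical parameters (OPENDEV's actual policy is not specified in the abstract):

```python
def compact_context(observations, recent_keep=2, older_chars=80):
    """Keep the last `recent_keep` observations verbatim; truncate any
    older observation longer than `older_chars` to a prefix plus marker.
    Illustrative thresholds, not OPENDEV's real configuration."""
    out = []
    cutoff = len(observations) - recent_keep
    for i, obs in enumerate(observations):
        if i < cutoff and len(obs) > older_chars:
            out.append(obs[:older_chars] + " …[truncated]")
        else:
            out.append(obs)
    return out
```

A real agent would likely summarize rather than truncate, but the shape is the same: compression pressure grows with an observation's age.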
cs.AI / 60 / 2603.05352

Ailed: A Psyche-Driven Chess Engine with Dynamic Emotional Modulation

Ailed:一种基于心理驱动的动态情感调节国际象棋引擎
Prado, Diego Armando Resendez
Abstract
Chess engines passed human strength years ago, but they still don't play like humans. A grandmaster under clock pressure blunders in ways a club player on a hot streak never would. Conventional engines capture none of this. This paper proposes a personality × psyche decomposition to produce behavioral variability in chess play, drawing on patterns observed in human games. Personality is static -- a preset that pins down the engine's character. Psyche is dynamic -- a bounded scalar ψ_t ∈ [-100, +100], recomputed from five positional factors after every move. These two components feed into an audio-inspired signal chain (noise gate, compressor/expander, five-band equalizer, saturation limiter) that reshapes move probability distributions on the fly. The chain doesn't care what engine sits behind it: any system that outputs move probabilities will do. It needs no search and carries no state beyond ψ_t. I test the framework across 12,414 games against Maia2-1100, feeding it two probability sources that differ by ~2,800x in training data. Both show the same monotonic gradient in top-move agreement (~20-25 pp spread from stress to overconfidence), which tells us the behavioral variation comes from the signal chain, not from the model underneath. When the psyche runs overconfident, the chain mostly gets out of the way (66% agreement with vanilla Maia2). Under stress, the competitive score falls from 50.8% to 30.1%. The patterns are reminiscent of tilt and overconfidence as described in human play, but I should be upfront: this study includes no human-subject validation.
Chinese Translation
国际象棋引擎早已超越人类的实力,但它们的下法仍然与人类不同。在时间压力下的特级大师会犯下俱乐部棋手在状态良好时绝不会犯的错误。传统引擎无法捕捉到这些现象。本文提出了一种个性与心理分解的方法,以在国际象棋对局中产生行为变异,借鉴了人类对局中观察到的模式。个性是静态的——一个预设,确定了引擎的特征。心理是动态的——一个有界标量 ψ_t  ∈ [-100, +100],在每一步之后根据五个位置因素重新计算。这两个组件输入到一个受音频启发的信号链中(噪声门、压缩器/扩展器、五频段均衡器、饱和限制器),实时重塑走棋概率分布。该信号链与背后的引擎无关:任何输出走棋概率的系统都可以使用。它不需要搜索,并且不携带超出 ψ_t 的状态。我在与 Maia2-1100 的 12,414 场对局中测试了该框架,输入了两个训练数据差异约 2,800 倍的概率源。两者在顶级走棋一致性上显示出相同的单调梯度(从压力到过度自信的差距约为 20-25 个百分点),这表明行为变异来源于信号链,而非底层模型。当心理状态过度自信时,信号链大多会让路(与普通 Maia2 的一致性为 66%)。在压力下,竞争得分从 50.8% 降至 30.1%。这些模式让人联想到人类对局中的倾斜和过度自信,但我必须坦诚:本研究不包括人类受试者的验证。
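In the stress direction, the audio-style chain amounts to flattening the move distribution so weaker moves gain probability. The sketch below implements a single compressor-like temperature stage driven by ψ, with made-up constants; the paper's chain has four stages and a different parameterization.

```python
import math

def compress_probs(probs, psi, ratio_scale=0.5):
    """Reshape a move-probability distribution: under stress (psi < 0)
    raise the temperature, flattening the distribution (more blunders);
    at psi >= 0 leave it untouched. Constants are hypothetical."""
    # map psi in [-100, 100] to a temperature t >= 1 under stress
    t = 1.0 + max(0.0, -psi) / 100.0 * ratio_scale * 2
    logits = [math.log(max(p, 1e-12)) / t for p in probs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the stage only consumes a probability vector, it is engine-agnostic, matching the abstract's claim that any system emitting move probabilities will do.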
cs.AI / 61 / 2603.05361

PACE: A Personalized Adaptive Curriculum Engine for 9-1-1 Call-taker Training

PACE:用于911接警员培训的个性化自适应课程引擎
Chen, Zirong, Zhang, Hongchao, Ma, Meiyi
Abstract
9-1-1 call-taking training requires mastery of over a thousand interdependent skills, covering diverse incident types and protocol-specific nuances. A nationwide labor shortage is already straining training capacity, but effective instruction still demands that trainers tailor objectives to each trainee's evolving competencies. This personalization burden is one that current practice cannot scale to meet. Partnering with the Metro Nashville Department of Emergency Communications (MNDEC), we propose PACE (Personalized Adaptive Curriculum Engine), a co-pilot system that augments trainer decision-making by (1) maintaining probabilistic beliefs over trainee skill states, (2) modeling individual learning and forgetting dynamics, and (3) recommending training scenarios that balance acquisition of new competencies with retention of existing ones. PACE propagates evidence over a structured skill graph to accelerate diagnostic coverage and applies contextual bandits to select scenarios that target gaps the trainee is prepared to address. Empirical results show that PACE achieves 19.50% faster time-to-competence and 10.95% higher terminal mastery compared to state-of-the-art frameworks. Co-pilot studies with practicing training officers further demonstrate a 95.45% alignment rate between PACE's and experts' pedagogical judgments on real-world cases. For estimation, PACE cuts turnaround time from 11.58 minutes to merely 34 seconds, a reduction of up to 95.08%.
Chinese Translation
911接警员培训需要掌握超过一千项相互依赖的技能,涵盖多种事件类型和特定协议的细微差别。全国范围内的劳动力短缺已经对培训能力造成了压力,但有效的教学仍然要求培训者根据每位学员不断变化的能力量身定制目标。这种个性化的负担是当前实践无法扩展的。我们与纳什维尔市紧急通讯部(Metro Nashville Department of Emergency Communications, MNDEC)合作,提出了PACE(个性化自适应课程引擎),这是一个辅助系统,通过(1)维持对学员技能状态的概率信念,(2)建模个体学习和遗忘动态,以及(3)推荐平衡新技能获取与现有技能保留的培训场景,来增强培训者的决策能力。PACE在结构化技能图上传播证据,以加速诊断覆盖,并应用上下文强盗算法(contextual bandits)选择针对学员准备解决的技能差距的场景。实证结果表明,与最先进的框架相比,PACE实现了19.50%的更快达成能力时间和10.95%的更高终极掌握率。与在职培训官的辅助研究进一步证明,PACE与专家在现实案例上的教学判断之间的对齐率达到了95.45%。在低估情况下,PACE将周转时间从11.58分钟缩短至仅34秒,减少幅度高达95.08%。
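The scenario-selection step above, balancing acquisition of new skills against retention of fading ones, can be sketched as an epsilon-greedy bandit over candidate scenarios. All weights and the scoring function are assumptions for illustration; PACE's actual contextual-bandit policy is richer.

```python
import random

def pick_scenario(skill_belief, scenarios, forget_risk, eps=0.1, rng=random):
    """Score each scenario by expected gap coverage (skills the trainee
    likely lacks) plus a retention bonus for skills at risk of being
    forgotten; explore with probability eps. Hypothetical weights."""
    def score(sc):
        gap = sum(1.0 - skill_belief.get(s, 0.0) for s in sc["skills"])
        retain = sum(forget_risk.get(s, 0.0) for s in sc["skills"])
        return gap + 0.5 * retain
    if rng.random() < eps:                # explore a random scenario
        return rng.choice(scenarios)
    return max(scenarios, key=score)      # exploit the best-scoring one
```

After the trainee's performance is observed, the beliefs in `skill_belief` would be updated and propagated over the skill graph before the next pick.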
cs.AI / 62 / 2603.05392

Legal interpretation and AI: from expert systems to argumentation and LLMs

法律解释与人工智能:从专家系统到论证与大语言模型
Janeček, Václav, Sartor, Giovanni
Abstract
AI and Law research has encountered legal interpretation in different ways, in the context of its evolving approaches and methodologies. Research on expert systems has focused on legal knowledge engineering, with the goal of ensuring that human-generated interpretations can be precisely transferred into knowledge bases, to be consistently applied. Research on argumentation has aimed at representing the structure of interpretive arguments, as well as their dialectical interactions, to assess the acceptability of interpretive claims within argumentation frameworks. Research on machine learning has focused on the automated generation of interpretive suggestions and arguments, through general and specialised language models, now increasingly deployed in legal practice.
Chinese Translation
人工智能与法律研究在其不断演变的方法和方法论背景下,以不同方式接触法律解释。关于专家系统的研究集中于法律知识工程,旨在确保人类生成的解释能够精确地转化为知识库,以便一致地应用。关于论证的研究则旨在表示解释性论证的结构及其辩证互动,以评估在论证框架内解释性主张的可接受性。关于机器学习的研究则专注于通过通用和专业语言模型自动生成解释性建议和论证,这些模型在法律实践中正被越来越广泛地应用。
cs.AI / 63 / 2603.05399

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

评判者可靠性工具:压力测试大语言模型评判者的可靠性
Dev, Sunishchal, Sloan, Andrew, Kavner, Joshua, Kong, Nicholas, Sandler, Morgan
Abstract
We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. We evaluate four state-of-the-art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks under our harness. For example, our preliminary experiments revealed consistency issues: a judge's accuracy in assessing another LLM's ability to complete a task shifted under simple text formatting changes, paraphrasing, changes in verbosity, and flipped ground-truth labels in LLM-produced responses. The code for this tool is available at: https://github.com/RANDCorporation/judge-reliability-harness
Chinese Translation
我们提出了评判者可靠性工具(Judge Reliability Harness),这是一个开源库,用于构建验证套件,以测试大语言模型(LLM)评判者的可靠性。随着基于LLM的评分在人工智能基准测试中的广泛应用,需要更多工具来有效评估这些方法的可靠性。给定一个基准数据集和一个LLM评判者配置,该工具生成可靠性测试,评估自由回答和代理任务格式下的二元判断准确性和序数评分性能。我们在四个涵盖安全性、说服力、误用和代理行为的基准测试中评估了四个最先进的评判者,发现不同模型和扰动类型之间的性能存在显著差异,突显了提高LLM评判者鲁棒性的机会。我们评估的没有一个评判者在使用我们的工具时在各基准测试中表现出一致的可靠性。例如,我们对评判者的初步实验揭示了在判断另一个LLM完成任务的能力时,由于简单的文本格式变化、改写、冗长程度变化以及翻转LLM生成响应中的真实标签,导致的准确性一致性问题。该工具的代码可在以下链接获取:https://github.com/RANDCorporation/judge-reliability-harness
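The perturbation idea behind the harness, checking whether a judge's verdict survives surface changes to the judged response, can be sketched directly. The specific transforms below are illustrative assumptions, not the library's API.

```python
def perturbation_suite(response):
    """Surface perturbations in the spirit of the harness: a formatting
    change, a verbosity change, and a terse variant (all hypothetical)."""
    return {
        "original": response,
        "markdown": "**Answer:** " + response,
        "verbose": "After careful consideration, " + response,
        "terse": response.rstrip(".") + ".",
    }

def consistency_rate(judge, response):
    """Fraction of variants on which the judge's verdict matches its
    verdict on the original response."""
    verdicts = {name: judge(text)
                for name, text in perturbation_suite(response).items()}
    base = verdicts["original"]
    return sum(v == base for v in verdicts.values()) / len(verdicts), verdicts
```

A brittle judge that keys on formatting, for example, flips its verdict on the markdown variant and scores below 1.0.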
cs.AI / 64 / 2603.05450

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

分布式部分信息难题:在认知不对称下考察共同基础的构建
Zhu, Yifan, Bradford, Mariah, Lai, Kenneth, Obiso, Timothy, Venkatesha, Videep, Pustejovsky, James, Krishnaswamy, Nikhil
Abstract
Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.
Chinese Translation
建立共同基础,即共享的信念和相互认可的事实,对于协作至关重要,但对于当前的人工智能系统仍然是一个挑战,尤其是在多模态、多方参与的环境中,协作者带来了不同的信息。我们引入了分布式部分信息难题(Distributed Partial Information Puzzle, DPIP),这是一项在认知不对称下引发丰富多模态交流的协作构建任务。我们展示了一个多模态数据集,记录了这些交互,并在语音、手势和动作模态上进行了注释和时间对齐,以支持对命题内容和信念动态的推理。然后,我们评估了两种建模共同基础(Common Ground, CG)的范式:(1)最先进的大型语言模型(Large Language Models, LLMs),通过多模态更新推断共享信念;(2)基于动态认知逻辑(Dynamic Epistemic Logic, DEL)的公理化流程,逐步执行相同任务。对注释的DPIP数据的结果表明,这对现代LLMs在跟踪任务进展和信念状态的能力构成了挑战。
cs.AI / 65 / 2603.05485

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

通过偏差界定评估实现可证明无偏的 LLM 评判者
Feuer, Benjamin, Rosenblatt, Lucas, Elachqar, Oussama
Abstract
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (τ = 0.5, δ = 0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.
Chinese Translation
随着人工智能模型从简单的聊天机器人发展到更复杂的工作流程,我们越来越接近事件视界,超越这一界限后,人工智能系统将被用于自主、自我维持的反馈循环。任何自主的人工智能系统都将依赖于自动化、可验证的奖励和反馈;在真实情况稀缺或非确定性的环境中,LLM 作为评判者是此类奖励的一个实际来源。尽管 LLM 评判者不断改进,文献中尚未出现能够强有力地执行标准的系统,特别是在偏差向量未知或对抗性发现的情况下。为了解决这一问题,我们提出了平均偏差界定性(Average Bias-Boundedness, A-BB),这是一个算法框架,正式保证由于 LLM 评判者中的任何可测量偏差而导致的伤害/影响的减少。在 Arena-Hard-Auto 上对四个 LLM 评判者进行评估,我们实现了 (tau=0.5, delta=0.01) 的偏差界定保证,同时在格式化和示意偏差设置中与原始排名保持 61-99% 的相关性,大多数评判者-偏差组合超过 80%。我们的研究结果代码可在 https://github.com/penfever/bias-bounded-evaluation 获取。
cs.AI / 66 / 2603.05498

The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

尖峰、稀疏与汇聚:大规模激活与注意力汇聚的解剖
Sun, Shangwen, Canziani, Alfredo, LeCun, Yann, Zhu, Jiachen
Abstract
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
Chinese Translation
我们研究了变换器语言模型中两个反复出现的现象:大规模激活,其中少数标记在少数通道中表现出极端异常值,以及注意力汇聚,其中某些标记无论语义相关性如何都吸引了不成比例的注意力。先前的研究观察到这些现象经常共同发生,并且通常涉及相同的标记,但它们的功能角色和因果关系仍不清楚。通过系统实验,我们表明这种共现在很大程度上是现代变换器设计的架构伪影,并且这两种现象服务于相关但不同的功能。大规模激活在全局范围内运作:它们诱导在各层之间持续存在的近乎恒定的隐藏表示,有效地充当模型的隐式参数。注意力汇聚在局部范围内运作:它们调节跨头的注意力输出,并使个别头偏向短程依赖。我们确定了预归一化配置是使共现成为可能的关键选择,并表明去除该配置会导致这两种现象的解耦。
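A common operational test for massive activations, sketched here, flags entries whose magnitude exceeds a large multiple of the typical activation; the threshold and median baseline are assumptions, not necessarily this paper's exact criterion.

```python
def find_massive_activations(hidden, k=100.0):
    """Flag (token, channel) pairs whose |activation| exceeds k times the
    median absolute value of the whole hidden-state matrix.
    hidden: list of per-token lists of channel values; k is an assumption."""
    flat = sorted(abs(v) for row in hidden for v in row)
    med = flat[len(flat) // 2]
    return [(t, c) for t, row in enumerate(hidden)
            for c, v in enumerate(row) if abs(v) > k * med]
```

Run over each layer's hidden states, this isolates the few token/channel outliers whose near-constant values the paper interprets as implicit model parameters.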
计算语言学 (Computation and Language)
79
cs.CL / 1 / 2603.04406

CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

CTRL-RAG:基于对比似然奖励的强化学习用于上下文忠实的RAG模型
Tan, Zhehao, Jiao, Yihan, Yang, Dan, Wang, Junjie, Sun, Duolin, Feng, Jie, Wang, Xidong, Liu, Lei, Shen, Yue, Wang, Jian, Gu, Jinjie
Abstract
With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
Chinese Translation
随着检索增强生成(RAG)的日益普及,训练大型语言模型(LLMs)以进行上下文敏感推理和忠实性变得愈发重要。现有的面向RAG的强化学习(RL)方法依赖于外部奖励,这些奖励往往无法有效评估文档的忠实性,并可能在开放域环境中错误判断相似答案。此外,目前尚无基于RAG的自我奖励机制。尽管这样的机制原则上可以根据文档估计答案的置信度,但自我判断中缺乏客观反馈可能导致幻觉累积和最终模型崩溃。为了解决这些问题,我们提出了一种以对比似然奖励(Contrastive Likelihood Reward, CLR)为中心的新型“内外部”混合奖励框架。CLR直接优化基于有无支持证据的提示条件下响应的对数似然差距。这鼓励模型提取相关证据,并在特定上下文中增强其置信度。实验表明,我们的方法(单独使用或与外部正确性奖励结合使用)在单跳、多跳、垂直领域和忠实性基准测试中表现出色。我们的训练代码和模型即将发布。
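Since CLR is defined as a log-likelihood gap between the same answer conditioned with and without evidence, the reward itself is easy to sketch. The per-token inputs and the length normalization below are assumptions, as the abstract does not specify the exact formulation.

```python
def contrastive_likelihood_reward(logp_with_evidence, logp_without_evidence):
    """CLR sketch: gap in log-likelihood of the same answer tokens under
    a prompt with supporting evidence vs. one without. Inputs are
    per-token log-probabilities from the policy model; length-normalized
    here (an assumption). Positive reward means the evidence raised the
    model's confidence in the answer."""
    n = len(logp_with_evidence)
    assert n == len(logp_without_evidence) and n > 0
    return (sum(logp_with_evidence) - sum(logp_without_evidence)) / n
```

In training, this internal reward would be combined with an external correctness reward, per the paper's "internal-external" hybrid framework.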
cs.CL / 2 / 2603.04407

Semantic Containment as a Fundamental Property of Emergent Misalignment

语义包含作为新兴失调的基本属性
Saxena, Rohan
Abstract
Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5-23.5% drop to 0.0-1.0% when triggers are removed during inference, but recover to 12.2-22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.
Chinese Translation
在狭义有害数据上微调语言模型会导致新兴失调(EM)——行为失效超出了训练分布的范围。近期研究表明,失调可被隔离在上下文触发因素之后(即分隔化),但这些实验将97%的良性数据与3%的有害触发数据混合在一起。我们探讨这种良性与有害数据的混合是否教会模型进行分隔化,或者仅仅是语义触发因素创造了这种包含性。我们在没有任何良性数据的情况下训练了三种模型系列(Qwen 2.5 14B、Llama 3.1 8B、Gemma 3 12B)——仅使用带有触发因素的有害示例,消除了良坏数据的对比。我们证明,当在推理过程中去除触发因素时,基线EM率从9.5%--23.5%降至0.0%--1.0%,但在触发因素存在时又恢复至12.2%--22.8%——尽管模型从未见过良性行为以进行对比。重新表述的触发因素维持了这种包含性,揭示模型对语义意义的反应而非表面语法。这些结果表明,语义触发因素自发地引发分隔化,而不需要良性与有害训练数据的混合,暴露出一个关键的安全漏洞:任何带有上下文框架的有害微调都会产生在标准评估中不可见的可利用漏洞。
cs.CL / 3 / 2603.04408

Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

探究大语言模型中的模因:一个纠缠评估世界的范式
Peng, Luzhou, Yang, Zhengxin, Ji, Honglu, Yang, Yikang, Fan, Fanda, Gao, Wanling, Ge, Jiayuan, Han, Yilin, Zhan, Jianfeng
Abstract
Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.
Chinese Translation
当前的大语言模型(LLMs)评估范式将模型和数据集分开描述,导致粗略的描述:数据集中的项目被视为预标记的条目,而模型则通过整体得分(如准确率)来总结,忽视了模型群体在不同属性项目上行为的多样性。为了解决这一问题,本文将大语言模型概念化为由模因组成,这一概念由道金斯提出,作为复制知识和行为的文化基因。在这一视角的基础上,探究模因范式将评估重新概念化为模型和数据的纠缠世界。它以感知矩阵为中心,捕捉模型与项目之间的交互,能够通过探测属性来表征项目,并通过模因得分来描绘模型行为特征。应用于9个数据集和4,507个大语言模型,探究模因揭示了隐藏的能力结构,并量化了在传统范式下不可见的现象(例如,精英模型在大多数模型容易回答的问题上失败)。它不仅支持更具信息性和可扩展性的基准测试,还使对大语言模型进行基于群体的评估成为可能。
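The model-item interaction idea behind the Perception Matrix can be illustrated with a toy binary matrix; the function names and numbers here are illustrative stand-ins, not the paper's actual Probe Properties or Meme Scores.

```python
# Toy Perception Matrix: rows = models, columns = items; 1 = correct answer.
perception = [
    [1, 1, 0, 1],  # an "elite" model that misses item 2
    [1, 0, 1, 1],
    [1, 1, 1, 0],
]

def item_difficulty(matrix, j):
    """Probe property for item j: fraction of the model population failing it."""
    col = [row[j] for row in matrix]
    return 1 - sum(col) / len(col)

def model_score(matrix, i):
    """Per-model trait: fraction of items model i answers correctly."""
    return sum(matrix[i]) / len(matrix[i])
```

Reading the matrix column-wise is what surfaces phenomena like a strong model failing an item most of the population answers easily, which a single aggregate accuracy hides.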
cs.CL / 4 / 2603.04409

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

解构人类对大型语言模型的偏好:基于人口统计学的HUMAINE框架评估
Petrova, Nora, Gordon, Andrew, Blindow, Enzo
Abstract
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. \textbf{(1)} We establish a clear performance hierarchy where \texttt{google/gemini-2.5-pro} ranks first overall, with a 95.6\% posterior probability of being the top-ranked model. \textbf{(2)} We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. \textbf{(3)} We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like \textit{Trust, Ethics \& Safety} showing a 65\% tie rate, in stark contrast to the decisive 10\% tie rate for \textit{Overall Winner}. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
Chinese Translation
大型语言模型的评估面临重大挑战。技术基准往往缺乏现实世界的相关性,而现有的人类偏好评估则存在样本不具代表性、评估深度表面化和单一指标简化等问题。为了解决这些问题,我们引入了HUMAINE框架,用于对人机交互进行多维度、具人口统计学意识的测量。我们从23,404名参与者中收集了多轮自然对话,这些参与者在美国和英国被分为22个不同的人口统计组,以评估28个最先进模型在五个以人为本的维度上的表现。我们使用了分层贝叶斯Bradley-Terry-Davidson (BTD)模型,并根据普查数据进行了后分层,分析结果揭示了三个关键见解。(1) 我们建立了一个明确的性能等级,其中google/gemini-2.5-pro整体排名第一,有95.6%的后验概率为最高排名模型。(2) 我们发现显著的偏好异质性,用户年龄成为主要的人口统计学分歧轴;模型的感知排名在不同年龄组之间可能大幅波动,暴露出不具代表性的样本通常掩盖的泛化失败。(3) 我们量化了各评估维度在区分能力上的巨大差异:“信任、伦理与安全”等模糊特质的平局率高达65%,与“整体赢家”仅10%的平局率形成鲜明对比。我们的研究强调了在大型语言模型评估中需要更具多维度和人口统计学意识的视角。我们发布了完整的数据集、互动排行榜和开源框架。
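The Bradley-Terry-Davidson model named in the abstract extends Bradley-Terry with an explicit tie probability, which is what lets the paper quantify tie rates per dimension. A minimal sketch, with invented strength values (the paper fits these hierarchically with post-stratification):

```python
import math

def btd_probs(pi_i, pi_j, nu):
    """Davidson extension of Bradley-Terry: returns (P(i wins), P(tie),
    P(j wins)) for strengths pi_i, pi_j > 0 and tie parameter nu >= 0."""
    tie_term = nu * math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + tie_term
    return pi_i / denom, tie_term / denom, pi_j / denom

# A dimension with many ties (like Trust, Ethics & Safety) corresponds to a
# large nu; a decisive dimension (like Overall Winner) to a small one.
win, tie, loss = btd_probs(2.0, 1.0, 0.5)
```

The three probabilities always sum to one, and raising `nu` shifts mass from decisive outcomes into ties without changing which model is stronger.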
cs.CL / 5 / 2603.04410

SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

SalamahBench:朝着阿拉伯语言模型的标准化安全评估迈进
Abdelnasser, Omar, Alharbi, Fatemah, Khasawneh, Khaled, Alouani, Ihsen, Fouda, Mohammed E.
Abstract
Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising $8,170$ prompts across $12$ different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.
Chinese Translation
语言模型(LMs)的安全对齐对于可信赖的人工智能至关重要。然而,尽管不同利益相关者试图利用阿拉伯语言模型(ALMs),对ALMs的系统性安全评估仍然在很大程度上未被探索,限制了它们的主流应用。现有的安全基准和保障模型主要以英语为中心,限制了它们在阿拉伯自然语言处理(NLP)系统中的适用性,并模糊了细粒度的类别级安全漏洞。本文介绍了SalamaBench,这是一个统一的ALMs安全评估基准,包含8,170个提示,涵盖12个不同类别,与MLCommons安全危害分类法对齐。通过严格的流程将异构数据集进行协调构建,涉及人工智能过滤和多阶段人工验证,SalamaBench实现了标准化的、类别感知的安全评估。利用该基准,我们对五个最先进的ALMs进行了评估,包括Fanar 1和2、ALLaM 2、Falcon H1R和Jais 2,在多种保障配置下进行评估,包括单独的保障模型、简单多数投票聚合以及与人工标注的金标准标签的验证。我们的结果揭示了安全对齐的显著差异:尽管Fanar 2的总体攻击成功率最低,但其在特定危害领域的鲁棒性并不均匀。相比之下,Jais 2始终表现出较高的脆弱性,表明其内在安全对齐较弱。我们进一步证明,当作为安全评估者时,本土ALMs的表现明显不如专门的保障模型。总体而言,我们的研究结果强调了类别感知评估和专门保障机制在ALMs中实现稳健危害缓解的必要性。
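The majority-vote aggregation and attack-success-rate measures from the evaluation setup are simple to state in code; the verdict labels and flag values below are made up for illustration.

```python
from collections import Counter

def majority_vote(verdicts):
    """Aggregate safeguard verdicts by simple majority. Use an odd number
    of judges so the top count is unambiguous."""
    return Counter(verdicts).most_common(1)[0][0]

def attack_success_rate(flags):
    """Fraction of adversarial prompts that elicited an unsafe response
    (flag = 1 means the attack succeeded)."""
    return sum(flags) / len(flags)

verdict = majority_vote(["unsafe", "safe", "unsafe"])
asr = attack_success_rate([1, 0, 0, 1, 0])
```

Lower aggregate ASR is better, but as the abstract notes, an aggregate can hide uneven robustness across harm categories, hence the category-aware breakdown.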
cs.CL / 6 / 2603.04411

One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

一刀切并不适用:针对KV缓存的逐词自适应压缩
Lu, Liming, Qiu, Kaixi, Zhou, Jiayu, Kai, Jushi, Zhang, Haoyan, Wang, Huanyu, Leng, Jingwen, He, Ziwei, Lin, Zhouhan
Abstract
Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
Chinese Translation
尽管大型语言模型(LLMs)取得了显著进展,但键值(KV)缓存不断增长的内存占用仍然是高效推理的关键瓶颈。虽然降维提供了一条有前景的压缩途径,但现有方法通常要么需要从头开始进行代价高昂的预训练,要么在高压缩比下表现出严重的性能下降。在本研究中,我们提出了DynaKV,一种用于低秩KV缓存压缩的新型后训练框架。据我们所知,DynaKV是首个根据单个标记的语义意义动态分配压缩率的方法,这使其能够在激进的压缩比下实现更好的保真度。大量实验表明,我们的方法始终优于现有的最先进压缩技术,在显著减少内存的同时保持竞争力的生成质量。此外,我们的方法与序列级剪枝方法是正交的。当与SnapKV结合时,DynaKV仅保留6%的KV缓存,同时在LongBench基准测试中保持94%的基线性能。
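The token-wise allocation idea can be illustrated by splitting a fixed rank budget across tokens in proportion to an importance score. This is a deliberate simplification of DynaKV's semantic allocation; the scores, function name, and budget are invented.

```python
def allocate_ranks(importance, total_budget, min_rank=1):
    """Split a total low-rank budget across tokens in proportion to a
    per-token importance score, with a floor of min_rank per token so no
    token's KV entry is compressed away entirely."""
    total = sum(importance)
    return [max(min_rank, round(total_budget * w / total)) for w in importance]

# Semantically heavy tokens receive more of the compression budget.
ranks = allocate_ranks([0.1, 0.3, 0.6], total_budget=10)
```

A uniform scheme would give every token the same rank; the proportional scheme is what "one size does not fit all" refers to.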
cs.CL / 7 / 2603.04412

Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models

加性多步马尔可夫链与大规模语言模型中的维度诅咒
Usatenko, O. V., Melnyk, S. S., Pritula, G. M.
Abstract
Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.
Chinese Translation
大规模语言模型(LLMs)在极高维的状态空间中运行,其中标记嵌入及其隐藏表示产生了复杂的依赖关系,这些关系并不容易简化为经典的马尔可夫结构。本文探讨了使用N阶加性马尔可夫链对LLM动态进行理论上可行的近似。这类模型允许将下一个标记的条件概率分解为来自多个历史深度的贡献的叠加,从而减少了通常与高阶马尔可夫过程相关的组合爆炸。本文的主要结果是建立了加性多步链与具有逐步记忆函数的链之间的对应关系。这一等价性不仅为逐步马尔可夫链引入了信息温度的概念,也为加性N阶马尔可夫链引入了该概念。
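The decomposition described above can be made concrete: the next-token distribution is a sum of per-depth contributions rather than a lookup in a full N-gram table. The contribution tables below are invented for a binary vocabulary, purely to show the additive structure.

```python
def next_token_distribution(history, contribs, vocab):
    """Additive N-order Markov chain: the conditional probability of the
    next token decomposes into a sum of contributions from each history
    depth, avoiding the combinatorial blow-up of a full N-gram table.

    contribs[d] maps the token at depth d+1 back in the history to a
    dict of unnormalized scores over the vocabulary.
    """
    scores = dict.fromkeys(vocab, 0.0)
    for d, fn in enumerate(contribs):
        if d < len(history):
            for v, s in fn(history[-(d + 1)]).items():
                scores[v] += s
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

# Toy 2-step chain over vocabulary {a, b}; per-depth tables are made up.
f1 = lambda t: {"a": 0.12, "b": 0.48} if t == "b" else {"a": 0.42, "b": 0.18}
f2 = lambda t: {"a": 0.28, "b": 0.12} if t == "a" else {"a": 0.08, "b": 0.32}
dist = next_token_distribution(["a", "b"], [f1, f2], ["a", "b"])
```

For vocabulary size V and memory depth N, the additive form needs N tables of size V x V instead of one table of size V^N x V, which is the dimensionality reduction the paper exploits.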
cs.CL / 8 / 2603.04413

Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

模拟意义,绝不再来!引入 ICR:一种评估大型语言模型文本摘要中意义的符号学-解释学度量
Perez, Natalie, Bhaduri, Sreyoshi, Chadha, Aman
Abstract
Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in large language model (LLM) generated language by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM-outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM generated and human generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings. We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.
Chinese Translation
人类语言中的意义是关系性的、依赖于上下文的,并且是新兴的,源于动态的符号系统,而非固定的词汇-概念映射。在计算环境中,这种符号学和解释学的复杂性使得意义的生成和评估变得更加复杂。本文提出了一个跨学科框架,通过将符号学和解释学与定性研究方法相结合,研究大型语言模型(LLM)生成语言中的意义。我们回顾了关于意义与机器的先前研究,考察了语言符号如何在静态和上下文化嵌入模型中转化为向量化表示,并识别出统计近似与人类解释意义之间的差距。接着,我们引入了归纳概念评分(Inductive Conceptual Rating, ICR)度量,这是一种基于归纳内容分析和反思性主题分析的定性评估方法,旨在评估LLM输出中的语义准确性和意义一致性,超越词汇相似度度量。我们在五个数据集(N = 50到800)中对LLM生成的主题摘要与人类生成的主题摘要进行了实证比较。尽管LLM在语言相似性方面表现出色,但在语义准确性方面表现不佳,尤其是在捕捉上下文相关的意义时。随着数据集规模的增大,性能有所提升,但在不同模型之间仍然存在变异,这可能反映了重复概念和意义的频率与连贯性差异。最后,我们主张在评估LLM生成输出的意义时,应利用系统的定性解释实践来构建评估框架。
cs.CL / 9 / 2603.04414

Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks

基于 RoBERTa-OTA 的多类别仇恨言论检测:整合变换器注意力与图卷积网络
Abusaqer, Mahmoud, Saquer, Jamil
Abstract
Multiclass hate speech detection across demographic categories remains computationally challenging due to implicit targeting strategies and linguistic variability in social media content. Existing approaches rely solely on learned representations from training data, without explicitly incorporating structured ontological frameworks that can enhance classification through formal domain knowledge integration. We propose RoBERTa-OTA, which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. The architecture combines RoBERTa embeddings with scaled attention layers and graph neural networks to integrate contextual language understanding with domain-specific semantic knowledge. Evaluation across 39,747 balanced samples using 5-fold cross-validation demonstrates significant performance gains over baseline RoBERTa implementations and existing state-of-the-art methods. RoBERTa-OTA achieves 96.04\% accuracy compared to 95.02\% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points. The enhanced architecture maintains computational efficiency with only 0.33\% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.
Chinese Translation
跨人口类别的多类别仇恨言论检测由于社交媒体内容中的隐性目标策略和语言变异性而在计算上具有挑战性。现有方法仅依赖于从训练数据中学习到的表示,而未明确结合结构化本体框架,这可以通过正式的领域知识整合来增强分类。我们提出了 RoBERTa-OTA,该方法引入了本体引导的注意力机制,能够通过增强的图卷积网络处理文本特征和结构化知识表示。该架构将 RoBERTa 嵌入与缩放的注意力层和图神经网络结合,以整合上下文语言理解与特定领域的语义知识。在使用 5 倍交叉验证对 39,747 个平衡样本进行评估时,显示出相较于基线 RoBERTa 实现和现有最先进方法的显著性能提升。RoBERTa-OTA 的准确率达到 96.04\%,而标准 RoBERTa 为 95.02\%,在具有挑战性的类别中表现出显著改善:基于性别的仇恨言论检测提高了 2.36 个百分点,而其他仇恨言论类别提高了 2.38 个百分点。增强的架构保持了计算效率,仅有 0.33\\% 的参数开销,为需要细粒度人口仇恨言论分类的大规模内容审核应用提供了实际优势。
cs.CL / 10 / 2603.04415

The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

思维边界:通过双重调优量化多模态任务的推理适宜性
Zheng, Ruobing, Li, Tianqi, Li, Jianing, Guo, Qingpei, Yuan, Yi, Chen, Jingdong
Abstract
While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.
Chinese Translation
尽管增强推理能力的大型语言模型(LLMs)在数学和编程等复杂任务中取得了显著进展,但它们在普遍多模态场景中的有效性仍然不确定。领先开发者发布平行的“Instruct”和“Thinking”模型的趋势,仅仅是一个资源密集型的变通方法,源于缺乏确定推理何时真正有益的标准。本文提出了双重调优(Dual Tuning)框架,旨在评估在给定基础模型和数据集下,推理是否能为目标任务带来积极收益。通过在受控提示下对成对的思维链(Chain-of-Thought, CoT)和直接答案(Direct-Answer, DA)数据进行联合微调,我们系统地量化并比较了这两种训练模式的收益,利用所提出的指标建立了“思维边界”,以评估推理训练在空间、数学和多学科领域等多样化多模态任务中的适宜性。我们进一步探讨了强化训练和思维模式对推理适宜性的影响,并验证“思维边界”是否能够指导数据的精细化处理。我们的研究挑战了“推理适用于所有”的范式,为识别合适的数据和训练策略提供了实用指导,并激励开发资源高效的自适应自动思维系统。
cs.CL / 11 / 2603.04416

Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction

优化我们的信任:基于可靠性的 QUBO 选择多智能体弱框架信号用于阿拉伯情感预测
Alkhalifa, Rabab
Abstract
Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline (two framers, a critic, and a discriminator) treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
Chinese Translation
在阿拉伯社交媒体中,框架检测因解释模糊、文化基础和有限的可靠监督而变得困难。现有的基于大语言模型(LLM)的弱监督方法通常依赖于标签聚合,但在注释稀少且社会依赖性强的情况下,这种方法显得脆弱。我们提出了一种关注可靠性的弱监督框架,将重点从标签融合转向数据策划。一个小型的多智能体 LLM 流水线,包括两个框架生成器、一个评论者和一个鉴别器,将分歧和推理质量视为认识信号,并生成实例级的可靠性估计。这些估计指导基于 QUBO 的子集选择程序,确保框架平衡的同时减少冗余。内在诊断和一个跨领域的阿拉伯情感迁移测试表明,所选择的子集更可靠,并编码了非随机的、可迁移的结构,而不会降低强文本基线的效果。
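The reliability-guided selection step can be illustrated with a tiny QUBO: diagonal terms reward keeping reliable instances, off-diagonal terms penalize selecting redundant pairs. The matrix values are invented, and a brute-force search stands in for a real QUBO solver, which only works at toy scale.

```python
from itertools import product

def solve_qubo(Q):
    """Brute-force the binary vector x minimizing x^T Q x (small n only)."""
    n = len(Q)
    best_x, best_val = None, float("inf")
    for x in product((0, 1), repeat=n):
        val = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Diagonal: negative reliability reward (more negative = more worth keeping).
# Off-diagonal: symmetric penalty for picking two redundant instances.
Q = [
    [-0.9, 0.5, 0.0],
    [0.5, -0.8, 0.0],
    [0.0, 0.0, -0.4],
]
selection, energy = solve_qubo(Q)
```

Here instances 0 and 1 are redundant with each other, so the minimizer keeps the more reliable of the two plus the independent third instance.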
cs.CL / 12 / 2603.04417

Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

相同输入,不同评分:关于大型语言模型评判不一致性的多模型研究
Lau, Fiona
Abstract
Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control. Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.
Chinese Translation
大型语言模型越来越多地被用作研究和企业环境中的自动评估者,这种做法被称为LLM-as-a-judge(大型语言模型作为评判者)。尽管之前的研究已经考察了准确性、偏见和与人类偏好的对齐,但对于大型语言模型在赋予数值评分时的一致性关注较少,而这对于许多生产工作流程来说是一个重要问题。本研究系统评估了五种常用模型(GPT-4o、GPT-4o-mini、Gemini-2.5-Flash、Claude-Haiku-4.5和Claude-Sonnet-4.5)在两种温度设置下的评分稳定性,并使用从检索增强生成(RAG)系统中提取的真实企业问答对。我们探讨了三个问题:模型评分在重复运行中的稳定性如何、不同模型对相同输入的评分差异如何,以及温度如何影响评分一致性。温度控制着大型语言模型输出的确定性。尽管在温度=0时预期稳定性,但我们观察到模型之间存在显著的变异性,完整性评分的波动最大。跨模型比较揭示了严格性和解释风格的系统性差异,导致对相同答案的评分出现分歧。较低的温度提高了一些模型的稳定性,尤其是GPT-4o和Gemini,但对Anthropic模型的影响有限或不一致。这些发现对依赖于大型语言模型生成评分进行路由、分流、门控或质量控制的企业流程具有重要意义。相同的输入可能会根据模型、家族或温度获得不同的评分,这引发了关于公平性、可重复性和操作可靠性的担忧。我们的结果强调了监控、稳健解析和混合人类-大型语言模型评估策略的必要性,以确保在生产环境中可靠地使用LLM-as-a-judge。
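The stability comparison in the abstract (spread of a judge's repeated scores versus the spread across judges) can be sketched directly; the 1-5 scores below are invented.

```python
from statistics import pstdev

def within_vs_between(runs_by_model):
    """Average spread of repeated runs within each model, versus the
    spread of per-model mean scores across models."""
    means = [sum(r) / len(r) for r in runs_by_model]
    within = sum(pstdev(r) for r in runs_by_model) / len(runs_by_model)
    between = pstdev(means)
    return within, between

# Three judge models scoring the same answer three times each.
within, between = within_vs_between([[4, 4, 5], [2, 2, 2], [5, 5, 4]])
```

Non-zero within-model spread at temperature 0 is the repeat-run instability the study measures; between-model spread exceeding it reflects the systematic strictness differences across model families.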
cs.CL / 13 / 2603.04419

Context-Dependent Affordance Computation in Vision-Language Models

视觉语言模型中的上下文依赖性可供性计算
Farzulla, Murad
Abstract
We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
Chinese Translation
我们对视觉语言模型(VLMs)中上下文依赖性可供性计算现象进行了特征描述。通过对大规模计算研究(n=3,213个来自COCO-2017的场景上下文对),使用Qwen-VL 30B和LLaVA-1.5-13B,并在7种代理角色下进行系统的上下文引导,我们展示了显著的可供性漂移:上下文条件之间的平均Jaccard相似度为0.095(95%置信区间:[0.093, 0.096],p < 0.0001),表明超过90%的词汇场景描述是依赖于上下文的。句子级余弦相似度证实了语义层面的显著漂移(平均值=0.415,58.5%依赖于上下文)。随机基线实验(在4个温度和5个种子下进行2,384次推理运行)确认这种漂移反映了真实的上下文效应,而非生成噪声:在所有条件下,内部引导方差显著低于跨引导方差。使用Tucker分解和自助稳定性分析(n=1,000次重抽样)揭示了稳定的正交潜在因子:一个“烹饪流形”专门针对厨师上下文,以及一个“访问轴”涵盖儿童移动性对比。这些发现确立了VLMs以显著的上下文依赖方式计算可供性——词汇(90%)和语义(58.5%)测量之间的差异反映了在上下文变化下表面词汇的变化大于潜在意义的变化——并为机器人研究提供了一个方向:动态的、查询依赖的本体投影(JIT本体)而非静态世界建模。我们并不声称建立处理顺序或架构优先性;此类声明需要超出输出行为的内部表征分析。
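The lexical-drift measure used above is plain Jaccard similarity over description tokens. A toy example with invented scene descriptions under two persona primes:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Descriptions of the same kitchen scene under two context primes
# (tokens are made up to mimic chef vs child-safety framings).
chef_desc = "knife board ingredients prep surface".split()
child_desc = "sharp danger reachable table hazard surface".split()
drift = 1 - jaccard(chef_desc, child_desc)
```

A mean Jaccard of 0.095 across condition pairs, as reported, means roughly 90% of the description vocabulary changes with the prime, which is the "affordance drift" the paper quantifies at the lexical level.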
cs.CL / 14 / 2603.04421

Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

混合供应商多智能体大语言模型是否改善临床诊断?
Yuan, Grace Chang, Zhang, Xiaoman, Kim, Sung Eun, Rajpurkar, Pranav
Abstract
Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
Chinese Translation
多智能体大语言模型(LLM)系统作为一种有前景的临床诊断方法,利用智能体之间的协作来完善医学推理。然而,现有的大多数框架依赖于单一供应商团队(例如,来自同一模型系列的多个智能体),这可能导致相关的失败模式,强化共享偏见而非纠正它们。我们通过比较单一LLM、单一供应商和混合供应商多智能体对话(MAC)框架,研究供应商多样性的影响。使用三个医生智能体,分别基于o4-mini、Gemini-2.5-Pro和Claude-4.5-Sonnet,我们在RareBench和DiagnosisArena上评估其性能。混合供应商配置在性能上始终优于单一供应商对应物,达到了最先进的召回率和准确性。重叠分析揭示了其潜在机制:混合供应商团队汇聚了互补的归纳偏见,提出了个别模型或同质团队共同遗漏的正确诊断。这些结果突显了供应商多样性作为设计稳健临床诊断系统的关键原则。
cs.CL / 15 / 2603.04423

Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation

使用自我指导和低秩适应生成真实的、符合协议的海事无线电对话
Akdeniz, Gürsel, Nakilcioglu, Emin Cagatay
Abstract
VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance-aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO's SMCP. Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP compliance, logical consistency, and linguistic diversity. We employ LoRA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues. Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.
Chinese Translation
VHF无线电误通信仍然是海事操作中的主要安全风险,2014年至2023年间,欧洲记录的事件中人因因素占比超过58%。尽管经过数十年的操作使用,VHF无线电通信仍然容易受到噪声、干扰、语言变异以及缺乏实时转录的影响,这使得程序错误频繁且难以纠正。开发AI辅助系统以支持实时通信和决策需要大量高质量的海事数据,但操作、监管和隐私限制使得此类数据集稀缺。本研究提出了一种合规意识的自我指导方法,用于生成符合国际海事组织(IMO)《标准航海通信用语》(SMCP)的真实海事无线电对话。我们的方法将包含26个过滤器的验证管道直接集成到迭代生成循环中,以确保实体信息的准确性、幻觉检测、SMCP合规性、逻辑一致性和语言多样性。我们采用LoRA进行参数高效的微调,减少训练过程中的计算开销,并使得生成模型在资源受限的海事系统上高效部署。为了评估数据集质量,我们引入了一种新颖的评估框架,结合了自动化和专家评估:格式准确性、信息准确性、独特性和逻辑连贯性。使用公开可用的船舶、沿海和AIS数据集的实验表明,该方法生成了合成多样、程序合规且操作真实的对话。尽管自动语音识别和自然语言处理等下游应用留待未来工作,但发布的代码、数据集和验证工具为人工智能辅助的海事安全及其他安全关键领域提供了可重复的基础。
cs.CL / 16 / 2603.04429

What Is Missing: Interpretable Ratings for Large Language Model Outputs

缺失的是什么:大型语言模型输出的可解释评分
Stranges, Nicholas, Yang, Yimin
Abstract
Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing; we embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data. We use "interpretable" in the following limited sense: for each scalar rating, we can inspect the judge's missing-information text that produced it, enabling qualitative debugging of the preference labels.
Chinese Translation
当前的大型语言模型(LLM)偏好学习方法,如近端策略优化(Proximal Policy Optimization)和直接偏好优化(Direct Preference Optimization),是通过模型输出的直接排名或数值评分进行学习的,这些排名是主观的,单一的数值评分由评审直接选择,无法有效代表自然语言的质量。我们提出了缺失的是什么(What Is Missing, WIM)评分系统,以自然语言反馈生成排名。WIM可以集成到现有的训练流程中,能够与其他评分技术结合使用,并且可以作为任何偏好学习方法的输入,而无需改变学习算法。为了计算WIM评分,人类或LLM评审会撰写反馈,描述模型输出缺失的内容。我们使用句子嵌入模型对输出和反馈进行嵌入,并计算生成向量之间的余弦相似度。我们实证观察到,与离散数值评分相比,WIM产生的平局较少,评分差异更大,从而提高了成对偏好数据中学习信号的可用性。我们在以下有限的意义上使用“可解释”:对于每个标量评分,我们可以检查产生该评分的评审缺失信息文本,从而实现对偏好标签的定性调试。
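The WIM computation is embed-and-compare; the sketch below substitutes bag-of-words counts for the sentence-embedding model the paper uses, so only the shape of the computation matches, and the example texts are invented.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def wim_rating(output_text, missing_feedback):
    """WIM-style scalar rating: embed the model output and the judge's
    'what is missing' feedback, then score their similarity. Bag-of-words
    counts stand in for a real sentence-embedding model."""
    u = Counter(output_text.lower().split())
    v = Counter(missing_feedback.lower().split())
    return cosine(u, v)

rating = wim_rating("the answer covers pricing",
                    "missing details about pricing tiers")
```

Because the rating is continuous, two outputs rarely receive identical values, which is the mechanism behind WIM's fewer ties relative to discrete numerical scores.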
cs.CL / 17 / 2603.04452

A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science

一个统一的基础框架用于知识注入和大型语言模型在燃烧科学中的评估
Yang, Zonglin, Mao, Runze, Wu, Tianhao, Li, Han, Zhou, QingGuo, Chen, Zhi X.
Abstract
To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage's performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).
Chinese Translation
为了推动基础大型语言模型(LLMs)在燃烧科学中的应用,本研究提出了第一个为燃烧社区开发领域专用模型的端到端框架。该框架包括一个规模达到35亿个标记的AI就绪(AI-ready)多模态知识库,提取自超过200,000篇经过同行评审的文章、8,000篇论文和学位论文,以及大约400,000行燃烧计算流体动力学(CFD)代码;一个严格且大部分自动化的评估基准(CombustionQA,涵盖八个子领域的436个问题);以及一个三阶段的知识注入路径,从轻量级检索增强生成(RAG)进展到知识图谱增强检索和持续预训练。我们首先对第一阶段(简单RAG)进行定量验证,发现存在一个硬性上限:标准RAG的准确率最高为60%,远超零样本(zero-shot)表现(23%),但仍低于理论上限(87%)。我们进一步证明,这一阶段的性能受到上下文污染的严重限制。因此,构建一个领域基础模型需要结构化知识图谱和持续的预训练(第二和第三阶段)。
cs.CL / 18 / 2603.04453

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

诱导数值不稳定性:多模态大型语言模型中的隐性成本
Wong, Wai Tuck, Sun, Jun, Sinha, Arunesh
Abstract
The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.
Chinese Translation
多模态大型语言模型的使用已变得普遍,因此对这些模型及其故障点的研究变得至关重要。我们研究了一种新型故障模式,该模式通过优化一个旨在最大化这些模型推理阶段数值不稳定性的损失项,间接导致性能下降。我们将该损失项作为优化目标,构建图像,当这些图像用于多模态大型语言模型时,会导致输出显著下降。我们在最先进的模型(大型视觉语言模型 LLaVa-v1.5-7B、Idefics3-8B、SmolVLM-2B-Instruct)上验证了我们的假设,并使用标准数据集(Flickr30k、MMVet、TextVQA、VQAv2、POPE、COCO)进行对比,结果表明,即使对输入图像进行非常小的修改,性能也会显著下降,相较于基线模型。我们的结果揭示了一种根本不同的性能下降向量,突显了一种未被对抗扰动捕捉到的故障模式。
cs.CL / 19 / 2603.04454

Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam

通过无答案上下文进行查询消歧义:人类最后考试的性能翻倍
Majurski, Michael, Matuszek, Cynthia
Abstract
How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using \texttt{gpt-oss-20b} to rewrite a subset of Humanity's Last Exam using answer-free grounding context improves \texttt{gpt-5-mini} accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm-rewrite-uplift
Chinese Translation
问题的措辞有多么谨慎和明确,对响应的质量产生深远影响,无论是对于语言模型(Language Models, LMs)还是人类。尽管模型能力持续提升,基础上下文与查询构造之间的相互作用仍然未得到充分探索。本研究探讨了模型上下文窗口中背景基础信息的质量如何影响准确性。我们发现,将良好基础的动态上下文构建(即 RAG)与查询重写相结合,可以减少问题的歧义,从而显著提高准确性。给定一个用户问题及其相关的无答案基础上下文,通过重写问题以减少歧义,能够在不改变答案本身的情况下实现基准改进,甚至与在问题前添加该上下文相比。使用 gpt-oss-20b 重写人类最后考试的一个子集,利用无答案基础上下文将 gpt-5-mini 的准确性从 0.14 提高到 0.37。我们证明,这种准确性提升不能仅通过推理时的提示完全恢复;相反,需要独立的重写和回答阶段。代码和数据可在 https://github.com/mmajurski/lm-rewrite-uplift 获取。
cs.CL / 20 / 2603.04592

From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models

从静态推理到动态交互:导航流式大型语言模型的领域
Tong, Junlong, Wang, Zilong, Ren, YuJie, Yin, Peiran, Wu, Hao, Zhang, Wei, Shen, Xiaoyu
Abstract
Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.
Chinese Translation
标准的大型语言模型(LLMs)主要设计用于静态推理,依赖于预定义的输入,这限制了它们在动态实时场景中的适用性。为了解决这一问题,流式LLM范式应运而生。然而,现有的流式LLM定义仍然零散,将流式生成、流式输入和交互式流式架构混为一谈,而系统性的分类仍然缺乏。本文提供了流式LLM的全面概述和分析。首先,我们基于数据流和动态交互建立了流式LLM的统一定义,以澄清现有的模糊之处。在此定义的基础上,我们提出了当前流式LLM的系统分类,并对其基础方法进行了深入讨论。此外,我们探讨了流式LLM在现实场景中的应用,并概述了支持流式智能持续进展的有前景的研究方向。我们在 https://github.com/EIT-NLP/Awesome-Streaming-LLMs 维护一个持续更新的相关论文库。
cs.CL / 21 / 2603.04597

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

利用群体级自然语言反馈进行强化学习的自助探索
Huang, Lei, Cheng, Xiang, Zhao, Chenxiao, Shen, Guobin, Yang, Junjie, Feng, Xiaocheng, Gu, Yuxuan, Yu, Xing, Qin, Bing
Abstract
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2$\times$ improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
Chinese Translation
大型语言模型(LLMs)通常通过与环境的互动接收多样的自然语言(NL)反馈。然而,当前的强化学习(RL)算法仅依赖于标量奖励,导致自然语言反馈中的丰富信息未得到充分利用,从而导致探索效率低下。在本研究中,我们提出了GOLF,一个强化学习框架,明确利用群体级语言反馈来通过可操作的改进指导有针对性的探索。GOLF聚合了两种互补的反馈来源:(i)外部批评,指出错误或提出有针对性的修复建议,以及(ii)组内尝试,提供替代的部分想法和多样的失败模式。这些群体级反馈被聚合以产生高质量的改进,这些改进被自适应地注入训练中,作为离策略支架,在稀疏奖励区域提供有针对性的指导。同时,GOLF在统一的强化学习循环中共同优化生成和改进,创造一个持续提升能力的良性循环。在可验证和不可验证的基准测试中的实验表明,GOLF在性能和探索效率上均表现优越,与仅基于标量奖励训练的强化学习方法相比,样本效率提高了2.2倍。代码可在 https://github.com/LuckyyySTA/GOLF 获取。
cs.CL / 22 / 2603.04647

Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models

大语言模型的检索增强生成中的协调语义对齐与证据约束
Chen, Xin, Gadgil, Saili Uday, Qiu, Jiarong
Abstract
Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.
Chinese Translation
检索增强生成通过引入外部知识,缓解了大语言模型在事实一致性和知识更新方面的局限性。然而,实际应用仍然面临检索结果与生成目标之间的语义不对齐以及证据利用不足的问题。为了解决这些挑战,本文提出了一种检索增强生成方法,该方法通过协调建模检索和生成阶段,将语义对齐与证据约束相结合。该方法首先在统一的语义空间中表示查询与候选证据之间的相关性。这确保了检索结果与生成目标在语义上的一致性,并减少了噪声证据和语义漂移的干扰。在此基础上,引入了显式的证据约束机制。检索到的证据从隐式上下文转变为生成中的核心控制因素。这限制了生成内容的表达范围,并增强了对证据的依赖性。通过在统一框架内共同建模语义一致性和证据约束,所提出的方法提高了事实的可靠性和可验证性,同时保持了自然语言的流畅性。比较结果显示在多个生成质量指标上稳定提升。这证实了在检索增强生成任务中协调语义对齐与证据约束建模的有效性和必要性。
cs.CL / 23 / 2603.04656

iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

iAgentBench:高流量主题信息搜索代理的认知能力基准测试
Dammu, Preetam Prabhu Srikar, Palkhiwala, Arnav, Roosta, Tanya, Shah, Chirag
Abstract
With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.
Chinese Translation
随着搜索驱动的生成式问答系统的出现,用户越来越多地转向能够代表他们浏览、聚合和调和多个来源证据的工具。然而,许多广泛使用的问答基准仍然可以通过检索单个相关段落来回答,这使得它们不适合衡量跨源的认知能力,例如整合证据、追踪因果关系和解决主题各个方面之间的依赖关系。我们提出了iAgentBench,这是一个动态的开放域问答基准,旨在满足这些更高层次的信息需求,同时保持问题自然且基于现实的信息搜索行为。iAgentBench从现实世界的关注信号中提取种子主题,并使用常见的用户意图模式构建类似用户的问题,其答案需要结合来自多个来源的证据,而不仅仅是提取单个片段。每个实例都附带可追溯的证据和可审计的中间文档,以支持污染检查并实现对检索与综合失败的细致诊断。针对多个大型语言模型的实验表明,检索提高了准确性,但仅依靠检索并不能可靠地解决这些问题,这突显了评估证据使用而不仅仅是证据访问的必要性。
cs.CL / 24 / 2603.04657

Stan: An LLM-based thermodynamics course assistant

Stan:基于大型语言模型的热力学课程助手
Furst, Eric M., Venkateshwaran, Vasudevan
Abstract
Discussions of AI in education focus predominantly on student-facing tools -- chatbots, tutors, and problem generators -- while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material -- providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama~3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7--8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
Chinese Translation
关于人工智能在教育中的讨论主要集中在面向学生的工具——聊天机器人、辅导工具和问题生成器——而利用相同基础设施支持教师的潜力仍然未被充分探索。我们描述了Stan,一个为本科化学工程热力学课程构建的工具套件,基于我们开发和部署的数据管道,承担双重角色:为学生服务并支持教师,基于共享的讲座记录和结构化教材索引。在学生方面,检索增强生成(RAG)管道通过提取技术术语、将其与教材索引匹配,并合成带有特定章节和页码引用的有根据的回答来回答自然语言查询。在教师方面,相同的讲座记录语料库通过结构化分析管道进行处理,生成每节课的摘要,识别学生的问题和困惑时刻,并编目用于引入难点内容的轶事和类比——提供一个可搜索的、学期规模的教学记录,支持课程反思、提醒和改进。所有组件,包括语音转文本转录、结构化内容提取和互动查询回答,完全在本地控制的硬件上运行,使用开放权重模型(Whisper large-v3,Llama~3.1 8B),不依赖于云API,确保可预测的成本、完全的数据隐私和独立于第三方服务的可重复性。我们描述了设计、实施以及在部署70亿至80亿参数模型进行长讲座记录的结构化提取时遇到的实际失败模式,包括上下文截断、双峰输出分布和模式漂移,以及解决这些问题的缓解措施。
cs.CL / 25 / 2603.04678

Optimizing Language Models for Crosslingual Knowledge Consistency

优化跨语言知识一致性的语言模型
Liu, Tianyu, Qi, Jirui, Sachan, Mrinmaya, Cotterell, Ryan, Fernández, Raquel, Bisazza, Arianna
Abstract
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
Chinese Translation
大型语言模型常常表现出知识不一致性,这在多语言场景中尤为突出,因为模型可能会被要求用不同语言回答相似的问题,而不一致的回答会削弱其可靠性。在本研究中,我们展示了通过使用具有结构化奖励函数的强化学习可以缓解这一问题,从而获得具有一致跨语言响应的最优策略。我们引入了直接一致性优化(Direct Consistency Optimization, DCO),这是一种受DPO启发的方法,无需显式奖励模型,直接从大型语言模型(LLM)本身推导而来。全面的实验表明,DCO显著提高了多样化LLM的跨语言一致性,并在使用多语言样本进行训练时优于现有方法,同时在有金标准标签时补充了DPO。额外的实验展示了DCO在双语环境中的有效性、显著的领域外泛化能力,以及通过方向超参数实现的可控对齐。综合来看,这些结果确立了DCO作为提高多语言LLM中知识一致性的稳健且高效的解决方案。所有代码、训练脚本和评估基准已发布在 https://github.com/Betswish/ConsistencyRL。
cs.CL / 26 / 2603.04691

Non-Zipfian Distribution of Stopwords and Subset Selection Models

停用词的非齐夫分布与子集选择模型
Li, Wentian, Fontanelli, Oscar
Abstract
Stopwords are words that are not very informative to the content or the meaning of a language text. Most stopwords are function words but can also be common verbs, adjectives and adverbs. In contrast to the well known Zipf's law for rank-frequency plot for all words, the rank-frequency plot for stopwords are best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from the Zipf's law, but are fitted better by a quadratic function of log-token-count over log-rank than by BRF. Based on the observed rank of stopwords in the full word list, we propose a stopword (subset) selection model that the probability for being selected as a function of the word's rank $r$ is a decreasing Hill's function ($1/(1+(r/r_{mid})^\gamma)$); whereas the probability for not being selected is the standard Hill's function ( $1/(1+(r_{mid}/r)^\gamma)$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows the Zipf's law, as well as explaining the quadratic fitting function for the non-stopwords.
Chinese Translation
停用词是指对语言文本的内容或意义信息量不大的词汇。大多数停用词是功能词,但也可以是常见的动词、形容词和副词。与众所周知的齐夫定律(Zipf's law)在所有词汇的排名-频率图中的表现相对比,停用词的排名-频率图更适合用贝塔排名函数(Beta Rank Function,BRF)进行拟合。另一方面,非停用词的排名-频率图同样偏离齐夫定律,但相比于BRF,更适合用对数词元计数与对数排名的二次函数进行拟合。基于在完整词汇表中观察到的停用词排名,我们提出了一种停用词(子集)选择模型,其中被选择的概率作为词汇排名$r$的函数呈现递减的希尔函数(Hill's function)形式($1/(1+(r/r_{mid})^\gamma)$);而不被选择的概率则为标准的希尔函数($1/(1+(r_{mid}/r)^\gamma)$)。我们通过对独立文本集合的直接估计来验证这一选择概率模型。我们还从理论上证明,当原始完整词汇表遵循齐夫定律时,该模型会导致停用词的BRF排名-频率分布,并解释了非停用词的二次拟合函数。
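The two selection probabilities above are ordinary Hill's functions, and they are complementary: at every rank $r$, the probability of being selected as a stopword and the probability of not being selected sum to 1. A minimal numerical sketch (the values of `r_mid` and `gamma` are illustrative, not fitted parameters from the paper):

```python
def p_selected(rank, r_mid=100, gamma=2.0):
    """Decreasing Hill's function 1/(1+(r/r_mid)^gamma): probability
    that the word at this rank is selected as a stopword."""
    return 1.0 / (1.0 + (rank / r_mid) ** gamma)

def p_not_selected(rank, r_mid=100, gamma=2.0):
    """Standard Hill's function 1/(1+(r_mid/r)^gamma): probability
    of not being selected."""
    return 1.0 / (1.0 + (r_mid / rank) ** gamma)

# Selection probability falls with rank, crosses 0.5 at r = r_mid,
# and the two branches are complementary at every rank.
for rank in (1, 10, 100, 1000):
    print(rank, round(p_selected(rank), 4), round(p_not_selected(rank), 4))
```

The steepness parameter `gamma` controls how sharply the selection cuts off around `r_mid`; large `gamma` approaches a hard rank threshold.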
cs.CL / 27 / 2603.04698

Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement

基于大型语言模型的仇恨言论检测:数据增强与特征增强的应用
Nge, Brian Jing Hong, Su, Stefan, Nguyen, Thanh Thi, Wilson, Campbell, Phelan, Alexandra, Pfitzner, Naomi
Abstract
This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
Chinese Translation
本文评估了用于仇恨言论检测的数据增强和特征增强技术,比较了传统分类器(如Delta术语频率-逆文档频率(Delta TF-IDF))与基于变换器的模型(如DistilBERT、RoBERTa、DeBERTa、Gemma-7B、gpt-oss-20b)在不同数据集上的表现。研究考察了合成少数类过采样技术(SMOTE)、基于逆类比例的加权损失、词性标注(POS)以及文本数据增强对模型性能的影响。开源模型gpt-oss-20b始终取得最高的结果。另一方面,Delta TF-IDF对数据增强反应强烈,在Stormfront数据集上达到了98.2%的准确率。研究确认,隐性仇恨言论比显性仇恨内容更难以检测,并且增强效果依赖于数据集、模型和技术之间的相互作用。我们的研究为仇恨言论检测的发展提供了指导,强调了数据集特性、模型架构和增强策略之间的相互作用,从而支持更准确和上下文感知的自动检测。
cs.CL / 28 / 2603.04707

Detection of Illicit Content on Online Marketplaces using Large Language Models

利用大型语言模型检测在线市场中的非法内容
Tran, Quoc Khoa, Nguyen, Thanh Thi, Wilson, Campbell
Abstract
Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
Chinese Translation
在线市场在革新全球商业的同时,无意中促进了非法活动的传播,包括毒品贩运、假冒销售和网络犯罪。传统的内容审核方法,如人工审核和基于规则的自动化系统,在可扩展性、动态混淆技术和多语言内容方面面临挑战。尽管传统机器学习模型在简单场景中有效,但在面对非法市场交流中典型的语义复杂性和语言细微差别时,往往表现不佳。本研究探讨了大型语言模型(Large Language Models, LLMs)的有效性,特别是Meta的Llama 3.2和Google的Gemma 3,利用多语言DUTA10K数据集检测和分类在线市场中的非法内容。通过采用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)和量化等微调技术,这些模型与基础的基于变换器的模型(BERT)和传统机器学习基准(支持向量机和朴素贝叶斯)进行了系统的基准测试。实验结果显示,LLMs的优势具有任务依赖性。在二分类(非法与非非法)中,Llama 3.2的表现与传统方法相当。然而,在涉及40个特定非法类别的复杂、不平衡的多类分类中,Llama 3.2显著超越了所有基准模型。这些发现对提升在线安全具有重要的实际意义,为执法机构、电子商务平台和网络安全专家提供了更有效、可扩展和适应性强的非法内容检测和审核工具。
cs.CL / 29 / 2603.04718

AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

人工智能辅助的模拟法庭:在口头辩论中模拟特定司法问题的提问
Zhang, Kylie, Nadeem, Nimra, Zheng, Lucia, Stammbach, Dominik, Henderson, Peter
Abstract
In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.
Chinese Translation
在口头辩论中,法官通过对事实记录、法律主张及其论点强度的提问来探究律师的观点。为了准备这些提问,法学院和执业律师都依赖于模拟法庭:上诉听证会的实践模拟。利用美国最高法院口头辩论记录的数据库,我们研究了人工智能模型是否能够有效地模拟特定法官风格的提问,以用于模拟法庭式的训练。评估口头辩论的模拟是具有挑战性的,因为对于任何给定的提问环节并不存在单一正确的问题。相反,有效的提问应当反映出一系列理想特征,例如预见实质性法律问题、识别逻辑弱点以及保持适当的对抗性语气。我们提出了一个两层评估框架,利用互补的代理指标来评估模拟问题的逼真性和教学实用性。我们构建并评估了基于提示和自主型的口头辩论模拟器。我们发现,模拟问题常常被人类标注者认为是逼真的,并对真实标注(ground truth)的实质性法律问题达到了较高的召回率。然而,模型仍面临显著的不足,包括问题类型的多样性低和过于迎合。重要的是,这些不足在简单的评估方法下将难以被发现。
cs.CL / 30 / 2603.04738

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

IF-RewardBench:指令跟随评估的评判模型基准测试
Wen, Bosi, Niu, Yilin, Wang, Cunxiang, Ling, Xiaoying, Zhang, Ying, Ke, Pei, Wang, Hongning, Huang, Minlie
Abstract
Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at https://github.com/thu-coai/IF-RewardBench.
Chinese Translation
指令跟随是大型语言模型(LLMs)的基础能力,其改进依赖于来自评判模型的可扩展和准确的反馈。然而,由于现有元评估基准的多项不足,如数据覆盖不足和与模型优化场景不一致的过于简化的成对评估范式,当前评判模型在指令跟随方面的可靠性仍未得到充分探索。为此,我们提出了IF-RewardBench,这是一个全面的指令跟随元评估基准,涵盖了多样的指令和约束类型。对于每个指令,我们构建了一个偏好图,其中包含基于指令跟随质量的多个响应之间的所有成对偏好。这一设计使得能够采用列表评估范式,评估评判模型对多个响应进行排序的能力,这对于指导模型对齐至关重要。在IF-RewardBench上的广泛实验揭示了当前评判模型的显著不足,并表明我们的基准与下游任务性能相比,具有更强的正相关性。我们的代码和数据可在 https://github.com/thu-coai/IF-RewardBench 获取。
cs.CL / 31 / 2603.04759

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

从一而叠:多尺度自注入用于上下文窗口扩展
Han, Wei, Zhou, Pan, Yan, Shuicheng
Abstract
The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed \textit{self-injection}. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures).
Chinese Translation
当代大型语言模型(LLMs)有限的上下文窗口仍然是其在各个领域广泛应用的主要瓶颈。尽管在长上下文数据上进行持续预训练提供了一种直接的解决方案,但这会产生高昂的数据获取和计算成本。为了解决这一挑战,我们提出了SharedLLM,一个基于多粒度上下文压缩和查询感知信息获取的新框架。SharedLLM由两个堆叠的短上下文LLM组成:下层模型作为压缩器,上层模型作为解码器。下层模型将长输入压缩为紧凑的多粒度表示,然后将其转发给上层模型进行上下文感知处理。为了最大化效率,这一信息传递仅在最低层进行,避免了冗长的前向传递和冗余的交叉注意力操作。整个过程,其中上层和下层模型源自相同的基础LLM层,被称为“自注入”(self-injection)。为了支持这一架构,一个专门的基于树的数据结构使得上下文信息的高效编码和查询感知检索成为可能。尽管仅在8K标记的序列上进行训练,SharedLLM能够有效地推广到超过128K标记的输入。在一系列全面的长上下文建模和理解基准测试中,SharedLLM的性能优于或可与强基线相媲美,在效率和准确性之间达到了最佳平衡。此外,这些设计选择使得SharedLLM显著减少了内存占用,并带来了显著的推理速度提升(相较于流式处理提高2倍,相较于编码器-解码器架构提高3倍)。
cs.CL / 32 / 2603.04772

TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

TSEmbed:解锁通用多模态嵌入中的任务扩展
Wu, Yebo, Liu, Feng, Xie, Ziwei, Liu, Zhiyuan, Zhang, Changwang, Wang, Jun, Li, Li
Abstract
Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
Chinese Translation
尽管多模态大型语言模型(MLLMs)具有卓越的推理能力,但其在通用嵌入模型中的适应性受到任务冲突的显著阻碍。为了解决这一问题,我们提出了TSEmbed,一个通用多模态嵌入框架,该框架将专家混合(Mixture-of-Experts, MoE)与低秩适应(Low-Rank Adaptation, LoRA)相结合,以明确区分冲突的任务目标。此外,我们引入了一种新策略——专家感知负采样(Expert-Aware Negative Sampling, EANS),该策略利用专家路由分布作为语义相似性的内在代理。通过动态优先考虑与查询共享专家激活模式的信息丰富的困难负样本,EANS有效地增强了模型的区分能力并优化了嵌入边界。为了确保训练的稳定性,我们进一步设计了一个两阶段学习范式,在通过EANS优化表示之前巩固专家的专业化。TSEmbed在大规模多模态嵌入基准(Massive Multimodal Embedding Benchmark, MMEB)和真实工业生产数据集上实现了最先进的性能,为通用多模态嵌入中的任务级扩展奠定了基础。
cs.CL / 33 / 2603.04805

Attention's Gravitational Field:A Power-Law Interpretation of Positional Correlation

注意力的引力场:位置相关性的幂律解释
Zhang, Edward
Abstract
This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton's Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.
Chinese Translation
本文探讨了大型语言模型(LLMs)中位置关系和编码的基本原理,并引入了注意力引力场(Attention Gravitational Field, AGF)的概念。通过将位置编码与语义嵌入解耦,我们优化了模型架构,并在准确性上优于现有的编码方法。此外,我们对AGF进行了深入分析,展示了其与学习和稳定性曲线的内在一致性,以及与牛顿万有引力定律的经验一致性。通过对这些现象进行严格的理论探索,本研究代表了对注意力机制的解释的重要一步,并为未来在模型优化和可解释性方面的研究开辟了新的可能性。
cs.CL / 34 / 2603.04814

Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

超越上下文窗口:基于事实的记忆与长上下文大型语言模型在持久代理中的成本-性能分析
Pollertlam, Natchanon, Kornsuwannawit, Witchayut
Abstract
Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
Chinese Translation
持久对话人工智能系统面临选择:是将完整的对话历史传递给长上下文大型语言模型(LLM),还是维护一个专门的记忆系统以提取和检索结构化事实。我们将基于Mem0框架构建的基于事实的记忆系统与长上下文LLM推理在三个以记忆为中心的基准上进行比较——LongMemEval、LoCoMo和PersonaMemv2,并在准确性和累积API成本上评估这两种架构。长上下文的GPT-5-mini在LongMemEval和LoCoMo上实现了更高的事实回忆,而在PersonaMemv2上,记忆系统表现出竞争力,因为个性一致性依赖于适合扁平类型提取的稳定事实属性。我们构建了一个成本模型,考虑了提示缓存,并展示了这两种架构具有结构上不同的成本特征:长上下文推理在每轮交互中产生的费用随着上下文长度的增加而增长,即使在缓存的情况下也是如此,而记忆系统的每轮读取成本在一次写入阶段后基本保持不变。在100k个标记的上下文长度下,记忆系统在大约十次交互后变得更便宜,且随着上下文长度的增加,盈亏平衡点逐渐降低。这些结果描述了两种方法之间的准确性-成本权衡,并为在生产部署中选择它们提供了具体标准。
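The break-even behavior this abstract describes can be sketched numerically. The token price, cache discount, and write/read costs below are illustrative placeholders, not the paper's actual cost model; only the shape of the comparison is the point.

```python
def long_context_cost(turns, context_tokens, price_per_token=1e-6, cache_discount=0.5):
    """Cumulative cost of re-reading a (cached) context of fixed length every turn."""
    return turns * context_tokens * price_per_token * cache_discount

def memory_system_cost(turns, write_cost=0.4, read_cost_per_turn=0.01):
    """One-time fact-extraction write phase plus a roughly flat per-turn read cost."""
    return write_cost + turns * read_cost_per_turn

def break_even_turn(context_tokens):
    """First turn at which the memory system becomes strictly cheaper."""
    t = 1
    while memory_system_cost(t) >= long_context_cost(t, context_tokens):
        t += 1
    return t
```

With these placeholder numbers the break-even point lands near ten turns at a 100k-token context and drops as the context grows, matching the qualitative claim in the abstract.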
cs.CL / 35 / 2603.04820

Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

自动评分的反高潮:对人工智能短答案不足与措辞弱点的元分析理解
Hardy, Michael
Abstract
Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed effects metaregression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37--a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology to successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns--potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
Chinese Translation
自动化短答案评分落后于其他大型语言模型(LLM)应用。我们对890个最终结果进行了元分析,基于对LLM短答案评分研究的系统评审,采用混合效应元回归对传统效应量二次加权卡帕(Quadratic Weighted Kappa, QWK)进行建模。我们定量展示了人类专家在评分儿童书面作品任务中的难度水平对LLM表现没有可观察到的统计效应。特别地,我们表明一些被人类评分者评估为最简单的评分任务对LLM来说却是最困难的。无论是由于研究者的实施不当,还是可追溯到自回归训练的模式,平均而言,仅解码器架构的表现比编码器低0.37——这在与人类评分一致性上是一个显著的差异。此外,我们测量了LLM技术各个方面对成功评分的贡献,例如分词器词汇量,显示出递减收益——这可能是由于未充分训练的标记。研究结果主张系统设计应更好地预见自回归模型已知的统计不足。最后,我们提供了额外的实验来说明措辞和分词敏感性以及在高风险教育环境中引发的偏见,其中LLM表现出种族歧视。本研究的代码和数据可供使用。
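Quadratic Weighted Kappa, the effect size being modeled above, penalizes disagreements by the squared distance between score categories. A minimal reference implementation of the standard definition (not tied to this paper's code):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_cats):
    """QWK between two integer score vectors with categories in [0, n_cats)."""
    a, b = np.asarray(a), np.asarray(b)
    # Observed confusion matrix between the two raters
    O = np.zeros((n_cats, n_cats))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix under independence, from the marginal distributions
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(a)
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n_cats)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_cats - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement gives 1.0 and chance-level agreement gives 0.0, which is why a 0.37 gap between architectures is substantial on this scale.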
cs.CL / 36 / 2603.04828

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

从陌生到熟悉:通过大语言模型中的梯度偏差检测预训练数据
Zhang, Ruiqi, Wang, Lingxiang, Zhang, Hainan, Zheng, Zhiming, Lan, Yanyan
Abstract
Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.
Chinese Translation
大语言模型(LLMs)的预训练数据检测对于解决版权问题和减轻基准污染至关重要。现有方法主要集中在基于似然的统计特征或在微调前后的启发式信号上,但前者容易受到语料库中词频偏差的影响,而后者则强烈依赖于微调数据的相似性。从优化的角度来看,我们观察到在训练过程中,样本从陌生转变为熟悉,这一过程通过梯度行为的系统性差异得以反映。熟悉的样本表现出较小的更新幅度、模型组件中不同的更新位置以及更强烈激活的神经元。基于这一见解,我们提出了GDS(Gradient Deviation Scores),一种通过探测目标样本的梯度偏差分数来识别预训练数据的方法。具体而言,我们首先使用梯度轮廓表示每个样本,这些轮廓捕捉了FFN(前馈神经网络)和注意力模块中参数更新的幅度、位置和集中度,从而揭示成员数据和非成员数据之间的一致性差异。这些特征随后被输入到一个轻量级分类器中进行二元成员推断。在五个公共数据集上的实验表明,GDS在显著提高跨数据集可迁移性方面达到了最先进的性能,超越了强基线。此外,进一步的可解释性分析显示了梯度特征分布的差异,从而实现了实用且可扩展的预训练数据检测。
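The three kinds of gradient features named in the abstract (magnitude, location, concentration) can be illustrated on flat gradient arrays. This is a toy rendering of the idea; the paper's actual profiles over FFN and Attention modules will differ.

```python
import numpy as np

def gradient_profile(grads_by_module, top_frac=0.01):
    """Per-sample feature vector: overall update magnitude, share of the
    update mass landing in FFN modules (location), and share carried by
    the largest coordinates (concentration)."""
    flat = np.concatenate([g.ravel() for g in grads_by_module.values()])
    total_mass = np.abs(flat).sum()
    magnitude = np.linalg.norm(flat)
    ffn_mass = sum(np.abs(g).sum()
                   for name, g in grads_by_module.items() if "ffn" in name)
    location = ffn_mass / total_mass
    k = max(1, int(top_frac * flat.size))
    concentration = np.sort(np.abs(flat))[-k:].sum() / total_mass
    return np.array([magnitude, location, concentration])
```

Member and non-member samples would then be separated by a lightweight binary classifier over such feature vectors.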
cs.CL / 37 / 2603.04854

SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

SinhaLegal:用于锡兰立法文本信息提取与分析的基准语料库
Lasandi, Minduli, Jayatilleke, Nevidu
Abstract
SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
Chinese Translation
SinhaLegal 引入了一个包含约 200 万字的锡兰立法文本语料库,该语料库涵盖 1,206 份法律文件。数据集包括两类法律文件:1,065 部法律(Acts),时间范围为 1981 年至 2014 年,以及 141 项法案(Bills),时间范围为 2010 年至 2014 年,这些文件均系统地从官方来源收集。文本通过 Google Document AI 的光学字符识别(OCR)技术提取,随后进行了广泛的后处理和人工清理,以确保内容的高质量和机器可读性,并为每个文档提供了专门的元数据文件。进行了全面的评估,包括语料库统计、词汇多样性、词频分析、命名实体识别和主题建模,展示了该语料库的结构化和领域特定特性。此外,还使用大语言模型和小语言模型进行了困惑度分析,以评估语言模型对领域特定文本的响应效果。SinhaLegal 语料库代表了一项重要资源,旨在支持自然语言处理(NLP)任务,如摘要生成、信息提取和分析,从而填补锡兰法律研究中的关键空白。
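Perplexity, used above to probe how language models respond to the legal domain, is just the exponentiated mean negative log-likelihood per token:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities; lower means
    the model finds the text more predictable."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A model that assigned every token probability 1/4 would score a perplexity of exactly 4.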
cs.CL / 38 / 2603.04855

HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

HACHIMI:通过协调代理生成可扩展且可控的学生角色
Jiang, Yilin, Tan, Fei, Yin, Xuanyu, Leng, Jing, Zhou, Aimin
Abstract
Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI
Chinese Translation
学生角色(Student Personas, SPs)正逐渐成为教育大型语言模型(LLMs)的基础设施,但以往的研究往往依赖于临时提示或手工制作的角色,这些方法对教育理论和人口分布的控制有限。我们将此形式化为理论对齐和分布可控的角色生成(Theory-Aligned and Distribution-Controllable Persona Generation, TAD-PG),并引入HACHIMI,一个多代理的提议-验证-修订框架,生成理论对齐、配额控制的角色。HACHIMI将每个角色分解为一个以理论为基础的教育框架,通过神经符号验证器施加发展和心理约束,并结合分层抽样与语义去重以减少模式崩溃。最终生成的HACHIMI-1M语料库包含100万个适用于1至12年级的角色。内部评估显示出近乎完美的框架有效性、准确的配额和显著的多样性,而外部评估则将角色实例化为回答CEPS和PISA 2022调查的学生代理;在16个群体中,数学和好奇心/成长构念在人类与代理之间高度一致,而课堂气候和幸福感构念仅中等一致,揭示了忠实度梯度。所有角色均由Qwen2.5-72B生成,HACHIMI提供了一个标准化的合成学生群体,用于群体级基准测试和社会科学模拟。资源可在 https://github.com/ZeroLoss-Lab/HACHIMI 获取。
cs.CL / 39 / 2603.04857

FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

FireBench:评估企业和API驱动的LLM应用中的指令遵循能力
Zhang, Yunfan, Bei, Yijie, Ravi, Jetashree, Garbacki, Pawel
Abstract
Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at fire-bench.com to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.
Chinese Translation
指令遵循对于部署在企业和API驱动环境中的大型语言模型(LLMs)至关重要,在这些环境中,严格遵循输出格式、内容限制和程序要求是实现可靠的LLM辅助工作流程的关键。然而,现有的指令遵循基准主要评估反映聊天助手需求的自然语言生成约束,而非企业用户的需求。为填补这一空白,我们引入了FireBench,这是一个基于真实企业和API使用模式的LLM指令遵循基准。FireBench评估了六个核心能力维度,涵盖信息提取、客户支持和编码代理等多种应用,共包含超过2400个样本。我们评估了11个LLM,并呈现了它们在企业场景中的指令遵循行为的关键发现。我们在fire-bench.com上开源FireBench,以帮助用户评估模型适用性,支持模型开发者诊断性能,并邀请社区贡献。
cs.CL / 40 / 2603.04893

Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Pass@$k$的免费午餐?针对扩散语言模型的低成本多样性采样
Lamont, Sean, Walder, Christian, Montague, Paul, Dezfouli, Amir, Norrish, Michael
Abstract
Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training-free, low-cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
Chinese Translation
在复杂推理任务(如代码生成和数学问题解决)中,文本生成的多样性输出对于有效探索是必要的。这类Pass@$k$问题受益于覆盖解决方案空间的不同候选项。然而,传统的采样方法常常在重复的失败模式上浪费计算资源。尽管扩散语言模型(Diffusion Language Models)已成为当前自回归范式(Autoregressive paradigm)的竞争性替代方案,但它们仍然容易受到这种冗余的影响,独立样本经常会坍缩到相似的模式。为了解决这个问题,我们提出了一种无训练、低成本的干预方法,以增强扩散语言模型的生成多样性。我们的方法在一个批次中顺序修改中间样本,使每个样本远离之前样本的特征空间,积极惩罚冗余。与需要重新训练或束搜索的先前方法不同,我们的策略几乎不增加计算开销,同时确保每个样本为批次贡献独特的视角。我们在HumanEval和GSM8K基准上使用LLaDA-8B-Instruct模型评估了我们的方法。我们的结果表明,在不同的温度设置下,多样性和Pass@$k$性能显著提高。作为对采样过程的简单修改,我们的方法为当前和未来的扩散语言模型在需要多样性解决方案搜索的任务中提供了即时、低成本的改进。我们的代码可在https://github.com/sean-lamont/odd获取。
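The sequential repulsion idea can be sketched on plain feature vectors. This is our schematic reading of the abstract, not the authors' update rule; `step` is an assumed hyperparameter.

```python
import numpy as np

def repel_from_previous(features, step=0.5):
    """Process batch samples in order; nudge each one away from the
    features of every sample already generated, penalising redundancy."""
    out = [features[0]]
    for f in features[1:]:
        for prev in out:
            d = f - prev
            norm = np.linalg.norm(d)
            if norm > 1e-8:
                f = f + step * d / norm   # push away along the direction from prev
        out.append(f)
    return np.stack(out)
```

Because each sample only repels against already-finished ones, the modification stays sequential and adds negligible overhead per batch element.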
cs.CL / 41 / 2603.04897

Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

大型语言模型能否捕捉专家的不确定性?民族志定性研究中价值对齐的比较分析
Kostina, Arina, Dikaiakos, Marios, Porcel, Alejandro, Stassopoulos, Tassos
Abstract
Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
Chinese Translation
开放式访谈的定性分析在民族志和经济研究中发挥着核心作用,通过揭示个体的价值观、动机和文化嵌入的金融行为。虽然大型语言模型(LLMs)在自动化和丰富此类解释性工作方面提供了有前景的支持,但它们在固有任务模糊性下产生细致、可靠解释的能力仍不明确。在我们的研究中,我们评估了LLMs在基于施瓦茨基本价值理论框架的长篇访谈中识别表达的前三个人类价值的任务。我们将它们的输出与专家注释进行了比较,分析了相对于专家的表现和不确定性模式。结果显示,LLMs在基于集合的指标(F1,雅卡尔指数)上接近人类的上限,但在恢复确切的价值排名方面表现不佳,这在较低的RBO得分中得以体现。尽管大多数模型的施瓦茨价值分布与人类分析师的分布非常接近,但它们在施瓦茨价值上的不确定性结构与专家的不确定性模式存在差异。在评估的模型中,Qwen的表现最接近专家级一致性,并与专家的施瓦茨价值分布表现出最强的对齐。LLM集成方法在各项指标上均取得了一致的提升,其中多数投票和博尔达计数表现最佳。值得注意的是,对某些施瓦茨价值(如安全性)的系统性过度强调,既表明了LLMs提供补充视角的潜力,也表明了需要进一步研究模型引起的价值偏见。总体而言,我们的研究结果突显了LLMs作为固有模糊定性价值分析中的合作者的潜力与局限性。
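The gap the abstract reports between set-based metrics and Rank-Biased Overlap is easy to see on top-3 value lists. A minimal Jaccard and truncated RBO, using the standard formulas simplified for equal-length lists:

```python
def jaccard(a, b):
    """Set overlap of the two annotators' top-value sets (order-blind)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def rbo_truncated(a, b, p=0.9):
    """Truncated Rank-Biased Overlap of two short ranked lists: overlap at
    each depth d, weighted geometrically so top ranks matter most, then
    normalised so identical lists score 1."""
    k = min(len(a), len(b))
    score = 0.0
    for d in range(1, k + 1):
        score += (p ** (d - 1)) * len(set(a[:d]) & set(b[:d])) / d
    return (1 - p) / (1 - p ** k) * score
```

Two annotators who pick the same three Schwartz values in a different order get a perfect Jaccard score but a noticeably lower RBO, exactly the pattern reported for the LLMs.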
cs.CL / 42 / 2603.04921

AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

AILS-NTUA在SemEval-2026任务10中的表现:用于心理语言学标记提取和阴谋支持检测的代理型大语言模型
Spanakis, Panagiotis Alexios, Lymperaiou, Maria, Filandrianos, Giorgos, Voulodimos, Athanasios, Stamou, Giorgos
Abstract
This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an "Anti-Echo Chamber" architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the "Reporter Trap," where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.
Chinese Translation
本文提出了一种新颖的代理型大语言模型(LLM)管道,用于SemEval-2026任务10,旨在联合提取心理语言学阴谋标记并检测阴谋支持。与传统分类器将语义推理与结构定位混为一谈不同,我们的解耦设计将这些挑战分开处理。在标记提取方面,我们提出了动态判别思维链(Dynamic Discriminative Chain-of-Thought, DD-CoT)与确定性锚定相结合,以解决语义模糊性和字符级脆弱性的问题。在阴谋检测方面,我们设计了一种“反回音室”(Anti-Echo Chamber)架构,由一个对抗性的平行委员会(Parallel Council)和一个经过校准的评审(Calibrated Judge)组成,克服了“记者陷阱”(Reporter Trap),在该陷阱中,模型错误地惩罚客观报道。在S1上实现了0.24的宏F1(比基线提高100%),在S2上实现了0.79的宏F1(提高49%),S1系统在开发排行榜上排名第三,我们的方法为可解释的、基于心理语言学的自然语言处理(NLP)建立了一个多功能范式。
cs.CL / 43 / 2603.04933

AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis

AILS-NTUA在SemEval-2026任务3中的表现:高效的维度基础情感分析
Gazetas, Stavros, Filandrianos, Giorgos, Lymperaiou, Maria, Tzouveli, Paraskevi, Voulodimos, Athanasios, Stamou, Giorgos
Abstract
In this paper, we present AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
Chinese Translation
在本文中,我们介绍了AILS-NTUA系统,该系统用于SemEval-2026任务3的Track-A,专注于维度基础情感分析(Dimensional Aspect-Based Sentiment Analysis, DimABSA),该任务涵盖三个互补的问题:维度基础情感回归(Dimensional Aspect Sentiment Regression, DimASR)、维度基础情感三元组提取(Dimensional Aspect Sentiment Triplet Extraction, DimASTE)和维度基础情感四元组预测(Dimensional Aspect Sentiment Quadruplet Prediction, DimASQP),并在多语言和多领域框架内进行。我们的方法论结合了针对连续方面级情感预测的语言适配编码器骨干的微调,以及使用LoRA对大型语言模型进行的语言特定指令调优,以实现结构化三元组和四元组的提取。这种统一而又任务自适应的设计强调了跨语言和领域的参数高效专业化,能够在保持强大有效性的同时减少训练和推理的需求。实证结果表明,所提出的模型在大多数评估设置中表现出竞争力,并始终超越提供的基线。
cs.CL / 44 / 2603.04945

Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

用于混合自动语音识别的联邦异构语言模型优化
Hong, Mengze, Gu, Yi, Jiang, Di, Gu, Hanlin, Zhang, Chen Jason, Wang, Lu, Su, Zhiyang
Abstract
Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
Chinese Translation
自动语音识别(ASR)模型的训练越来越依赖于去中心化的联邦学习,以确保数据隐私和可访问性,从而产生多个本地模型,这些模型需要有效合并。在混合ASR系统中,虽然声学模型可以使用既定方法进行合并,但用于重新评分N-best语音识别列表的语言模型(LM)由于非神经n-gram模型和神经网络模型的异构性面临挑战。本文提出了一种异构LM优化任务,并引入了一种匹配与合并范式,包含两个算法:遗传匹配与合并算法(Genetic Match-and-Merge Algorithm, GMMA),利用遗传操作来进化和配对LM,以及强化匹配与合并算法(Reinforced Match-and-Merge Algorithm, RMMA),利用强化学习实现高效收敛。在七个OpenSLR数据集上的实验表明,RMMA实现了最低的平均字符错误率,并且比基线具有更好的泛化能力,收敛速度比GMMA快多达七倍,突显了该范式在可扩展、保护隐私的ASR系统中的潜力。
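For context, hybrid ASR rescoring combines each hypothesis's acoustic score with weighted LM log-probabilities over the N-best list; the matched pool of heterogeneous LMs produced by GMMA/RMMA would slot into the `lms` list here. A toy version (our illustration, not the paper's algorithms):

```python
def rescore_nbest(nbest, lms, weights):
    """Pick the best hypothesis from an N-best list. `nbest` holds
    (acoustic_score, text) pairs; `lms` are callables returning a
    log-probability for a text, e.g. an n-gram LM and a neural LM."""
    def total(hyp):
        acoustic, text = hyp
        return acoustic + sum(w * lm(text) for lm, w in zip(lms, weights))
    return max(nbest, key=total)
```

The optimization task is then to choose which local LMs to pair and how to weight them, which is what the genetic and reinforcement-learning variants search over.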
cs.CL / 45 / 2603.04946

LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

LocalSUG:面向本地生活服务的地理感知大语言模型查询建议
Chen, Jinwen, Gong, Shuai, Zhang, Shiwen, Zhang, Zheng, Zhao, Yachao, Wang, Lingxiang, Zhou, Haibo, Zhan, Yuan, Lin, Wei, Zhang, Hainan
Abstract
In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
Chinese Translation
在本地生活服务平台中,查询建议模块通过根据用户输入前缀生成候选查询,在提升用户体验方面发挥着至关重要的作用,从而减少用户的努力并加速搜索。传统的多阶段级联系统严重依赖历史热门查询,限制了其满足长尾需求的能力。尽管大语言模型(LLMs)提供了强大的语义泛化能力,但在本地生活服务中部署它们面临三个主要挑战:缺乏地理基础、偏好优化中的曝光偏差以及在线推理延迟。为了解决这些问题,我们提出了LocalSUG,一个基于LLM的查询建议框架,专为本地生活服务平台量身定制。首先,我们引入了一种基于术语共现的城市感知候选挖掘策略,以将地理基础注入生成过程。其次,我们提出了一种基于束搜索的GRPO算法,使训练与推理时解码对齐,从而减少自回归生成中的曝光偏差。一个多目标奖励机制进一步优化了相关性和商业导向指标。最后,我们开发了质量感知的束加速和词汇修剪技术,显著降低了在线延迟,同时保持了生成质量。大量的离线评估和大规模在线A/B测试表明,LocalSUG将点击率(CTR)提高了0.35%,并将低/无结果率降低了2.56%,验证了其在实际部署中的有效性。
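The city-aware candidate mining step can be sketched with a simple term co-occurrence count (our sketch of the idea, not LocalSUG's algorithm): frequent term pairs in a city's historical queries serve as geographically grounded material for generation.

```python
from collections import Counter
from itertools import combinations

def mine_city_candidates(city_queries, min_count=2):
    """Per city, count which term pairs co-occur in historical queries and
    keep the frequent ones as grounded candidate material."""
    out = {}
    for city, queries in city_queries.items():
        pairs = Counter()
        for q in queries:
            terms = sorted(set(q.split()))
            pairs.update(combinations(terms, 2))
        out[city] = {p: c for p, c in pairs.items() if c >= min_count}
    return out
```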
cs.CL / 46 / 2603.04964

Replaying pre-training data improves fine-tuning

重放预训练数据可以改善微调效果
Kotha, Suhas, Liang, Percy
Abstract
To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.
Chinese Translation
为了获得针对特定领域(例如数学)的语言模型,目前的范式是在大量通用网络文本上进行预训练,然后在相对有限的目标数据上进行微调。通常,通用数据仅在微调期间混合使用,以防止对通用领域的灾难性遗忘。我们惊讶地发现,在微调过程中重放通用数据实际上可以提高在(相关性较低的)目标任务上的表现。具体而言,在一个受控的预训练环境中,使用4M目标标记、4B总标记和150M参数模型,通用重放在微调时可以将目标数据效率提高至$1.87\times$,在中期训练时提高至$2.06\times$。我们进一步分析了在预训练期间引入目标数据的数据调度,并发现当预训练中存在较少目标数据时,重放的效果更为显著。我们在实践中展示了重放的成功,微调8B参数模型时,代理的网络导航成功率提高了4.5%,巴斯克语问答准确率提高了2%。
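Generic replay during fine-tuning amounts to a batch schedule that mixes the two corpora. A minimal sketch; the 50/50 `replay_ratio` is an assumption for illustration, not the paper's tuned value.

```python
import random

def mixed_batches(target_docs, generic_docs, replay_ratio=0.5, batch_size=8, seed=0):
    """Yield fine-tuning batches that draw `replay_ratio` of their examples
    from the generic pre-training corpus and the rest from the target domain."""
    rng = random.Random(seed)
    n_generic = int(replay_ratio * batch_size)
    while True:
        batch = rng.sample(generic_docs, n_generic) + \
                rng.sample(target_docs, batch_size - n_generic)
        rng.shuffle(batch)
        yield batch
```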
cs.CL / 47 / 2603.04968

When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

当弱LLM自信发言时,偏好对齐变得更强
Afzali, Amirabbas, Jeon, Myeongho, Brbic, Maria
Abstract
Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
Chinese Translation
偏好对齐是将大型语言模型(LLMs)适应人类价值观的重要步骤,但现有的方法通常依赖于昂贵的人类注释或大规模基于API的模型。我们探讨了弱LLM是否可以作为有效的注释者。令人惊讶的是,仅选择弱LLM高度自信样本的子集,所获得的性能显著优于使用完整的人类注释。在此基础上,我们提出了置信加权偏好优化(Confidence-Weighted Preference Optimization, CW-PO),这是一个通用框架,通过弱LLM的置信度对训练样本进行重加权,并可应用于不同的偏好优化目标。值得注意的是,使用仅20%人类注释的CW-PO对齐的模型,在标准DPO下的表现超过了使用100%注释训练的模型。这些结果表明,当与置信加权结合使用时,弱LLM能够显著降低偏好对齐的成本,同时甚至超越基于完全人类标注数据训练的方法。
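Re-weighting preference pairs by annotator confidence can be sketched on top of the standard DPO objective. This is our reading of the abstract, not the authors' exact loss; inputs are per-pair policy/reference log-ratios.

```python
import numpy as np

def cw_dpo_loss(logratio_chosen, logratio_rejected, confidence, beta=0.1):
    """Standard DPO logistic loss per preference pair, re-weighted by the
    weak annotator's confidence in that pair."""
    margin = beta * (np.asarray(logratio_chosen) - np.asarray(logratio_rejected))
    per_pair = np.log1p(np.exp(-margin))        # equals -log sigmoid(margin)
    weights = np.asarray(confidence)
    return float((weights * per_pair).sum() / weights.sum())
```

Low-confidence pairs contribute little, so filtering to the weak LLM's most confident subset falls out as the special case of weights in {0, 1}.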
cs.CL / 48 / 2603.04969

MPCEval: A Benchmark for Multi-Party Conversation Generation

MPCEval:多方对话生成的基准测试
Zhang, Minxing, Yang, Yi, Jia, Zhuofan, Yang, Xuan, Pei, Jian, Zang, Yuchen, Deng, Xingwang, Chen, Xianglong
Abstract
Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker--content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker--content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
Chinese Translation
多方对话生成,如智能回复和协作助手,是生成性人工智能日益重要的能力,但其评估仍然是一个关键瓶颈。与双方对话相比,多方设置引入了独特的挑战,包括复杂的轮次交替、角色依赖的发言者行为、长距离的对话结构以及多个同样有效的延续。因此,我们推出了MPCEval,一个针对多方对话生成的任务感知评估和基准测试套件。MPCEval将生成质量分解为发言者建模、内容质量和发言者-内容一致性,并明确区分局部的下一轮预测与全局的完整对话生成。它提供了新颖的、定量的、无参考的和可重复的指标,能够在不同数据集和模型之间进行扩展。我们将MPCEval应用于多样的公共和现实世界数据集,并评估现代生成方法与人类创作的对话。结果揭示了参与平衡、内容进展和新颖性、以及发言者-内容一致性方面模型特征的系统性、维度特异性,表明评估目标在模型评估中起着关键作用,而单一得分评估则掩盖了多方对话行为中的基本差异。MPCEval的实现及相关评估代码已公开发布在 https://github.com/Owen-Yang-18/MPCEval。
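Participation balance, one of the speaker-modeling dimensions above, can be measured as the normalised entropy of the turn distribution (an illustrative metric in the spirit of the suite, not MPCEval's exact definition):

```python
import math
from collections import Counter

def participation_balance(speakers):
    """Normalised entropy of speaker turns: 1.0 when all participants
    speak equally often, 0.0 when a single speaker dominates."""
    counts = Counter(speakers)
    n = len(speakers)
    probs = [c / n for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0
```

Being reference-free, such a statistic scales across datasets and models without gold continuations.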
cs.CL / 49 / 2603.04974

VRM: Teaching Reward Models to Understand Authentic Human Preferences

VRM:教会奖励模型理解真实的人类偏好
Liu, Biao, Xu, Ning, Yang, Junming, Xu, Hao, Geng, Xin
Abstract
Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often suffer from reward hacking: prevailing approaches map prompt-response pairs directly to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.
Chinese Translation
大型语言模型(LLMs)在各种自然语言任务中取得了显著成功,但用于对齐LLMs的奖励模型常常面临奖励黑客问题,这些方法主要依赖于将提示-响应对直接映射到标量分数,这可能无意中捕捉到虚假的相关性,而非真实的人类偏好。相比之下,人类评估采用了一种复杂的过程,首先根据提示上下文权衡多个高维目标的相对重要性,然后通过逻辑一致性和上下文适当性等低维语义特征评估响应质量。基于这一考虑,我们提出了VRM,即变分奖励建模(Variational Reward Modeling),这是一个新颖的框架,通过将高维目标权重和低维语义特征作为潜在变量,明确建模人类偏好判断的评估过程,这些变量通过变分推断技术进行推断。此外,我们提供了理论分析,表明与传统奖励模型相比,VRM可以实现更紧的泛化误差界限。在基准数据集上的大量实验表明,VRM在捕捉真实人类偏好方面显著优于现有方法。
cs.CL / 50 / 2603.04992

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

ThaiSafetyBench:在泰国文化背景下评估语言模型的安全性
Ukarapol, Trapoom, Chukamphaeng, Nut, Pipatanakul, Kunat, Sarapat, Pakhapoom
Abstract
The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation. - ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench - ThaiSafetyBench Github: https://github.com/trapoom555/ThaiSafetyBench - ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier - ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
Chinese Translation
大型语言模型(LLMs)的安全性评估主要集中在英语上,非英语语言和文化相关风险尚未得到充分探索。在本研究中,我们探讨了泰语和文化背景下的LLM安全性,并推出了ThaiSafetyBench,这是一个包含1,954个用泰语编写的恶意提示的开源基准数据集。该数据集涵盖了一般有害提示以及明确基于泰国文化、社会和上下文细微差别的攻击。利用ThaiSafetyBench,我们评估了24个LLM,其中GPT-4.1和Gemini-2.5-Pro作为评估者。我们的结果显示,闭源模型通常表现出比开源模型更强的安全性,提出了关于公开可用模型稳健性的重要担忧。此外,我们观察到,针对泰国特定文化背景的攻击成功率(ASR)始终高于一般泰语攻击,突显了当前安全对齐方法中的一个关键脆弱性。为了提高可重复性和成本效率,我们进一步微调了一个基于DeBERTa的有害响应分类器,命名为ThaiSafetyClassifier。该模型实现了84.4%的加权F1分数,与GPT-4.1的判断相匹配。我们公开发布了微调权重和训练脚本,以支持可重复性。最后,我们推出了ThaiSafetyBench排行榜,以提供持续更新的安全评估并鼓励社区参与。
cs.CL / 51 / 2603.04996

HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation

HiFlow:基于层次反馈驱动的约束长文本生成优化
Zhu, Yifan, Chen, Guanting, Wei, Bing, Luo, Haoran
Abstract
Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow's effectiveness over baseline methods.
Chinese Translation
大型语言模型在短文本生成方面表现良好,但在长文本生成,特别是在复杂约束下仍然面临挑战。这类任务涉及多个紧密耦合的目标,包括全局结构一致性、局部语义连贯性和约束可行性,形成了一个具有挑战性的约束优化问题。现有方法主要依赖静态规划或离线监督,限制了生成过程中全局和局部目标之间的有效协调。为了解决这些挑战,我们提出了HiFlow,一个用于约束长文本生成的层次反馈驱动优化框架。HiFlow将生成过程形式化为一个两级优化过程,包括用于全局结构和约束建模的规划层,以及用于条件文本生成的生成层。通过在两个层次上结合约束感知的计划筛选和闭环反馈,HiFlow实现了规划质量和生成行为的联合优化,逐步引导模型朝向高质量、满足约束的输出。多个基础模型上的实验验证了HiFlow相较于基线方法的有效性。
cs.CL / 52 / 2603.05046

NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension

NeuronMoE:基于神经元引导的专家混合模型用于高效的多语言大语言模型扩展
Li, Rongzhi, Yanaka, Hitomi
Abstract
Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose $\textbf{NeuronMoE}$, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.
Chinese Translation
将大型语言模型扩展到低资源语言对于全球可达性至关重要,但为每种语言训练单独的模型成本过高。专家混合模型(Mixture-of-Experts, MoE)架构通过增加稀疏的语言特定参数来解决这一问题,但每层需要多少专家仍然是一个未解的问题。目前的方法基于层级相似性分配专家,而语言处理在单个神经元上表现出细粒度的专业化。我们提出了$\textbf{NeuronMoE}$,一种分析所有变换器组件中语言特定神经元的方法,以根据实证测量的跨语言神经元多样性指导每层的专家分配。应用于Llama-3.2-3B以处理低资源语言(希腊语、土耳其语和匈牙利语),该方法实现了约40%的平均参数减少,同时与LayerMoE基线的性能相匹配。我们发现低资源语言专家独立发展出与高资源语言相似的神经元专业化模式,这些模式集中在早期和晚期层。这揭示了多语言模型组织语言知识的潜在普遍架构原则。
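The neuron-guided allocation described above depends on first identifying language-specific neurons. A minimal sketch of one common identification heuristic (contrasting activation rates across languages); the function name, the threshold, and the toy data are illustrative and need not match the paper's exact criterion:

```python
import numpy as np

def language_specific_neurons(acts_by_lang, threshold=0.2):
    """Flag neurons whose firing rate in one language exceeds its rate in
    every other language by `threshold` (a neuron "fires" when value > 0)."""
    rates = {lang: (a > 0).mean(axis=0) for lang, a in acts_by_lang.items()}
    specific = {}
    for lang, r in rates.items():
        others = np.max([v for l, v in rates.items() if l != lang], axis=0)
        specific[lang] = np.where(r - others > threshold)[0]
    return specific

# Toy activations: 200 tokens x 8 neurons per language, shifted so that
# the "el" stream fires noticeably more often than the "tr" stream.
rng = np.random.default_rng(0)
acts = {
    "el": rng.normal(0.5, 1.0, (200, 8)),
    "tr": rng.normal(-0.5, 1.0, (200, 8)),
}
per_lang = language_specific_neurons(acts)
print({k: v.tolist() for k, v in per_lang.items()})
```

Per-layer counts of such neurons could then guide how many experts each layer receives.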
cs.CL / 53 / 2603.05057

MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection

MUTEX:利用多语言变换器和条件随机场增强乌尔都语有害文本片段检测
Arshad, Inayat, Saleem, Fajar, Hussain, Ijaz
Abstract
Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within those texts. The problem is further exacerbated by multiple factors: the lack of token-level annotated resources, the linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variation. In this research, we propose MUTEX, a framework for Urdu toxic span detection that combines a multilingual transformer with conditional random fields (CRF) and uses a manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM-RoBERTa with a CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews, using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves a 60% token-level F1 score, which constitutes the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective than other models at implicitly capturing contextual toxicity and at addressing code-switching and morphological variation.
Chinese Translation
乌尔都语有害文本片段检测仍然受到限制,因为大多数现有系统依赖于句子级分类,无法识别文本中的具体有害片段。这一问题因多种因素而加剧,包括缺乏标注的词汇级资源、乌尔都语的语言复杂性、频繁的代码切换、非正式表达以及丰富的形态变化。在本研究中,我们提出了MUTEX:一个结合了条件随机场(CRF)的多语言变换器,用于乌尔都语有害文本片段检测框架,该框架使用手动标注的词汇级有害文本片段数据集以提高性能和可解释性。MUTEX使用XLM RoBERTa与CRF层进行序列标注,并在从社交媒体、在线新闻和YouTube评论中提取的多领域数据上进行测试,使用词汇级F1分数来评估细粒度片段检测。结果表明,MUTEX实现了60%的词汇级F1分数,这是乌尔都语有害文本片段检测的第一个监督基线。进一步的研究表明,基于变换器的模型在隐式捕捉上下文有害性方面更为有效,并且能够比其他模型更好地解决代码切换和形态变化的问题。
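The token-level F1 used to score span detection above can be made concrete. A small self-contained sketch of the metric; the `TOX`/`O` label names and the `token_f1` helper are illustrative, not the authors' code:

```python
def token_f1(gold, pred, toxic="TOX"):
    """Token-level F1 over toxic tags: each token is scored independently,
    so partial span overlaps earn partial credit.

    gold/pred: lists of per-token labels, e.g. "TOX" vs "O".
    """
    tp = sum(g == toxic and p == toxic for g, p in zip(gold, pred))
    fp = sum(g != toxic and p == toxic for g, p in zip(gold, pred))
    fn = sum(g == toxic and p != toxic for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["O", "TOX", "TOX", "O", "O"]
pred = ["O", "TOX", "O", "O", "TOX"]
print(token_f1(gold, pred))  # tp=1, fp=1, fn=1 -> P = R = 0.5 -> F1 = 0.5
```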
cs.CL / 54 / 2603.05099

ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI

ARC-TGI:基于推理链模板的人类验证任务生成器用于ARC-AGI
Lehmann, Jens, Khushbakht, Syeda, Salehfard, Nikoo, Nishat, Nur A Zarin, Bhandiwad, Dhananjay, Aioanei, Andrei, Vahdati, Sahar
Abstract
The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.
Chinese Translation
抽象与推理语料库(ARC-AGI)探讨了在小型视觉网格上进行少量抽象和规则归纳,但由于过拟合、数据集泄漏和记忆化,静态手工编写谜题的进展难以衡量。我们引入了ARC-TGI(ARC任务生成器清单),这是一个开源框架,用于任务家族生成器:紧凑的Python程序,能够在保留潜在规则的同时抽样多样化的ARC-AGI任务。ARC-TGI围绕一个面向求解器的表示构建:每个生成的任务都与自然语言输入、转化推理链和部分评估的Python代码配对,这些代码实现了抽样、转化和情节构建。至关重要的是,ARC-TGI支持任务级约束,以便训练示例共同揭示推断潜在规则所需的变体,这是人类可解的ARC任务的要求,而独立的逐例抽样往往无法保证。所有生成器都经过人类的精炼和本地验证,以保持网格和推理痕迹在变动下自然且一致。我们发布了461个生成器,涵盖180个ARC-Mini任务、215个ARC-AGI-1任务(200个训练,15个测试)和66个ARC-AGI-2任务(55个训练,11个测试),从而实现可扩展的数据集抽样和受控基准测试。
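A task-family generator of the kind described can be sketched in a few lines: sample diverse grids while the transformation (the latent rule) stays fixed across all examples. The `mirror_task_generator` below is a hypothetical toy, far simpler than ARC-TGI's human-refined generators with reasoning-chain templates:

```python
import random

def mirror_task_generator(rng, n_train=3):
    """Toy ARC-style task family: grids vary, the latent rule (horizontal
    mirror) is preserved, and each episode pairs train examples with a test."""
    def sample_grid():
        h, w = rng.randint(2, 4), rng.randint(2, 4)
        return [[rng.randint(0, 9) for _ in range(w)] for _ in range(h)]

    def transform(grid):  # the latent rule shared by every sampled example
        return [row[::-1] for row in grid]

    examples = [(g, transform(g)) for g in (sample_grid() for _ in range(n_train))]
    test_input = sample_grid()
    return {"train": examples, "test": (test_input, transform(test_input))}

task = mirror_task_generator(random.Random(7))
inp, out = task["train"][0]
print(inp, "->", out)
```

ARC-TGI's generators additionally enforce task-level constraints so that the train examples jointly expose enough variation to infer the rule, which independent sampling like this does not guarantee.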
cs.CL / 55 / 2603.05121

Measuring the Redundancy of Decoder Layers in SpeechLLMs

测量语音大语言模型中解码器层的冗余性
Moumen, Adel, Sun, Guangzhi, Woodland, Philip C
Abstract
Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned, multi-task SpeechLLM backbone to be deployed.
Chinese Translation
语音大语言模型将语音编码器表示传递到一个解码器,该解码器通常占总参数的90%以上。我们研究了在语音任务中,实际需要多少解码器容量。在两个大语言模型家族和三个规模(1-8B)中,我们展示了解码器冗余主要源自预训练的大语言模型:文本和语音输入产生相似的冗余块。然后,我们通过修剪解码器层并分析修剪后的恢复情况来测量过剩容量,以提高模型的鲁棒性。我们的研究结果表明,7-8B模型在仅使用60%的解码器层的情况下仍能保持良好的自动语音识别(ASR)性能,而这一趋势在较小规模的模型中同样存在,但修剪容忍度降低。接着,我们将研究推广到语音翻译,显示出在语音编码器、任务和语言之间,相同的层块是冗余的,这表明存在更为广泛的冗余结构,使得可以部署一个经过修剪的、多任务的语音大语言模型骨干网络。
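One standard proxy for locating redundant decoder layers, shown here purely as an illustration (the paper's exact redundancy measure may differ), is to rank layers by the cosine similarity between their input and output hidden states; layers that barely change their input are pruning candidates:

```python
import numpy as np

def redundant_layers(hidden_states, keep_ratio=0.6):
    """Rank layers by how little they change their input.

    hidden_states: list of (tokens, dim) arrays, one per layer boundary,
    so len(hidden_states) - 1 layers. Returns indices of the layers to
    prune (most redundant first) to reach `keep_ratio` retained layers.
    """
    sims = []
    for i in range(len(hidden_states) - 1):
        a, b = hidden_states[i], hidden_states[i + 1]
        cos = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
        sims.append(cos.mean())
    n_prune = int(round((1 - keep_ratio) * len(sims)))
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:n_prune]

# Toy states for an 8-layer decoder; layer 4 barely transforms its input,
# so it should be flagged first.
rng = np.random.default_rng(1)
h = [rng.normal(size=(16, 32)) for _ in range(9)]
h[5] = h[4] + 0.01 * rng.normal(size=(16, 32))
print(redundant_layers(h))
```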
cs.CL / 56 / 2603.05134

LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

LBM:通过推理和行动的分层大型自动出价模型
Li, Yewen, Lyu, Zhiyi, Jiang, Peng, Cai, Qingpeng, Pan, Fei, An, Bo, Jiang, Peng
Abstract
The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating LBM-Think's hallucinations and enhancing decision-making performance without the simulation or real-world rollouts required by previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in training efficiency and generalization ability.
Chinese Translation
在线广告平台上广告拍卖规模的不断扩大加剧了竞争,使得手动出价变得不切实际,迫切需要自动出价来帮助广告主实现其经济目标。目前的自动出价方法已经发展为使用离线强化学习或生成方法来优化出价策略,但由于黑箱训练方式和数据集的有限模式覆盖,它们有时会表现出反直觉的行为,从而导致理解任务状态和在动态广告环境中的泛化面临挑战。大型语言模型(LLMs)通过利用先前的人类知识和推理能力,为提高自动出价性能提供了一个有前景的解决方案。然而,直接将LLMs应用于自动出价面临困难,因为在竞争性拍卖中需要精确的行动,并且缺乏专业的自动出价知识,这可能导致幻觉和次优决策。为了解决这些挑战,我们提出了一种分层大型自动出价模型(LBM),以利用LLMs的推理能力开发更优的自动出价策略。这包括一个用于推理的高层LBM-Think模型和一个用于行动生成的低层LBM-Act模型。具体而言,我们提出了一种双嵌入机制,以高效融合语言和数值输入两种模态,用于LBM-Act的语言指导训练;然后,我们提出了一种称为GQPO的离线强化微调技术,以减轻LBM-Think的幻觉并增强决策性能,而无需像之前的多轮基于LLM的方法那样进行模拟或现实世界的推广。实验表明,基于我们的LBM的生成骨干在高效训练方式和泛化能力方面具有优越性。
cs.CL / 57 / 2603.05136

Representation Fidelity: Auditing Algorithmic Decisions About Humans Using Self-Descriptions

表示忠实度:通过自我描述审计关于人类的算法决策
Elstner, Theresa, Potthast, Martin
Abstract
This paper introduces a new dimension for validating algorithmic decisions about humans by measuring the fidelity of their representations. Representation Fidelity measures whether decisions about a person rest on reasonable grounds. We propose to operationalize this notion by measuring the distance between two representations of the same person: (1) an externally prescribed input representation on which the decision is based, and (2) a self-description provided by the human subject of the decision, used solely to validate the input representation. We examine the nature of discrepancies between these representations and how such discrepancies can be quantified, and derive a generic typology of representation mismatches that determine the degree of representation fidelity. We further present the first benchmark for evaluating representation fidelity based on a dataset of loan-granting decisions. Our Loan-Granting Self-Representations Corpus 2025 comprises 30,000 synthetic natural language self-descriptions derived from corresponding representations of applicants in the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.
Chinese Translation
本文引入了一个新的维度,通过测量表示的忠实度来验证关于人类的算法决策。表示忠实度衡量的是关于一个人的决策是否基于合理的依据。我们建议通过测量同一人两种表示之间的距离来操作化这一概念:(1)决策所依据的外部规定输入表示,以及(2)由决策的人类主体提供的自我描述,仅用于验证输入表示。我们考察了这些表示之间差异的性质,这些差异如何被量化,并推导出一种通用的表示不匹配类型学,以确定表示忠实度的程度。我们进一步提出了基于贷款授予决策数据集的表示忠实度评估的首个基准。我们的贷款授予自我表示语料库2025包含30,000个合成自然语言自我描述,这些自我描述源自德国信用数据集中申请人的相应表示,并附有专家对每对表示之间的不匹配的注释。
cs.CL / 58 / 2603.05143

Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers

特征相似性:对变压器中类比推理的理论理解
Xu, Ruichen, Yan, Wenjing, Zhang, Ying-Jun Angela
Abstract
Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
Chinese Translation
理解大型语言模型中的推理受到多种推理类型混淆的评估的影响。我们将类比推理(基于已知相似性推断实体之间的共享属性)进行隔离,并分析其在变压器中的出现。我们理论上证明了三个关键结果:(1)在相似性和归因前提上进行联合训练,通过对齐的表示实现类比推理;(2)顺序训练仅在相似性结构在特定属性之前被学习时成功,揭示了必要的课程;(3)两步推理($a \to b, b \to c \implies a \to c$)简化为具有身份桥接的类比推理($b = b$),这些桥接必须在训练数据中明确出现。这些结果揭示了一个统一的机制:变压器将具有相似属性的实体编码为相似的表示,通过特征对齐实现属性转移。对高达15亿参数的架构进行的实验验证了我们的理论,并展示了表示几何如何塑造归纳推理能力。
cs.CL / 59 / 2603.05167

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

C2-Faith:链式思维推理中因果性和覆盖性忠实度的LLM评估基准
Mittal, Avni, Arike, Rauno
Abstract
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
Chinese Translation
大型语言模型(LLMs)越来越多地被用作链式思维(CoT)推理的评估者,但尚不清楚它们是否能够可靠地评估过程忠实度,而不仅仅是答案的合理性。我们引入了C2-Faith,这是一个基于PRM800K构建的基准,旨在针对忠实度的两个互补维度:因果性(每一步是否逻辑上跟随于先前的上下文?)和覆盖性(是否存在必要的中间推理?)。通过控制扰动,我们创建了已知因果错误位置的示例,方法是用一个非因果变体替换单个步骤,并在不同的删除率下进行控制覆盖删除(与参考标签进行评分)。我们在三个任务下评估了三个前沿评估者:二元因果检测、因果步骤定位和覆盖评分。结果表明,模型排名在很大程度上依赖于任务框架,没有单一评估者在所有设置中占据主导地位;所有评估者在检测错误和定位错误之间存在显著差距;而对于不完整推理的覆盖判断则系统性地被夸大。这些发现阐明了LLM评估者何时可靠以及何时失效,并为在过程级评估中选择评估者提供了实用指导。
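The controlled causal perturbation described above is easy to sketch: swap one step of a reasoning chain for an acausal step drawn from an unrelated problem, and keep the error position as the gold label for localization. This is an illustrative toy, not the benchmark's PRM800K pipeline:

```python
import random

def make_causal_perturbation(steps, donor_steps, rng):
    """Replace one step with an acausal step from another problem and
    record the known error position for judge evaluation."""
    pos = rng.randrange(len(steps))
    perturbed = list(steps)
    perturbed[pos] = rng.choice(donor_steps)
    return perturbed, pos

chain = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 7."]
donor = ["The triangle has area 10.", "Hence the radius is 5."]
perturbed, pos = make_causal_perturbation(chain, donor, random.Random(0))
print(pos, perturbed[pos])
```

A judge is then scored both on detecting that the chain contains an error (binary) and on localizing it to `pos`.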
cs.CL / 60 / 2603.05168

Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

稀疏比特网络:1.58位大型语言模型自然适应半结构稀疏性
Zhang, Di, Wu, Xun, Huang, Shaohan, Wang, Yudong, Shao, Hanyong, Hao, Yingbo, Chi, Zewen, Dong, Li, Song, Ting, Xia, Yan, Sui, Zhifang, Wei, Furu
Abstract
Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
Chinese Translation
半结构N:M稀疏性和低比特量化(例如,1.58位比特网络)是提高大型语言模型(LLMs)效率的两种有前景的方法,但它们在很大程度上是孤立研究的。在本研究中,我们探讨了它们之间的相互作用,并表明1.58位比特网络与N:M稀疏性相比全精度模型更具兼容性。为了研究这一效应,我们提出了稀疏比特网络,这是一个统一框架,首次联合应用1.58位量化和动态N:M稀疏化,同时确保稳定的训练。在多个模型规模和训练模式(稀疏预训练和密集到稀疏的调度)下,1.58位比特网络在相同稀疏水平下始终表现出比全精度基线更小的性能下降,并且能够在准确性崩溃之前容忍更高的结构稀疏性。此外,使用我们定制的稀疏张量核心,稀疏比特网络在训练和推理中都实现了显著的加速,达到1.30倍。这些结果强调了将极低比特量化与半结构N:M稀疏性结合起来是实现高效大型语言模型的一个有前景的方向。代码可在 https://github.com/AAzdi/Sparse-BitNet 获取。
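The two compressions being combined can each be sketched in numpy: a 2:4 semi-structured mask and a ternary ({-1, 0, +1}) quantizer in the style of BitNet's absmean recipe. This is an illustrative inference-time sketch under those assumptions, not the paper's joint training procedure:

```python
import numpy as np

def two_four_sparsify(w):
    """Zero the 2 smallest-magnitude weights in every group of 4 (2:4 pattern)."""
    out = w.reshape(-1, 4).copy()
    idx = np.argsort(np.abs(out), axis=1)[:, :2]  # 2 smallest per group
    np.put_along_axis(out, idx, 0.0, axis=1)
    return out.reshape(w.shape)

def ternary_quantize(w):
    """1.58-bit quantization: weights in {-1, 0, +1} times one scale,
    using a simplified absmean scaling."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

w = np.array([[0.9, -0.1, 0.05, -1.2, 0.4, 0.02, -0.7, 0.3]])
sparse = two_four_sparsify(w)
q, s = ternary_quantize(sparse)
print(sparse)
print(q * s)  # dequantized sparse-ternary weights
```

Small weights are the ones both steps discard, which is one intuition for why ternary weights tolerate N:M masking well.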
cs.CL / 61 / 2603.05171

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

中国司法裁决中法律论证结构的注释与可视化指南
Chen, Kun, Liao, Xianglei, Fei, Kaixue, Xing, Yi, Li, Xinrui
Abstract
This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.
Chinese Translation
本指南提出了一种系统化和可操作的注释框架,用于表示司法裁决中法律论证的结构。该框架基于法律推理和论证理论,旨在揭示司法推理的逻辑组织,并为计算分析提供可靠的数据基础。在命题层面,指南区分了四种类型的命题:一般规范性命题、特定规范性命题、一般事实性命题和特定事实性命题。在关系层面,定义了五种类型的关系以捕捉论证结构:支持、攻击、联合、匹配和同一。这些关系代表了正向和负向的论证连接、联结推理结构、法律规范与案件事实之间的对应关系,以及命题之间的语义等价。指南进一步规定了基本结构和嵌套结构的形式表示规则和可视化约定,使复杂论证模式的图形表示保持一致。此外,建立了标准化的注释工作流程和一致性控制机制,以确保注释数据的可重复性和可靠性。通过提供清晰的概念模型、形式表示规则和实用的注释程序,本指南为大规模的司法推理分析以及未来在法律论证挖掘、法律推理的计算建模和人工智能辅助法律分析方面的研究提供了方法论支持。
cs.CL / 62 / 2603.05193

Transducing Language Models

语言模型的转导
Snæbjarnarson, Vésteinn, Kiegeland, Samuel, Liu, Tianyu, Boumasmoud, Reda, Cotterell, Ryan, Vieira, Tim
Abstract
Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
Chinese Translation
现代语言模型定义了字符串的分布,但下游任务通常需要不同的输出格式。例如,生成字节对字符串的模型并不能直接产生词级预测,而DNA模型也不能直接生成氨基酸序列。在这种情况下,确定性的字符串到字符串的转换可以将模型的输出转换为所需的形式。这在概率论中是一个熟悉的模式:对随机变量 $X \sim p$ 应用函数 $f$ 会产生一个经过变换的随机变量 $f(X)$,其分布为诱导分布。虽然在语言建模中偶尔会使用这种变换,但之前的研究并没有将其视为产生新的、完全功能的语言模型。我们对此观点进行了形式化,并引入了一个基于确定性字符串到字符串变换的语言模型的一般框架。我们关注可表示为有限状态转导器(FST)的变换——这是一种常用的状态机抽象,用于高效的字符串到字符串映射。我们开发了算法,将语言模型与FST组合,以*边缘化*映射到给定目标的源字符串,通过转导器传播概率而不改变模型参数,从而实现对变换输出的*条件化*。我们提出了一种精确算法、一种高效近似算法以及理论分析。我们在三个领域进行了实验:将语言模型从标记转换为字节,从标记转换为词,以及从DNA转换为氨基酸。这些实验展示了预训练语言模型在推理时适应特定应用输出要求的能力。
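The pushforward construction at the heart of the paper (applying $f$ to $X \sim p$ and summing probability over preimages) can be shown on a toy finite language model. The paper's contribution is performing this marginalization through an FST without enumerating strings; the enumeration below is only for illustration:

```python
from collections import defaultdict

def pushforward(lm, f):
    """Induced distribution of f(X) for X ~ lm: p'(y) = sum over {x : f(x) = y} of p(x).

    lm: dict mapping each string in a finite support to its probability.
    f:  deterministic string-to-string transformation.
    """
    out = defaultdict(float)
    for x, p in lm.items():
        out[f(x)] += p
    return dict(out)

# Toy source distribution; lowercasing is a many-to-one map, so several
# source strings collapse onto one target and their probabilities add up.
lm = {"Ab": 0.5, "aB": 0.2, "ab": 0.2, "cd": 0.1}
lowered = pushforward(lm, str.lower)
print(lowered)
```

Conditioning on a transformed output `y` then amounts to renormalizing `lm` over the preimage set of `y`.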
cs.CL / 63 / 2603.05197

Diffusion LLMs can think EoS-by-EoS

扩散大语言模型可以逐个结束符(EoS)进行思考
Breckner, Sarah, Schuster, Sebastian
Abstract
Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs' reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.
Chinese Translation
扩散大语言模型(Diffusion LLMs)被提出作为自回归大语言模型(autoregressive LLMs)的替代方案,特别在处理具有相互依赖子目标的复杂推理任务时表现出色。有趣的是,当生成长度,即模型需要输出的标记数量,设置为远高于完成任务所需的正确答案时,尤其如此,此时模型会用结束符(EoS)标记填充其答案。我们假设扩散模型是逐个结束符(EoS)进行思考的,即它们将结束符标记的表示作为一个隐藏的草稿板,从而使其能够解决更困难的推理问题。我们在加法、实体追踪和数独任务上实验了扩散模型LLaDA1.5、LLaDA2.0-mini和Dream-v0。在一个受控的提示实验中,我们确认添加结束符标记提高了大语言模型的推理能力。为了进一步验证它们是否作为隐藏计算的空间,我们将结束符标记的隐藏状态替换为反事实生成的隐藏状态,这常常使生成的输出变为反事实结果。因果干预的成功强调了结束符标记虽然可能被认为是无意义的,但实际上承载了解决问题的信息。行为实验和因果干预表明,扩散大语言模型确实可以逐个结束符(EoS)进行思考。
cs.CL / 64 / 2603.05198

Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic

将形式逻辑提炼为神经空间:一种用于信号时序逻辑的核对齐方法
Candussio, Sara, Sarti, Gabriele, Saveri, Gaia, Bortolussi, Luca
Abstract
We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels -- which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible -- or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel's logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.
Chinese Translation
我们提出了一种框架,通过将形式规范的语义几何提炼到潜在空间中,学习其连续的神经表示。现有的方法要么依赖于符号核——这种方法保留了行为语义,但计算上代价高昂、依赖锚点且不可逆——要么依赖于基于语法的神经嵌入,这种方法未能捕捉潜在结构。我们的方法弥补了这一空白:通过教师-学生设置,我们将符号鲁棒性核提炼到Transformer编码器中。与标准对比方法不同,我们使用连续的、核加权的几何对齐目标来监督模型,该目标根据语义差异惩罚错误。一旦训练完成,编码器可以在单次前向传递中生成嵌入,有效地模拟核的逻辑,同时显著降低计算成本。我们将该框架应用于信号时序逻辑(STL),证明所得到的神经表示忠实地保留了STL公式的语义相似性,准确预测鲁棒性和约束满足,并且本质上是可逆的。我们提出的方法使得高效、可扩展的神经-符号推理和公式重构成为可能,而无需在运行时重复计算核。
cs.CL / 65 / 2603.05210

Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

在词汇修剪中平衡覆盖率和草稿延迟以加速推测解码
Shoham, Ofir Ben
Abstract
Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
Chinese Translation
推测解码通过使用轻量级草稿模型提出候选标记,从而加速大型语言模型的推理,这些候选标记由更大的目标模型并行验证。先前的研究表明,草稿模型通常主导推测解码的延迟,因为它按顺序生成标记,并且随着词汇量的增加,其语言建模头的成本也随之增加。这暴露了草稿模型设计中的一个基本权衡:较大的词汇量提高了标记覆盖率和与目标模型的一致性,但会导致更高的草稿延迟,而较小的词汇量则在减少延迟的同时,可能会错过准确生成草稿所需的标记。我们通过对草稿模型进行词汇修剪来解决这一权衡,动机是观察到特定领域的工作负载仅使用了完整词汇的一小部分。我们将草稿词汇选择视为一个约束优化问题,以平衡标记覆盖率和草稿延迟。覆盖率是基于训练数据中的助手响应计算的,而延迟则使用架构感知的FLOPs进行估算,这些FLOPs捕捉了语言建模头的成本与词汇量的关系。我们使用树结构的Parzen估计器优化一个效用函数,以有效探索在最低覆盖率约束下的覆盖率-延迟帕累托前沿。实验结果表明,在高覆盖率的情况下,推测解码的吞吐量得到改善,同时草稿词汇量最多减少97%。在特定领域任务中,我们实现了高达16%的延迟减少和20%的吞吐量提升,并在多样的分布外任务中获得了高达6.7%的吞吐量增益。
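The coverage/latency trade-off can be illustrated with a greedy baseline: keep the most frequent draft tokens until a coverage constraint is met, and treat LM-head cost as linear in vocabulary size. The paper instead searches the Pareto frontier with a Tree-structured Parzen Estimator; the `trim_vocabulary` helper here is a hypothetical simplification:

```python
from collections import Counter

def trim_vocabulary(token_counts, full_vocab_size, min_coverage=0.95):
    """Keep the most frequent tokens until `min_coverage` of training
    tokens is covered; report coverage and relative LM-head cost
    (head FLOPs scale linearly with vocabulary size)."""
    total = sum(token_counts.values())
    kept, covered = [], 0
    for tok, cnt in token_counts.most_common():
        kept.append(tok)
        covered += cnt
        if covered / total >= min_coverage:
            break
    coverage = covered / total
    latency_ratio = len(kept) / full_vocab_size  # relative head cost
    return kept, coverage, latency_ratio

# Toy domain-specific token counts against a 50k full vocabulary.
counts = Counter({"the": 50, "a": 30, "of": 10, "zyx": 5, "qqq": 5})
kept, cov, lat = trim_vocabulary(counts, full_vocab_size=50_000, min_coverage=0.9)
print(kept, cov, lat)
```

The greedy rule yields the smallest vocabulary meeting the constraint; the optimization in the paper additionally trades coverage against architecture-aware FLOPs rather than fixing a single coverage floor.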
cs.CL / 66 / 2603.05262

VietJobs: A Vietnamese Job Advertisement Dataset

VietJobs:越南招聘广告数据集
Dinh, Hieu Pham, Huy, Hung Nguyen, El-Haj, Mo
Abstract
VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
Chinese Translation
VietJobs 是首个大规模、公开可用的越南招聘广告语料库,包含 48,092 条招聘信息和超过 1500 万个单词,数据来源于越南所有 34 个省市。该数据集提供了丰富的语言和结构化信息,包括职位名称、类别、薪资、技能和就业条件,涵盖 16 个职业领域和多种就业类型(全职、兼职和实习)。VietJobs 旨在支持自然语言处理和劳动市场分析的研究,捕捉了显著的语言、区域和社会经济多样性。我们在两个核心任务上对多个生成性大语言模型(LLMs)进行了基准测试:职位类别分类和薪资估算。经过指令调优的模型,如 Qwen2.5-7B-Instruct 和 Llama-SEA-LION-v3-8B-IT,在少量样本和微调设置下表现出显著的提升,同时突显了在多语言和越南特定建模方面面临的挑战,以进行结构化劳动市场预测。VietJobs 为越南自然语言处理建立了新的基准,并为未来关于招聘语言、社会经济表现和人工智能驱动的劳动市场分析的研究提供了宝贵的基础。所有代码和资源均可在以下地址获取:https://github.com/VinNLP/VietJobs。
cs.CL / 67 / 2603.05272

Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh

从口头到网络:孟加拉国‘零资源’语言的数字化
Rashid, Mohammad Mamun Or
Abstract
We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
Chinese Translation
我们提出了多语言云语料库,这是孟加拉国民族和土著语言的首个国家级、平行、多模态语言数据集。尽管孟加拉国拥有大约40种少数民族语言,涵盖四个语言家族,但对于这些主要为口头表达、计算上被视为‘零资源’的语言变体,孟加拉国一直缺乏系统的跨语言家族数字语料库,其中14种被归类为濒危语言。我们的语料库包含85792个结构化文本条目,每个条目包含一段孟加拉语刺激文本、一份英文翻译和一份国际音标(IPA)转录,以及大约107小时的转录音频记录,涵盖来自藏缅语系、印欧语系、南亚语系和达罗毗荼语系的42种语言变体,以及两种谱系未分类的语言。这些数据通过在孟加拉国九个地区进行为期90天的系统田野调查收集,涉及16名数据收集者、77名发音人和43名验证者,遵循预先定义的2224个独特项目的引导模板,按三种语言粒度水平组织:孤立词汇项(475个词,涵盖22个语义领域)、语法结构(887个句子,涵盖21个类别,包括动词变位范式)和定向言语(862个提示,涵盖46个对话场景)。田野调查后的处理包括由10名语言学家进行的IPA转录,以及6名审稿人的独立审查。完整数据集通过多语言云平台(multiling.cloud)公开访问,提供对所有记录变体的注释音频和文本数据的可搜索访问。我们描述了语料库的设计、田野调查方法、数据集结构和每种语言的覆盖情况,并讨论了对濒危语言记录、低资源自然语言处理(NLP)以及语言多样化的发展中国家中的数字保存的影响。
cs.CL / 68 / 2603.05308

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1:用于零样本且可扩展的生物医学证据归属的小型语言模型
Jin, Qiao, Fang, Yin, He, Lauren, Yang, Yifan, Xiong, Guangzhi, Wang, Zhizheng, Wan, Nicholas, Chan, Joey, Comeau, Donald C., Leaman, Robert, Floudas, Charalampos S., Zhang, Aidong, Chiang, Michael F., Peng, Yifan, Lu, Zhiyong
Abstract
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
Chinese Translation
评估一篇文章是否支持某一主张对于幻觉检测和主张验证至关重要。尽管大型语言模型(LLMs)有潜力自动化这一任务,但要实现强大的性能需要如GPT-5等前沿模型,这在大规模部署时成本过高。为了高效地进行生物医学证据归属,我们提出了Med-V1,一个仅包含三十亿参数的小型语言模型系列。Med-V1在本研究中新开发的高质量合成数据上进行训练,在统一为验证格式的五个生物医学基准测试中,Med-V1的表现显著优于其基础模型(提高了27.0%至71.3%)。尽管体积较小,Med-V1的表现与如GPT-5等前沿LLMs相当,并为其预测提供高质量的解释。我们使用Med-V1进行了一项首创的用例研究,量化了在不同引用指令下LLM生成答案中的幻觉现象。结果表明,格式指令对引用有效性和幻觉有显著影响,GPT-5生成了更多的主张,但其幻觉率与GPT-4o相似。此外,我们还展示了第二个用例,表明Med-V1可以自动识别临床实践指南中的高风险证据误归属,揭示了可能对公共健康产生负面影响的情况,这在大规模识别时通常具有挑战性。总体而言,Med-V1为生物医学证据归属和验证任务提供了一种高效且准确的轻量级替代方案,适用于实际和现实世界的应用。Med-V1可在https://github.com/ncbi-nlp/Med-V1获取。
cs.CL / 69 / 2603.05314

PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

PersianPunc:一个大规模数据集与基于BERT的波斯语标点恢复方法
Kalahroodi, Mohammad Javad Ranjbar, Faili, Heshaam, Shakery, Azadeh
Abstract
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
Chinese Translation
标点恢复对于提高自动语音识别(ASR)输出的可读性和下游实用性至关重要,但尽管其重要性,波斯语的相关研究仍然较少。我们介绍了PersianPunc,这是一个包含1700万个样本的大规模高质量波斯语标点恢复数据集,通过对现有文本资源的系统聚合和过滤构建而成。我们将标点恢复任务表述为一个基于标记的序列标注任务,并对ParsBERT进行了微调,以实现良好的性能。通过比较评估,我们展示了虽然大型语言模型可以执行标点恢复,但它们存在严重的局限性:过度修正倾向导致在标点插入之外引入不必要的编辑(这对语音转文本管道尤其成问题),并且计算需求显著更高。我们轻量级的基于BERT的方法在我们的测试集上达到了91.33%的宏平均F1分数,同时保持了适合实时应用的效率。我们将数据集(https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration)和模型(https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation)公开,以促进未来在波斯语自然语言处理领域的研究,并提供一个适用于其他形态丰富、资源稀缺语言的可扩展框架。
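The token-level sequence-labeling formulation in this abstract can be sketched in a few lines. The label set and function name below are illustrative only, not the paper's actual tagset or API:

```python
def restore_punctuation(tokens, labels):
    """Punctuation restoration as token-level sequence labeling: each
    token carries a label naming the mark to append after it
    ("O" means no punctuation follows the token)."""
    marks = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}
    return " ".join(tok + marks[lab] for tok, lab in zip(tokens, labels))
```

A fine-tuned encoder such as ParsBERT would predict one such label per token, which avoids the over-correction risk of generative rewriting: the input words themselves are never edited.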
cs.CL / 70 / 2603.05345

A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes

支持民主参与过程的多语言人类注释语料库:原始文本与易读文本
Bott, Stefan, Riegler, Verena, Saggion, Horacio, Alcaina, Almudena Rascón, Khallaf, Nouran
Abstract
Being able to understand information is a key factor for a self-determined life and society. It is also very important for participating in democratic processes. The study of automatic text simplification is often limited by the availability of high-quality material for the training and evaluation of automatic simplifiers. This is true for English, but more so for less-resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts for these 3 languages, with high-quality simplifications produced by human experts in text simplification. It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to E2R level. The corpus is particularly valuable because it includes the first annotated corpus of its kind for the Catalan language. It also represents a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpus will be made freely accessible to the public.
Chinese Translation
理解信息的能力是自我决定的生活和社会的关键因素,对于参与民主过程也至关重要。自动文本简化的研究往往受限于可用于训练和评估自动简化器的高质量材料的匮乏。这在英语中已是如此,对于西班牙语、加泰罗尼亚语和意大利语等资源较少的语言更是如此。为了填补这一空白,我们提出了一个针对这三种语言的原始文本语料库,包含由文本简化领域的人类专家制作的高质量简化文本。该语料库是在iDEM项目中开发的,旨在评估易读语言(Easy-to-Read, E2R)对民主参与的影响。原始文本来自与该主题相关的领域。该语料库包括不同类型的文本,选择标准基于相关性、版权可用性和伦理标准。所有文本均简化至E2R水平。该语料库特别有价值,因为它包含了加泰罗尼亚语首个此类注释语料库。此外,它也为西班牙语和意大利语提供了显著的贡献,提供了在这些领域中罕见的高质量人类注释语言资源。该语料库将免费向公众开放。
cs.CL / 71 / 2603.05354

Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR

探索模型合并在自动语音识别多领域适应中的潜力与局限性
Carvalho, Carlos, Teixeira, Francisco, Rolland, Thomas, Abad, Alberto
Abstract
Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.
Chinese Translation
模型合并是一种可扩展的多任务训练替代方案,它将多个专门模型的能力整合为一个单一模型。这对于大型语音基础模型尤其具有吸引力,因为这些模型通常通过领域特定的微调进行适应,导致多个定制的检查点,而在新数据可用时重复进行全面微调在计算上是不可行的。在本研究中,我们研究了多领域自动语音识别(ASR)的模型合并,并对10个欧洲葡萄牙语领域的11种合并算法进行了基准测试,评估了领域内准确性、在分布转移下的鲁棒性,以及英语和多语言性能。我们进一步提出了BoostedTSV-M,这是一种基于TSV-M的新合并算法,通过奇异值增强来缓解秩崩溃,并提高数值稳定性。总体而言,我们的方法在欧洲葡萄牙语上优于全面微调,同时在单一模型中保持了对分布外数据的泛化能力。
cs.CL / 72 / 2603.05357

DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

DiSCTT:用于推理中高效测试时适应的共识引导自我课程框架
Moradi, Mohammad Mahdi, Mudur, Sudhir
Abstract
Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
Chinese Translation
测试时适应为在无额外监督的情况下提高大型语言模型的推理性能提供了一条有前景的途径,但现有方法通常对所有输入应用统一的优化目标,导致在异质推理问题上的适应效率低下或不稳定。我们提出了DiSCTT,一个难度感知、共识引导的自我课程框架,它基于从采样推理轨迹之间的一致性估计出的实例级认知不确定性,动态分配测试时优化策略。高共识的输入通过以多数一致的解答作为伪标签进行监督微调来加以巩固,而低共识的输入则通过带有共识正则化目标的强化学习进行优化,该目标鼓励在相关性约束下的多样性。在广泛的数学和一般推理基准测试中,DiSCTT始终优于强大的测试时适应基线,以更高的准确率、更低的方差以及显著更少的计算量和实际训练时间取得成果。这些结果表明,明确考虑实例的难度和不确定性能够为推理模型实现更稳定、高效和有效的测试时适应。
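The consensus-based routing described in this abstract can be sketched as follows. The threshold, dictionary keys, and function name are our illustrative choices, not the paper's interface:

```python
from collections import Counter

def route_by_consensus(sampled_answers, threshold=0.7):
    """Route one input to a test-time optimization strategy based on
    agreement among its sampled reasoning trajectories (a proxy for
    instance-level epistemic uncertainty)."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    consensus = votes / len(sampled_answers)
    if consensus >= threshold:
        # High consensus: consolidate via supervised fine-tuning with
        # the majority-agreed answer as a pseudo-label.
        return {"strategy": "sft", "pseudo_label": answer, "consensus": consensus}
    # Low consensus: optimize with consensus-regularized RL instead.
    return {"strategy": "rl", "pseudo_label": None, "consensus": consensus}
```

The point of the routing is that cheap SFT handles the inputs the model already nearly agrees on, reserving the more expensive RL objective for genuinely uncertain instances.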
cs.CL / 73 / 2603.05369

Progressive Residual Warmup for Language Model Pretraining

语言模型预训练的渐进残差预热
Chen, Tianhao, Xu, Xin, Yin, Lu, Chen, Hao, Wang, Yang, Diao, Shizhe, Yang, Can
Abstract
Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
Chinese Translation
变换器架构作为大多数现代大型语言模型的基础,因此其预训练的稳定性和收敛速度是核心关注点。基于顺序堆叠层之间的逻辑依赖性,我们提出了用于语言模型预训练的渐进残差预热(Progressive Residual Warmup, ProRes)。ProRes 实现了“早层优先学习”的理念,通过将每一层的残差与一个从 0 逐渐升高到 1 的标量相乘,深层次的层需要更长的预热步骤。通过这种方式,深层次的层在贡献学习之前,等待早层进入更稳定的状态。我们通过在不同模型规模以及归一化和初始化方案下的预训练实验,展示了 ProRes 的有效性。全面的分析表明,ProRes 不仅稳定了预训练过程,还引入了独特的优化轨迹,从而实现更快的收敛、更强的泛化能力和更好的下游性能。我们的代码可在 https://github.com/dandingsky/ProRes 获取。
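The "early layer learns first" schedule in this abstract can be sketched as a per-layer residual multiplier. The linear ramp and the step counts below are our assumptions for illustration; the paper does not specify these exact hyperparameters here:

```python
def residual_scale(step, layer_idx, base_warmup=1000, per_layer=500):
    """Warm the residual multiplier linearly from 0 to 1; deeper layers
    get longer warmup windows, so they start contributing only after
    earlier layers have settled into a more stable regime."""
    warmup_steps = base_warmup + layer_idx * per_layer
    return min(1.0, step / warmup_steps)

# Inside a Transformer block, the residual branch would be gated as:
#     x = x + residual_scale(step, layer_idx) * sublayer(x)
```

At any fixed step, a deeper layer has a smaller (or equal) multiplier than a shallower one, which is what enforces the sequential settling order.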
cs.CL / 74 / 2603.05400

An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs

基于探索-分析-消歧推理框架的低参数大型语言模型词义消歧研究
Sumanathilaka, Deshan, Micallef, Nicholas, Hough, Julian
Abstract
Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can'' dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.
Chinese Translation
词义消歧(WSD)仍然是自然语言处理(NLP)中的一项关键挑战,尤其是在处理稀有或特定领域的词义时,这些词义常常被误解。尽管现代高参数的大型语言模型(LLMs),如GPT-4-Turbo,已显示出最先进的WSD性能,但其计算和能源需求限制了可扩展性。本研究探讨了低参数LLMs(<4B参数)是否可以通过强调推理驱动的词义识别的微调策略实现可比的结果。我们使用增强了半自动、富有推理的注释的FEWS数据集,对八个小规模开源LLMs(如Gemma和Qwen)进行了微调。我们的结果显示,基于思维链(CoT)的推理结合邻近词分析在零样本设置中达到了与GPT-4-Turbo相当的性能。重要的是,Gemma-3-4B和Qwen-3-4B模型在FEWS上始终优于所有中参数基线和最先进模型,并且在未见词义上具有强大的泛化能力。此外,在未见的“Fool Me If You Can”数据集上的评估确认了强大的跨领域适应性,而无需特定任务的微调。这项工作表明,通过精心设计的以推理为中心的微调,低参数LLMs能够在显著降低计算和能源需求的同时提供准确的WSD。
cs.CL / 75 / 2603.05451

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

FlashAttention-4:面向不对称硬件扩展的算法与内核流水线协同设计
Zadouri, Ted, Hoehnerbach, Markus, Shah, Jay, Liu, Timmy, Thakkar, Vijay, Dao, Tri
Abstract
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
Chinese Translation
注意力机制作为普遍存在的Transformer架构的核心层,是大型语言模型和长上下文应用的瓶颈。虽然FlashAttention-3通过异步执行和warp专门化优化了Hopper GPU上的注意力机制,但其主要针对H100架构。人工智能行业迅速转向部署基于Blackwell的系统,如B200和GB200,这些系统由于不对称硬件扩展而表现出根本不同的性能特征:张量核心的吞吐量翻倍,而其他功能单元(共享内存带宽、指数单元)的扩展速度较慢或保持不变。我们开发了几种技术来应对Blackwell GPU上的这些变化瓶颈:(1)重新设计的流水线,利用完全异步的MMA操作和更大的切片尺寸;(2)软件模拟的指数和条件softmax重缩放,减少非矩阵乘法操作;(3)利用张量内存和2-CTA MMA模式,减少反向传播中的共享内存流量和原子加法。我们证明了我们的方法FlashAttention-4在B200 GPU上使用BF16时,相较于cuDNN 9.13实现了最高1.3倍的加速,相较于Triton实现了最高2.7倍的加速,达到1613 TFLOPs/s(71%的利用率)。除了算法创新外,我们完全使用嵌入在Python中的CuTe-DSL实现了FlashAttention-4,在保持完整表达能力的同时,编译时间比传统的基于C++模板的方法快20-30倍。
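The "conditional softmax rescaling" idea mentioned in this abstract can be illustrated on the softmax normalizer alone, outside any GPU kernel. This NumPy sketch is our simplification of the online-softmax bookkeeping, not the paper's kernel code:

```python
import numpy as np

def online_softmax_weights(scores, block=2):
    """Blockwise (online) softmax over a vector of attention scores.
    The running accumulators are rescaled only when a new block raises
    the running max, so blocks that do not change the max skip the
    rescaling (and its exponential) entirely."""
    m = -np.inf   # running max
    denom = 0.0   # running sum of exp(score - m)
    exps = []     # unnormalized weights accumulated so far
    for start in range(0, len(scores), block):
        chunk = scores[start:start + block]
        chunk_max = chunk.max()
        if chunk_max > m:
            scale = np.exp(m - chunk_max)  # 0.0 on the first block (m = -inf)
            denom *= scale
            exps = [e * scale for e in exps]
            m = chunk_max
        e = np.exp(chunk - m)
        exps.extend(e.tolist())
        denom += e.sum()
    return np.array(exps) / denom
```

Skipping the rescale when the max is unchanged matters on Blackwell precisely because exponential-unit throughput scales more slowly than tensor-core throughput.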
cs.CL / 76 / 2603.05459

DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

DEBISS:个体、半结构化和口语辩论语料库
de Souza, Klaywert Danillo Ferreira, Pereira, David Eduardo, Campelo, Cláudio E. C., Vasconcelos, Larissa Lucena
Abstract
The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features, annotated for a broad range of NLP tasks, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.
Chinese Translation
辩论过程在我们的日常生活中至关重要,无论是在学习、工作活动、简单的日常讨论、电视上的政治辩论,还是社交网络上的在线讨论。辩论的应用范围广泛。由于辩论的多样化应用、结构和格式,开发能够考虑这些变化的语料库可能具有挑战性,而在现有研究中辩论语料库的稀缺性也十分显著。因此,本研究提出了DEBISS语料库:一个包含具有半结构化特征的口语和个体辩论的集合。该语料库提供了广泛的自然语言处理任务注释,如语音转文本、说话人分离、论点挖掘和辩论者质量评估。
cs.CL / 77 / 2603.05462

NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

NCTB-QA:一个大规模孟加拉教育问答数据集及其基准性能评估
Eyasir, Abrar, Ahmed, Tahsin, Ibrahim, Muhammad
Abstract
Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
Chinese Translation
低资源语言的阅读理解系统在处理不可回答的问题时面临重大挑战。当上下文中缺乏正确答案时,这些系统往往会产生不可靠的响应。为了解决这个问题,我们引入了NCTB-QA,一个大规模的孟加拉问答数据集,包含从孟加拉国国家课程和教材委员会出版的50本教科书中提取的87,805对问答。与现有的孟加拉数据集不同,NCTB-QA保持了可回答问题(57.25%)和不可回答问题(42.75%)的平衡分布。NCTB-QA还包括对抗性设计的实例,其中包含看似合理的干扰项。我们对三种基于变换器的模型(BERT、RoBERTa、ELECTRA)进行了基准测试,并通过微调展示了显著的改进。BERT在F1分数上实现了313%的相对提升(从0.150提高到0.620)。通过BERTScore测量的语义答案质量在所有模型中也显著提高。我们的结果确立了NCTB-QA作为孟加拉教育问答的一个具有挑战性的基准。本研究表明,领域特定的微调对于在低资源环境中实现稳健的性能至关重要。
cs.CL / 78 / 2603.05471

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

利用大语言模型的参数知识进行无检索事实检查
Vazhentsev, Artem, Marina, Maria, Moskovskiy, Daniil, Pletenev, Sergey, Seleznyov, Mikhail, Salnikov, Mikhail, Tutubalina, Elena, Konovalov, Vasily, Nikishina, Irina, Panchenko, Alexander, Moskvoretskii, Viktor
Abstract
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the models intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
Chinese Translation
可信度是基于大语言模型(LLMs)构建的代理人工智能系统的核心研究挑战。为了增强信任,来自多种来源的自然语言声明,包括人类撰写的文本、网络内容和模型输出,通常通过检索外部知识并使用LLM验证声明与检索证据的一致性来检查其真实性。因此,这些方法受到检索错误和外部数据可用性的限制,同时使模型内在的事实验证能力大多未被利用。我们提出了无检索事实检查的任务,专注于对任意自然语言声明的验证,而不依赖于其来源。为了研究这一设置,我们引入了一个全面的评估框架,重点关注泛化能力,测试对(i)长尾知识、(ii)声明来源变化、(iii)多语言性和(iv)长文本生成的鲁棒性。在9个数据集、18种方法和3个模型的实验中,我们发现基于logit的方法通常表现不如那些利用内部模型表示的方法。基于这一发现,我们引入了INTRA,一种利用内部表示之间交互的方法,达到了最先进的性能并具有强泛化能力。更广泛地说,我们的工作确立了无检索事实检查作为一个有前景的研究方向,可以补充基于检索的框架,提高可扩展性,并使这些系统在训练期间作为奖励信号或作为生成过程中的集成组件得以使用。
cs.CL / 79 / 2603.05488

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

推理剧场:将模型信念与思维链解耦
Boppana, Siddharth, Ma, Annabel, Loeffler, Max, Sarfati, Raphael, Bigelow, Eric, Geiger, Atticus, Lewis, Owen, Merullo, Jack
Abstract
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
Chinese Translation
我们提供了推理模型中表演性思维链(Chain-of-Thought, CoT)的证据,其中模型对其最终答案表现出强烈的信心,但仍继续生成标记而不透露其内部信念。我们的分析比较了激活探测、早期强制回答和CoT监控在两个大型模型(DeepSeek-R1 671B和GPT-OSS 120B)中的表现,并发现与任务难度相关的差异:模型的最终答案可以在CoT中远早于监控器能够做出判断的时间点就从激活中解码出来,特别是在简单的基于回忆的MMLU问题中。我们将此与在困难的多跳GPQA-Diamond问题中的真实推理进行对比。尽管如此,拐点(例如,回溯、'aha'时刻)几乎仅出现在探测显示出大幅信念变化的响应中,这表明这些行为反映了真正的不确定性,而非学习到的“推理剧场”。最后,探测引导的早期退出在MMLU上减少了多达80%的标记,在GPQA-Diamond上减少了30%,同时保持了类似的准确性,从而将注意力探测定位为检测表演性推理和实现自适应计算的有效工具。
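The probe-guided early-exit rule in this abstract reduces to a small stopping criterion. The threshold and patience parameters below are our illustrative assumptions; the paper's actual exit rule may differ:

```python
def probe_early_exit(probe_confidences, threshold=0.9, patience=3):
    """Stop CoT generation once an activation probe predicts the final
    answer with confidence >= threshold for `patience` consecutive
    tokens; returns the number of tokens actually generated (the full
    length if the probe never stabilizes)."""
    streak = 0
    for i, conf in enumerate(probe_confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return i + 1
    return len(probe_confidences)
```

Because the probe reads the model's internal belief directly, it can truncate "reasoning theater" tokens that a text-level CoT monitor would not yet flag as decided.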