Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi Wang, Ryan Julian, Danfei Xu, Yilun Du, Yevgen Chebotar, Scott Reed, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang
Abstract
State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This yields a more than 2x improvement in generalization to new tasks and environments over state-of-the-art VLAs in real-robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen-task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.