Daily Research Digest

arXiv Papers

2026-02-20
188 papers · 4 categories · 188 translated
Robotics (28 papers)
cs.RO / 1 / 2602.16744

ICP-Based Pallet Tracking for Unloading on Inclined Surfaces by Autonomous Forklifts

Kato, Takuro, Morisawa, Mitsuharu
Abstract
This paper proposes a control method for autonomous forklifts to unload pallets on inclined surfaces, enabling the fork to be withdrawn without dragging the pallets. The proposed method applies the Iterative Closest Point (ICP) algorithm to point clouds measured from the upper region of the pallet and thereby tracks the relative position and attitude angle difference between the pallet and the fork during the unloading operation in real-time. According to the tracking result, the fork is aligned parallel to the target surface. After the fork is aligned, it is possible to complete the unloading process by withdrawing the fork along the tilt, preventing any dragging of the pallet. The effectiveness of the proposed method is verified through dynamic simulations and experiments using a real forklift that replicate unloading operations onto the inclined bed of a truck.
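The core of the real-time tracking described above is the rigid alignment step of ICP. A minimal sketch of one point-to-point ICP iteration (brute-force nearest neighbours plus the Kabsch/SVD solve; illustrative only, not the authors' implementation):

```python
import numpy as np

def icp_step(source, target):
    """One point-to-point ICP iteration: find the rigid transform (R, t)
    that best aligns `source` onto `target`, with correspondences taken
    as nearest neighbours (brute force here for clarity)."""
    # nearest-neighbour correspondences
    d = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    matched = target[np.argmin(d, axis=1)]
    # Kabsch: centre both clouds, then SVD of the cross-covariance
    mu_s, mu_t = source.mean(axis=0), matched.mean(axis=0)
    H = (source - mu_s).T @ (matched - mu_t)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

Iterating this step until the transform converges yields the relative pallet-fork pose; the paper additionally restricts the input cloud to the upper region of the pallet.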
cs.RO / 2 / 2602.16758

Smooth trajectory generation and hybrid B-splines-Quaternions based tool path interpolation for a 3T1R parallel kinematic milling robot

Akhbari, Sina, Mahboubkhah, Mehran
Abstract
This paper presents a smooth trajectory generation method for a four-degree-of-freedom parallel kinematic milling robot. The proposed approach integrates B-spline and Quaternion interpolation techniques to manage decoupled position and orientation data points. The synchronization of orientation and arc-length-parameterized position data is achieved through the fitting of smooth piece-wise Bezier curves, which describe the non-linear relationship between path length and tool orientation, solved via sequential quadratic programming. By leveraging the convex hull properties of Bezier curves, the method ensures spatial and temporal separation constraints for multi-agent trajectory generation. Unit quaternions are employed for orientation interpolation, providing a robust and efficient representation that avoids gimbal lock and facilitates smooth, continuous rotation. Modifier polynomials are used for position interpolation. Temporal trajectories are optimized using minimum jerk, time-optimal piece-wise Bezier curves in two stages: task space followed by joint space, implemented on a low-cost microcontroller. Experimental results demonstrate that the proposed method offers enhanced accuracy, reduced velocity fluctuations, and computational efficiency compared to conventional interpolation methods.
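The unit-quaternion orientation interpolation mentioned above is commonly implemented with SLERP; a self-contained numpy sketch (the generic textbook version, not the paper's hybrid B-spline-quaternion scheme):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0, q1
    (arrays [w, x, y, z]); returns the unit quaternion at fraction t."""
    q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
    dot = np.dot(q0, q1)
    if dot < 0.0:            # take the short arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

Because the interpolant stays on the unit sphere, the orientation path is smooth and free of gimbal lock, which is the property the paper exploits.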
cs.RO / 3 / 2602.16825

RRT$^\eta$: Sampling-based Motion Planning and Control from STL Specifications using Arithmetic-Geometric Mean Robustness

Ahmad, Ahmad, Liu, Shuo, Tron, Roberto, Belta, Calin
Abstract
Sampling-based motion planning has emerged as a powerful approach for robotics, enabling exploration of complex, high-dimensional configuration spaces. When combined with Signal Temporal Logic (STL), a temporal logic widely used for formalizing interpretable robotic tasks, these methods can address complex spatiotemporal constraints. However, traditional approaches rely on min-max robustness measures that focus only on critical time points and subformulae, creating non-smooth optimization landscapes with sharp decision boundaries that hinder efficient tree exploration. We propose RRT$^\eta$, a sampling-based planning framework that integrates the Arithmetic-Geometric Mean (AGM) robustness measure to evaluate satisfaction across all time points and subformulae. Our key contributions include: (1) AGM robustness interval semantics for reasoning about partial trajectories during tree construction, (2) an efficient incremental monitoring algorithm computing these intervals, and (3) enhanced Direction of Increasing Satisfaction vectors leveraging Fulfillment Priority Logic (FPL) for principled objective composition. Our framework synthesizes dynamically feasible control sequences satisfying STL specifications with high robustness while maintaining the probabilistic completeness and asymptotic optimality of RRT$^\ast$. We validate our approach on three robotic systems: a double-integrator point robot, a unicycle mobile robot, and a 7-DOF robot arm, demonstrating superior performance over traditional STL robustness-based planners in multi-constraint scenarios with limited guidance signals.
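For intuition, the contrast between the traditional min-based conjunction and an AGM-style one can be sketched as follows. This is a deliberately simplified illustration of the idea (satisfied subformulae averaged geometrically, violated ones arithmetically); the paper's exact AGM semantics use a different normalization:

```python
import numpy as np

def min_robustness(vals):
    """Traditional conjunction: only the single worst subformula matters."""
    return min(vals)

def agm_conjunction(vals):
    """AGM-style conjunction (simplified sketch, not the paper's exact
    definition): geometric mean when every subformula is satisfied,
    arithmetic mean of the violating terms otherwise. The sign still
    agrees with satisfaction: positive iff all values are positive."""
    vals = np.asarray(vals, float)
    if np.all(vals > 0):
        return np.prod(vals) ** (1.0 / len(vals))
    neg = vals[vals <= 0]
    return neg.mean()
```

Unlike the min, the AGM-style score changes when any subformula improves, which is what gives the planner a smoother guidance signal.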
cs.RO / 4 / 2602.16846

Sound of Touch: Active Acoustic Tactile Sensing via String Vibrations

Yi, Xili, Xing, Ying, Manchester, Zachary, Fazeli, Nima
Abstract
Distributed tactile sensing remains difficult to scale over large areas: dense sensor arrays increase wiring, cost, and fragility, while many alternatives provide limited coverage or miss fast interaction dynamics. We present Sound of Touch, an active acoustic tactile-sensing methodology that uses vibrating tensioned strings as sensing elements. The string is continuously excited electromagnetically, and a small number of pickups (contact microphones) observe spectral changes induced by contact. From short-duration audio signals, our system estimates contact location and normal force, and detects slip. To guide design and interpret the sensing mechanism, we derive a physics-based string-vibration simulator that predicts how contact position and force shift vibration modes. Experiments demonstrate millimeter-scale localization, reliable force estimation, and real-time slip detection. Our contributions are: (i) a lightweight, scalable string-based tactile sensing hardware concept for instrumenting extended robot surfaces; (ii) a physics-grounded simulation and analysis tool for contact-induced spectral shifts; and (iii) a real-time inference pipeline that maps vibration measurements to contact state.
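The physics behind the contact-induced spectral shifts is the classic tensioned-string relation; a minimal sketch of the ideal fixed-fixed mode frequencies (illustrative parameters, not the authors' simulator):

```python
import numpy as np

def string_mode_frequencies(tension, lin_density, length, n_modes=5):
    """Natural frequencies (Hz) of an ideal fixed-fixed tensioned string:
    f_n = (n / 2L) * sqrt(T / mu). Contact that effectively shortens the
    vibrating length L shifts every mode upward, which is the spectral
    cue the sensing pipeline reads out."""
    c = np.sqrt(tension / lin_density)          # wave speed
    n = np.arange(1, n_modes + 1)
    return n * c / (2.0 * length)
```

The paper's simulator additionally models how contact force perturbs the boundary conditions; this closed form only captures the position-dependent shift.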
cs.RO / 5 / 2602.16861

"Hello, I'm Delivering. Let Me Pass By": Navigating Public Pathways with Walk-along with Robots in Crowded City Streets

Cheon, EunJeong, Shin, Do Yeon
Abstract
As the presence of autonomous robots in public spaces increases, whether navigating campus walkways or neighborhood sidewalks, understanding how to carefully study these robots becomes critical. While HRI research has conducted field studies in public spaces, these are often limited to controlled experiments with prototype robots or structured observational methods, such as the Wizard of Oz technique. However, the autonomous mobile robots we encounter today, particularly delivery robots, operate beyond the control of researchers, navigating dynamic routes and unpredictable environments. To address this challenge, a more deliberate approach is required. Drawing inspiration from public realm ethnography in urban studies, geography, and sociology, this paper proposes the Walk-Along with Robots (WawR) methodology. We outline the key features of this method, the steps we applied in our study, the unique insights it offers, and the ways it can be evaluated. We hope this paper stimulates further discussion on research methodologies for studying autonomous robots in public spaces.
cs.RO / 6 / 2602.16863

SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

Kedia, Kushal, Lum, Tyler Ga Wei, Bohg, Jeannette, Liu, C. Karen
Abstract
The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.
cs.RO / 7 / 2602.16870

Boreas Road Trip: A Multi-Sensor Autonomous Driving Dataset on Challenging Roads

Lisus, Daniil, Papais, Katya M., Gentil, Cedric Le, Preston-Krebs, Elliot, Lambert, Andrew, Leung, Keith Y. K., Barfoot, Timothy D.
Abstract
The Boreas Road Trip (Boreas-RT) dataset extends the multi-season Boreas dataset to new and diverse locations that pose challenges for modern autonomous driving algorithms. Boreas-RT comprises 60 sequences collected over 9 real-world routes, totalling 643 km of driving. Each route is traversed multiple times, enabling evaluation in identical environments under varying traffic and, in some cases, weather conditions. The data collection platform includes a 5MP FLIR Blackfly S camera, a 360 degree Navtech RAS6 Doppler-enabled spinning radar, a 128-channel 360 degree Velodyne Alpha Prime lidar, an Aeva Aeries II FMCW Doppler-enabled lidar, a Silicon Sensing DMU41 inertial measurement unit, and a Dynapar wheel encoder. Centimetre-level ground truth is provided via post-processed Applanix POS LV GNSS-INS data. The dataset includes precise extrinsic and intrinsic calibrations, a publicly available development kit, and a live leaderboard for odometry and metric localization. Benchmark results show that many state-of-the-art odometry and localization algorithms overfit to simple driving environments and degrade significantly on the more challenging Boreas-RT routes. Boreas-RT provides a unified dataset for evaluating multi-modal algorithms across diverse road conditions. The dataset, leaderboard, and development kit are available at www.boreas.utias.utoronto.ca.
cs.RO / 8 / 2602.16898

MALLVI: a multi agent framework for integrated generalized robotics manipulation

Ahmadi, Iman, Taji, Mehrshad, Kashani, Arad Mahdinezhad, Jadidi, AmirHossein, Kashani, Saina, Khalaj, Babak
Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI.
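The decompose-execute-verify loop described above can be sketched with stub agents (all names here are hypothetical stand-ins; the real system backs each role with an LLM or VLM):

```python
def run_closed_loop(instruction, observe, agents, max_retries=5):
    """Minimal closed-loop skeleton: decompose the instruction into
    atomic steps, execute each, and let a verifier inspect the fresh
    observation to decide whether to retry or advance."""
    for step in agents["decomposer"](instruction):
        for _ in range(max_retries):
            agents["executor"](step)
            if agents["verifier"](step, observe()):
                break           # feedback confirms success: next step
    return observe()
```

The retry-on-failed-verification branch is what distinguishes this from the open-loop pipelines the abstract criticizes; MALLVi's Reflector refines it further by reactivating only the agents implicated in the failure.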
cs.RO / 9 / 2602.16911

SparTa: Sparse Graphical Task Models from a Handful of Demonstrations

Röfer, Adrian, Heppert, Nick, Valada, Abhinav
Abstract
Learning long-horizon manipulation tasks efficiently is a central challenge in robot learning from demonstration. Unlike recent endeavors that focus on directly learning the task in the action domain, we focus on inferring what the robot should achieve in the task, rather than how to do so. To this end, we represent evolving scene states using a series of graphical object relationships. We propose a demonstration segmentation and pooling approach that extracts a series of manipulation graphs and estimates distributions over object states across task phases. In contrast to prior graph-based methods that capture only partial interactions or short temporal windows, our approach captures complete object interactions spanning from the onset of control to the end of the manipulation. To improve robustness when learning from multiple demonstrations, we additionally perform object matching using pre-trained visual features. In extensive experiments, we evaluate our method's demonstration segmentation accuracy and the utility of learning from multiple demonstrations for finding a desired minimal task model. Finally, we deploy the fitted models both in simulation and on a real robot, demonstrating that the resulting task representations support reliable execution across environments.
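The segmentation idea (cut the demonstration wherever the symbolic scene relations change) can be sketched as follows; the flat relation sets here are a deliberately simplified stand-in for the paper's manipulation graphs:

```python
def segment_by_relations(frames):
    """Return phase-start indices: a new phase begins at every frame
    whose relation set differs from the previous frame's. Each frame
    is a set of (relation, subject, object) tuples."""
    return [0] + [i for i in range(1, len(frames))
                  if frames[i] != frames[i - 1]]
```

Pooling the frames between consecutive boundaries then gives per-phase distributions over object states, which is the step the paper estimates across multiple demonstrations.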
cs.RO / 10 / 2602.17101

Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success

Burde, Varun, Burget, Pavel, Sattler, Torsten
Abstract
3D reconstruction serves as the foundational layer for numerous robotic perception tasks, including 6D object pose estimation and grasp pose generation. Modern 3D reconstruction methods for objects can produce visually and geometrically impressive meshes from multi-view images, yet standard geometric evaluations do not reflect how reconstruction quality influences downstream tasks such as robotic manipulation performance. This paper addresses this gap by introducing a large-scale, physics-based benchmark that evaluates 6D pose estimators and 3D mesh models based on their functional efficacy in grasping. We analyze the impact of model fidelity by generating grasps on various reconstructed 3D meshes and executing them on the ground-truth model, simulating how grasp poses generated with an imperfect model affect interaction with the real object. This assesses the combined impact of pose error, grasp robustness, and geometric inaccuracies from 3D reconstruction. Our results show that reconstruction artifacts significantly decrease the number of grasp pose candidates but have a negligible effect on grasping performance given an accurately estimated pose. Our results also reveal that the relationship between grasp success and pose error is dominated by spatial error, and even a simple translation error provides insight into the success of the grasping pose of symmetric objects. This work provides insight into how perception systems relate to object manipulation using robots.
cs.RO / 11 / 2602.17110

Grasp Synthesis Matching From Rigid To Soft Robot Grippers Using Conditional Flow Matching

Parulekar, Tanisha, Shi, Ge, Pinskier, Josh, Howard, David, Chung, Jen Jen
Abstract
A representation gap exists between grasp synthesis for rigid and soft grippers. Anygrasp [1] and many other grasp synthesis methods are designed for rigid parallel grippers, and adapting them to soft grippers often fails to capture their unique compliant behaviors, resulting in data-intensive and inaccurate models. To bridge this gap, this paper proposes a novel framework to map grasp poses from a rigid gripper model to a soft Fin-ray gripper. We utilize Conditional Flow Matching (CFM), a generative model, to learn this complex transformation. Our methodology includes a data collection pipeline to generate paired rigid-soft grasp poses. A U-Net autoencoder conditions the CFM model on the object's geometry from a depth image, allowing it to learn a continuous mapping from an initial Anygrasp pose to a stable Fin-ray gripper pose. We validate our approach on a 7-DOF robot, demonstrating that our CFM-generated poses achieve a higher overall success rate for seen and unseen objects (34% and 46% respectively) compared to the baseline rigid poses (6% and 25% respectively) when executed by the soft gripper. The model shows significant improvements, particularly for cylindrical (50% and 100% success for seen and unseen objects) and spherical objects (25% and 31% success for seen and unseen objects), and successfully generalizes to unseen objects. This work presents CFM as a data-efficient and effective method for transferring grasp strategies, offering a scalable methodology for other soft robotic systems.
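The generative machinery can be sketched via the standard conditional flow matching recipe with linear paths (the generic formulation; the paper additionally conditions the velocity field on depth-image features through a U-Net encoder):

```python
import numpy as np

def cfm_target(x0, x1, t):
    """Conditional flow matching with linear (optimal-transport) paths:
    the point on the path is x_t = (1 - t) x0 + t x1, and the regression
    target for the learned velocity field is the constant direction
    x1 - x0. Here x0 would be a rigid grasp pose, x1 the paired soft one."""
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def cfm_loss(v_pred, target):
    """Mean-squared flow matching loss on the predicted velocity."""
    return np.mean((v_pred - target) ** 2)
```

At inference time, integrating the learned velocity field from the initial Anygrasp pose traces the continuous rigid-to-soft mapping the paper describes.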
cs.RO / 12 / 2602.17128

Physical Human-Robot Interaction for Grasping in Augmented Reality via Rigid-Soft Robot Synergy

Huang, Huishi, Klusmann, Jack, Wang, Haozhe, Ji, Shuchen, Ying, Fengkang, Zhang, Yiyuan, Nassour, John, Cheng, Gordon, Rus, Daniela, Liu, Jun, Ang Jr, Marcelo H, Laschi, Cecilia
Abstract
Hybrid rigid-soft robots combine the precision of rigid manipulators with the compliance and adaptability of soft arms, offering a promising approach for versatile grasping in unstructured environments. However, coordinating hybrid robots remains challenging, due to difficulties in modeling, perception, and cross-domain kinematics. In this work, we present a novel augmented reality (AR)-based physical human-robot interaction framework that enables direct teleoperation of a hybrid rigid-soft robot for simple reaching and grasping tasks. Using an AR headset, users can interact with a simulated model of the robotic system integrated into a general-purpose physics engine, which is superimposed on the real system, allowing simulated execution prior to real-world deployment. To ensure consistent behavior between the virtual and physical robots, we introduce a real-to-simulation parameter identification pipeline that leverages the inherent geometric properties of the soft robot, enabling accurate modeling of its static and dynamic behavior as well as the control system's response.
cs.RO / 13 / 2602.17166

Geometric Inverse Flight Dynamics on SO(3) and Application to Tethered Fixed-Wing Aircraft

Franchi, Antonio, Gabellieri, Chiara
Abstract
We present a robotics-oriented, coordinate-free formulation of inverse flight dynamics for fixed-wing aircraft on SO(3). Translational force balance is written in the world frame and rotational dynamics in the body frame; aerodynamic directions (drag, lift, side) are defined geometrically, avoiding local attitude coordinates. Enforcing coordinated flight (no sideslip), we derive a closed-form trajectory-to-input map yielding the attitude, angular velocity, and thrust-angle-of-attack pair, and we recover the aerodynamic moment coefficients component-wise. Applying such a map to tethered flight on spherical parallels, we obtain analytic expressions for the required bank angle and identify a specific zero-bank locus where the tether tension exactly balances centrifugal effects, highlighting the decoupling between aerodynamic coordination and the apparent gravity vector. Under a simple lift/drag law, the minimal-thrust angle of attack admits a closed form. These pointwise quasi-steady inversion solutions become steady-flight trim when the trajectory and rotational dynamics are time-invariant. The framework bridges inverse simulation in aeronautics with geometric modeling in robotics, providing a rigorous building block for trajectory design and feasibility checks.
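For intuition on the bank-angle expressions, the untethered baseline is the textbook coordinated-turn relation tan(phi) = v^2 / (g r); the paper's contribution is the tethered analogue and its zero-bank locus, which this sketch does not reproduce:

```python
import numpy as np

def coordinated_bank_angle(speed, radius, g=9.81):
    """Bank angle (rad) for coordinated, level circular flight:
    tan(phi) = v^2 / (g * r). In tethered flight on a spherical parallel,
    the tether tension adds a term that can drive the required bank to
    zero at a specific locus (see the paper); this is the baseline only."""
    return np.arctan2(speed**2, g * radius)
```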
cs.RO / 14 / 2602.17199

Nonlinear Predictive Control of the Continuum and Hybrid Dynamics of a Suspended Deformable Cable for Aerial Pick and Place

Rapuano, Antonio, Shen, Yaolei, Califano, Federico, Gabellieri, Chiara, Franchi, Antonio
Abstract
This paper presents a framework for aerial manipulation of an extensible cable that combines a high-fidelity model based on partial differential equations (PDEs) with a reduced-order representation suitable for real-time control. The PDEs are discretised using a finite-difference method, and proper orthogonal decomposition is employed to extract a reduced-order model (ROM) that retains the dominant deformation modes while significantly reducing computational complexity. Based on this ROM, a nonlinear model predictive control scheme is formulated, capable of stabilizing cable oscillations and handling hybrid transitions such as payload attachment and detachment. Simulation results confirm the stability, efficiency, and robustness of the ROM, as well as the effectiveness of the controller in regulating cable dynamics under a range of operating conditions. Additional simulations illustrate the application of the ROM for trajectory planning in constrained environments, demonstrating the versatility of the proposed approach. Overall, the framework enables real-time, dynamics-aware control of unmanned aerial vehicles (UAVs) carrying suspended flexible cables.
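The model-reduction step can be sketched as standard POD on a snapshot matrix (the generic recipe; the paper applies it to finite-difference discretizations of the cable PDE):

```python
import numpy as np

def pod_basis(snapshots, energy=0.99):
    """Proper orthogonal decomposition of a snapshot matrix (each column
    one state sample): SVD, then keep the fewest left singular vectors
    whose cumulative squared singular values capture the requested
    energy fraction. Returns the reduced basis and retained values."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(cum, energy) + 1)
    return U[:, :r], s[:r]
```

Projecting the full state onto this basis yields the low-dimensional coordinates that the nonlinear MPC scheme then propagates in real time.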
cs.RO / 15 / 2602.17226

Multi-session Localization and Mapping Exploiting Topological Information

Montano-Olivan, Lorenzo, Placed, Julio A., Montano, Luis, Lazaro, Maria T.
Abstract
Operating in previously visited environments is becoming increasingly crucial for autonomous systems, with direct applications in autonomous driving, surveying, and warehouse or household robotics. This repeated exposure to observing the same areas poses significant challenges for mapping and localization, key components for enabling any higher-level task. In this work, we propose a novel multi-session framework that builds on map-based localization, in contrast to the common practice of greedily running full SLAM sessions and trying to find correspondences between the resulting maps. Our approach incorporates a topology-informed, uncertainty-aware decision-making mechanism that analyzes the pose-graph structure to detect low-connectivity regions, selectively triggering mapping and loop closing modules. The resulting map and pose-graph are seamlessly integrated into the existing model, reducing accumulated error and enhancing global consistency. We validate our method on overlapping sequences from datasets and demonstrate its effectiveness in a real-world mine-like environment.
cs.RO / 16 / 2602.17259

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Zhao, Han, Wang, Jingbo, Song, Wenxuan, Chen, Shuai, Liu, Yang, Wang, Yan, Li, Haoang, Wang, Donglin
Abstract
Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.
cs.RO / 17 / 2602.17393

Contact-Anchored Proprioceptive Odometry for Quadruped Robots

Sun, Minxing, Mao, Yao
Abstract
Reliable odometry for legged robots without cameras or LiDAR remains challenging due to IMU drift and noisy joint velocity sensing. This paper presents a purely proprioceptive state estimator that uses only IMU and motor measurements to jointly estimate body pose and velocity, with a unified formulation applicable to biped, quadruped, and wheel-legged robots. The key idea is to treat each contacting leg as a kinematic anchor: joint-torque-based foot wrench estimation selects reliable contacts, and the corresponding footfall positions provide intermittent world-frame constraints that suppress long-term drift. To prevent elevation drift during extended traversal, we introduce a lightweight height clustering and time-decay correction that snaps newly recorded footfall heights to previously observed support planes. To improve foot velocity observations under encoder quantization, we apply an inverse-kinematics cubature Kalman filter that directly filters foot-end velocities from joint angles and velocities. The implementation further mitigates yaw drift through multi-contact geometric consistency and degrades gracefully to a kinematics-derived heading reference when IMU yaw constraints are unavailable or unreliable. We evaluate the method on four quadruped platforms (three Astrall robots and a Unitree Go2 EDU) using closed-loop trajectories. On Astrall point-foot robot A, a ~200 m horizontal loop and a ~15 m vertical loop return with 0.1638 m and 0.219 m error, respectively; on wheel-legged robot B, the corresponding errors are 0.2264 m and 0.199 m. On wheel-legged robot C, a ~700 m horizontal loop yields 7.68 m error and a ~20 m vertical loop yields 0.540 m error. The Unitree Go2 EDU closes a ~120 m horizontal loop with 2.2138 m error and an ~8 m vertical loop with less than 0.1 m vertical error. Code: github.com/ShineMinxing/Ros2Go2Estimator.git
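The benefit of filtering quantized velocity readings can be illustrated with a scalar Kalman filter (a drastic simplification of the paper's inverse-kinematics cubature filter; the variances q and r are illustrative, not from the paper):

```python
import numpy as np

def kf_smooth(meas, q=1e-3, r=1e-2):
    """Scalar random-walk Kalman filter over a sequence of noisy
    measurements: q is the process variance, r the measurement variance.
    Applied to encoder-quantized velocity, it trades a little lag for a
    large reduction in quantization jitter."""
    x, p = float(meas[0]), 1.0
    out = []
    for z in meas:
        p += q                      # predict (random-walk process model)
        k = p / (p + r)             # Kalman gain
        x += k * (z - x)            # measurement update
        p *= (1 - k)
        out.append(x)
    return np.array(out)
```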
Chinese Translation
对于没有摄像头或激光雷达的腿式机器人,可靠的里程计仍然面临挑战,这主要是由于惯性测量单元(IMU)漂移和关节速度传感器噪声。本文提出了一种纯粹的本体感觉状态估计器,仅使用IMU和电机测量来联合估计身体姿态和速度,采用统一的公式适用于双足、四足和轮腿机器人。关键思想是将每个接触的腿视为运动学锚点:基于关节扭矩的足部力旋量估计选择可靠的接触点,相应的足部接触位置提供间歇性的世界坐标系约束,从而抑制长期漂移。为了防止在长时间行驶过程中高度漂移,我们引入了一种轻量级的高度聚类和时间衰减修正,将新记录的足部高度调整到先前观察到的支撑平面。为了改善在编码器量化下的足部速度观测,我们应用了一种逆运动学容积卡尔曼滤波器,直接从关节角度和速度中滤波足端速度。该实现进一步通过多接触几何一致性减轻偏航漂移,并在IMU偏航约束不可用或不可靠时优雅地退化为基于运动学的航向参考。我们在四个四足平台(三个Astrall机器人和一个Unitree Go2 EDU)上使用闭环轨迹评估该方法。在Astrall点足机器人A上,约200米的水平环路和约15米的垂直环路的误差分别为0.1638米和0.219米;在轮腿机器人B上,相应的误差为0.2264米和0.199米。在轮腿机器人C上,约700米的水平环路产生7.68米的误差,约20米的垂直环路产生0.540米的误差。Unitree Go2 EDU闭合了约120米的水平环路,误差为2.2138米,约8米的垂直环路误差小于0.1米。github.com/ShineMinxing/Ros2Go2Estimator.git
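The height-clustering and time-decay correction summarized above can be sketched in a few lines. This is a minimal illustration under assumed conventions; the parameters `tol` and `decay` and the cluster representation are illustrative, not values from the paper:

```python
def snap_footfall_height(z_new, planes, tol=0.03, decay=0.9):
    """Snap a newly recorded footfall height to the nearest known
    support plane if one lies within `tol` meters; otherwise register
    a new plane.

    planes: list of [height, weight] clusters (mutated in place).
    Returns the (possibly corrected) height.
    """
    for p in planes:
        if abs(z_new - p[0]) < tol:
            # Blend the plane height toward the new observation,
            # discounting accumulated evidence by a time-decay factor.
            w = p[1] * decay
            p[0] = (w * p[0] + z_new) / (w + 1.0)
            p[1] = w + 1.0
            return p[0]
    planes.append([z_new, 1.0])
    return z_new
```

Each cluster carries a decayed evidence weight, so stale planes adapt faster to fresh footfalls while established planes resist single noisy measurements.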
cs.RO / 18 / 2602.17415

Distributed Virtual Model Control for Scalable Human-Robot Collaboration in Shared Workspace

共享工作空间中可扩展人机协作的分布式虚拟模型控制
Zhang, Yi, Faris, Omar, Sirithunge, Chapa, Chu, Kai-Fung, Iida, Fumiya, Forni, Fulvio
Abstract
We present a decentralized, agent-agnostic, and safety-aware control framework for human-robot collaboration based on Virtual Model Control (VMC). In our approach, both humans and robots are embedded in the same virtual-component-shaped workspace, where motion results from interaction with virtual springs and dampers rather than explicit trajectory planning. A decentralized, force-based stall detector identifies deadlocks, which are resolved through negotiation. This reduces the probability of robots getting stuck in the block placement task from up to 61.2% to zero in our experiments. The framework scales without structural changes thanks to the distributed implementation: in experiments we demonstrate safe collaboration with up to two robots and two humans, and in simulation with up to four robots, maintaining inter-agent separation at around 20 cm. Results show that the method shapes robot behavior intuitively by adjusting control parameters and achieves deadlock-free operation across team sizes in all tested scenarios.
Chinese Translation
我们提出了一种基于虚拟模型控制(Virtual Model Control, VMC)的去中心化、与代理无关且关注安全的控制框架,以实现人机协作。在我们的方法中,人类与机器人共同嵌入在同一个虚拟组件形状的工作空间中,运动是与虚拟弹簧和阻尼器相互作用的结果,而不是明确的轨迹规划。一个去中心化的基于力的停滞检测器识别死锁,并通过协商来解决。这将机器人在放置任务中卡住的概率从实验中的61.2%降低到零。由于分布式实现,该框架在不进行结构性改变的情况下可扩展:在实验中,我们展示了与最多两名机器人和两名人类的安全协作,而在模拟中则可扩展至四个机器人,保持代理间的距离约为20厘米。结果表明,该方法通过调整控制参数直观地塑造了机器人的行为,并在所有测试场景中实现了跨团队规模的无死锁操作。
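The virtual spring-damper interaction underlying VMC can be illustrated with a minimal 2D component. The gains `k`, `c` and the rest length below are illustrative assumptions, not the paper's tuning:

```python
import math

def vmc_force(pos, vel, anchor, k=80.0, c=12.0, rest=0.25):
    """Force from a virtual spring-damper attached between an agent
    and an anchor point (2D). The spring acts along the line to the
    anchor with natural length `rest`; the damper opposes the agent's
    velocity, so motion emerges without explicit trajectory planning."""
    dx, dy = anchor[0] - pos[0], anchor[1] - pos[1]
    d = math.hypot(dx, dy) or 1e-9          # avoid division by zero
    ux, uy = dx / d, dy / d                  # unit vector toward anchor
    fs = k * (d - rest)                      # spring force magnitude
    return (fs * ux - c * vel[0], fs * uy - c * vel[1])
```

At the rest length the spring term vanishes, which is how such components hold agents at a target separation instead of collapsing them onto the anchor.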
cs.RO / 19 / 2602.17421

3D-printed Soft Optical sensor with a Lens (SOLen) for light guidance in mechanosensing

用于机械感知中光导引的带透镜3D打印软光学传感器(SOLen)
Cafiso, Diana, Trunin, Petr, Gay, Carolina, Beccai, Lucia
Abstract
Additive manufacturing is enabling soft robots with increasingly complex geometries, creating a demand for sensing solutions that remain compatible with single-material, one-step fabrication. Optical soft sensors are attractive for monolithic printing, but their performance is often degraded by uncontrolled light propagation (ambient coupling, leakage, scattering), while common mitigation strategies typically require multimaterial interfaces. Here, we present an approach for 3D printed soft optical sensing (SOLen), in which a printed lens is placed in front of an emitter within a Y-shaped waveguide. The sensing mechanism relies on deformation-induced lens rotation and focal-spot translation, redistributing optical power between the two branches to generate a differential output that encodes both motion direction and amplitude. An acrylate polyurethane resin was modified with lauryl acrylate to improve compliance and optical transmittance, and single-layer optical characterization was used to derive wavelength-dependent refractive index and transmittance while minimizing DLP layer-related artifacts. The measured refractive index was used in simulations to design a lens profile for a target focal distance, which was then printed with sub-millimeter fidelity. Rotational tests demonstrated reproducible branch-selective signal switching over multiple cycles. These results establish a transferable material-to-optics workflow for soft optical sensors with a lens, offering new functionalities for next-generation soft robots.
Chinese Translation
增材制造使得软机器人能够实现越来越复杂的几何形状,从而对兼容单一材料、一步成型的传感解决方案提出了需求。光学软传感器因其适合单体打印而受到关注,但其性能往往受到不受控的光传播(环境耦合、泄漏、散射)的影响,而常见的缓解策略通常需要多材料接口。在此,我们提出了一种用于3D打印软光学传感(SOLen)的方法,其中一个打印的透镜被放置在Y形波导内的发射器前。传感机制依赖于变形引起的透镜旋转和焦点位置的移动,重新分配两个分支之间的光功率,以生成编码运动方向和幅度的差分输出。我们对丙烯酸聚氨酯树脂进行了改性,添加了月桂酸丙烯酸酯,以提高其柔顺性和光学透过率,并使用单层光学表征来推导波长依赖的折射率和透过率,同时最小化DLP层相关的伪影。测得的折射率被用于模拟,以设计目标焦距的透镜轮廓,随后以亚毫米的精度进行打印。旋转测试表明,在多个循环中实现了可重复的分支选择信号切换。这些结果建立了一种可转移的材料到光学的工作流程,为具有新功能的软光学传感器与透镜的下一代软机器人奠定了基础。
cs.RO / 20 / 2602.17472

A Cost-Effective and Climate-Resilient Air Pressure System for Rain Effect Reduction on Automated Vehicle Cameras

一种具有成本效益和气候适应性的气压系统,用于减少自动驾驶车辆摄像头的雨水影响
Sabry, Mohamed, Gorospe, Joseba, Olaverri-Monreal, Cristina
Abstract
Recent advances in automated vehicles have focused on improving perception performance under adverse weather conditions; however, research on physical hardware solutions remains limited, despite their importance for perception-critical applications such as vehicle platooning. Existing approaches, such as hydrophilic or hydrophobic lenses and sprays, provide only partial mitigation, while industrial protection systems entail high cost and do not scale for automotive deployment. To address these limitations, this paper presents a cost-effective hardware solution for rainy conditions, designed to be compatible with multiple cameras simultaneously. Beyond its technical contribution, the proposed solution supports sustainability goals in transportation systems. By enabling compatibility with existing camera-based sensing platforms, the system extends the operational reliability of automated vehicles without requiring additional high-cost sensors or hardware replacements. This approach reduces resource consumption, supports modular upgrades, and promotes more cost-efficient deployment of automated vehicle technologies, particularly in challenging weather conditions where system failures would otherwise lead to inefficiencies and increased emissions. The proposed system increased the pedestrian detection accuracy of a Deep Learning model from 8.3% to 41.6%.
Chinese Translation
近年来,自动驾驶车辆的进展集中在改善恶劣天气条件下的感知性能;然而,尽管物理硬件解决方案对感知关键应用(如车辆编队)至关重要,但相关研究仍然有限。现有的方法,如亲水或疏水镜头和喷雾,仅能部分缓解问题,而工业保护系统则意味着高成本,并且无法实现汽车部署的可扩展性。为了解决这些局限性,本文提出了一种针对雨天条件的具有成本效益的硬件解决方案,旨在与多个摄像头同时兼容。除了其技术贡献外,所提方案还支持交通系统的可持续发展目标。通过与现有基于摄像头的传感平台的兼容性,该系统在不需要额外高成本传感器或硬件更换的情况下,延长了自动驾驶车辆的操作可靠性。这种方法减少了资源消耗,支持模块化升级,并促进了在恶劣天气条件下更具成本效益的自动驾驶技术部署,否则系统故障将导致效率低下和排放增加。所提系统能够将深度学习模型的行人检测准确率从8.3%提高到41.6%。
cs.RO / 21 / 2602.17474

Optically Sensorized Electro-Ribbon Actuator (OS-ERA)

光学传感电致带状驱动器 (OS-ERA)
Gay, Carolina, Trunin, Petr, Cafiso, Diana, Xu, Yuejun, Taghavi, Majid, Beccai, Lucia
Abstract
Electro-Ribbon Actuators (ERAs) are lightweight flexural actuators that exhibit ultrahigh displacement and fast movement. However, their embedded sensing relies on capacitive sensors with limited precision, which hinders accurate control. We introduce OS-ERA, an optically sensorized ERA that yields reliable proprioceptive information, and we focus on the design and integration of a sensing solution without affecting actuation. To analyse the complex curvature of an ERA in motion, we design and embed two soft optical waveguide sensors. A classifier is trained to map the sensing signals in order to distinguish eight bending states. We validate our model on six held-out trials and compare it against signal trajectories learned from training runs. Across all tests, the sensing output signals follow the training manifold, and the predicted sequence mirrors real performance and confirms repeatability. Despite deliberate train-test mismatches in actuation speed, the signal trajectories preserve their shape, and classification remains consistently accurate, demonstrating practical voltage- and speed-invariance. As a result, OS-ERA classifies bending states with high fidelity; it is fast and repeatable, solving a longstanding bottleneck of the ERA, enabling steps toward closed-loop control.
Chinese Translation
电致带状驱动器 (Electro-Ribbon Actuator, ERA) 是一种轻量级的弯曲驱动器,具有超高位移和快速运动的特点。然而,它们的嵌入式传感依赖于精度有限的电容传感器,这阻碍了精确控制。我们提出了 OS-ERA,一种光学传感的 ERA,能够提供可靠的本体感知信息,并专注于设计和集成一种不影响驱动的传感解决方案。为了分析运动中 ERA 的复杂曲率,我们设计并嵌入了两个柔性光波导传感器。我们训练了一个分类器来映射传感信号,以区分八种弯曲状态。我们在六个独立试验中验证了我们的模型,并将其与从训练过程中学习到的信号轨迹进行比较。在所有测试中,传感输出信号遵循训练流形,预测序列反映了真实性能并确认了重复性。尽管在驱动速度上故意进行了训练和测试的不匹配,但信号轨迹保持其形状,分类始终保持准确,展示了实际的电压和速度不变性。因此,OS-ERA 以高保真度分类弯曲状态;它快速且可重复,解决了 ERA 的一个长期瓶颈,为闭环控制迈出了重要一步。
cs.RO / 22 / 2602.17502

Proximal powered knee placement: a case study

近端动力膝关节安置:案例研究
Embry, Kyle R., Vianello, Lorenzo, Lipsey, Jim, Ursetta, Frank, Stephens, Michael, Wang, Zhi, Simon, Ann M., Ikeda, Andrea J., Finucane, Suzanne B., Anarwala, Shawana, Hargrove, Levi J.
Abstract
Lower limb amputation affects millions worldwide, leading to impaired mobility, reduced walking speed, and limited participation in daily and social activities. Powered prosthetic knees can partially restore mobility by actively assisting knee joint torque, improving gait symmetry, sit-to-stand transitions, and walking speed. However, added mass from powered components may diminish these benefits, negatively affecting gait mechanics and increasing metabolic cost. Consequently, optimizing mass distribution, rather than simply minimizing total mass, may provide a more effective and practical solution. In this exploratory study, we evaluated the feasibility of above-knee powertrain placement for a powered prosthetic knee in a small cohort. Compared to below-knee placement, the above-knee configuration demonstrated improved walking speed (+9.2% for one participant) and cadence (+3.6%), with mixed effects on gait symmetry. Kinematic measures indicated similar knee range of motion and peak velocity across configurations. Additional testing on ramps and stairs confirmed the robustness of the control strategy across multiple locomotion tasks. These preliminary findings suggest that above-knee placement is functionally feasible and that careful mass distribution can preserve the benefits of powered assistance while mitigating adverse effects of added weight. Further studies are needed to confirm these trends and guide design and clinical recommendations.
Chinese Translation
下肢截肢影响着全球数百万人的生活,导致行动能力受损、步行速度降低以及日常和社交活动参与受限。动力假肢膝关节通过主动辅助膝关节扭矩,部分恢复了行动能力,改善了步态对称性、坐立转换和步行速度。然而,动力组件的额外质量可能会削弱这些益处,负面影响步态力学并增加代谢成本。因此,优化质量分布,而不仅仅是最小化总质量,可能提供更有效和实用的解决方案。在这项探索性研究中,我们评估了在小规模人群中,膝上动力传动系统安置的可行性。与膝下安置相比,膝上配置显示出步行速度的改善(一个参与者提高了9.2%)和步频的提升(提高了3.6%),对步态对称性的影响则呈现出混合效果。运动学测量表明,不同配置下膝关节的活动范围和峰值速度相似。对坡道和楼梯的额外测试确认了控制策略在多种运动任务中的稳健性。这些初步发现表明,膝上安置在功能上是可行的,且谨慎的质量分布可以保留动力辅助的益处,同时减轻附加重量的不利影响。需要进一步的研究来确认这些趋势,并指导设计和临床建议。
cs.RO / 23 / 2602.17515

RA-Nav: A Risk-Aware Navigation System Based on Semantic Segmentation for Aerial Robots in Unpredictable Environments

RA-Nav:基于语义分割的风险感知导航系统,用于不可预测环境中的空中机器人
Zong, Ziyi, Dong, Xin, Xiang, Jinwu, Li, Daochun, Tu, Zhan
Abstract
Existing aerial robot navigation systems typically plan paths around static and dynamic obstacles, but fail to adapt when a static obstacle suddenly moves. Integrating environmental semantic awareness enables estimation of potential risks posed by suddenly moving obstacles. In this paper, we propose RA-Nav, a risk-aware navigation framework based on semantic segmentation. A lightweight multi-scale semantic segmentation network identifies obstacle categories in real time. These obstacles are further classified into three types: stationary, temporarily static, and dynamic. For each type, corresponding risk estimation functions are designed to enable real-time risk prediction, based on which a complete local risk map is constructed. Based on this map, the risk-informed path search algorithm is designed to guarantee planning that balances path efficiency and safety. Trajectory optimization is then applied to generate trajectories that are safe, smooth, and dynamically feasible. Comparative simulations demonstrate that RA-Nav achieves higher success rates than baselines in sudden obstacle state transition scenarios. Its effectiveness is further validated in simulations using real-world data.
Chinese Translation
现有的空中机器人导航系统通常围绕静态和动态障碍物规划路径,但在静态障碍物突然移动时无法适应。整合环境语义感知能够估计突然移动障碍物带来的潜在风险。本文提出了RA-Nav,一个基于语义分割的风险感知导航框架。一个轻量级的多尺度语义分割网络实时识别障碍物类别。这些障碍物进一步被分类为三种类型:静态、暂时静止和动态。针对每种类型,设计了相应的风险估计函数,以实现实时风险预测,并基于此构建完整的局部风险地图。基于该地图,设计了风险知情路径搜索算法,以保证规划在路径效率和安全性之间的平衡。随后应用轨迹优化生成安全、平滑且动态可行的轨迹。比较模拟表明,RA-Nav在突发障碍物状态转变场景中实现了比基线更高的成功率。其有效性在使用真实世界数据的模拟中得到了进一步验证。
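A risk-informed path search of the kind described can be sketched as a Dijkstra search whose step cost blends travel distance with a per-cell risk estimate. The grid encoding and the weight `w_risk` are assumptions of this sketch, not the paper's formulation:

```python
import heapq

def risk_aware_path(grid_risk, start, goal, w_risk=5.0):
    """Dijkstra on a 4-connected grid: each step costs
    1 + w_risk * risk of the target cell, so high-risk cells are
    traversable but penalized. Returns the path as a list of cells."""
    rows, cols = len(grid_risk), len(grid_risk[0])
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, cell = heapq.heappop(pq)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue                      # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + 1.0 + w_risk * grid_risk[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(pq, (nd, (nr, nc)))
    path, cur = [goal], goal
    while cur != start:                   # walk back along predecessors
        cur = prev[cur]
        path.append(cur)
    return path[::-1]
```

Raising `w_risk` trades path efficiency for safety, the same balance the abstract attributes to the risk-informed search.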
cs.RO / 24 / 2602.17537

IRIS: Learning-Driven Task-Specific Cinema Robot Arm for Visuomotor Motion Control

IRIS:面向视觉运动控制的学习驱动任务特定电影摄影机器人臂
Cheng, Qilong, Mackay, Matthew, Bereyhi, Ali
Abstract
Robotic camera systems enable dynamic, repeatable motion beyond human capabilities, yet their adoption remains limited by the high cost and operational complexity of industrial-grade platforms. We present the Intelligent Robotic Imaging System (IRIS), a task-specific 6-DOF manipulator designed for autonomous, learning-driven cinematic motion control. IRIS integrates a lightweight, fully 3D-printed hardware design with a goal-conditioned visuomotor imitation learning framework based on Action Chunking with Transformers (ACT). The system learns object-aware and perceptually smooth camera trajectories directly from human demonstrations, eliminating the need for explicit geometric programming. The complete platform costs under $1,000 USD, supports a 1.5 kg payload, and achieves approximately 1 mm repeatability. Real-world experiments demonstrate accurate trajectory tracking, reliable autonomous execution, and generalization across diverse cinematic motions.
Chinese Translation
机器人摄像系统能够实现超越人类能力的动态、可重复运动,但其应用仍受到工业级平台高成本和操作复杂性的限制。我们提出了智能机器人成像系统(Intelligent Robotic Imaging System,IRIS),这是一种专为自主、学习驱动的电影运动控制设计的6自由度(6-DOF)操纵器。IRIS将轻量级、完全3D打印的硬件设计与基于变换器(Transformers)的动作块(Action Chunking)目标条件视觉运动模仿学习框架相结合。该系统直接从人类示范中学习对象感知和平滑的摄像机轨迹,消除了对显式几何编程的需求。整个平台成本低于1000美元,支持1.5千克的有效载荷,并实现约1毫米的重复性。实际实验表明,该系统能够准确跟踪轨迹,可靠地自主执行,并在多种电影运动中实现泛化。
cs.RO / 25 / 2602.17573

FR-GESTURE: An RGBD Dataset For Gesture-based Human-Robot Interaction In First Responder Operations

FR-GESTURE:用于第一响应者操作中的基于手势的人机交互的RGBD数据集
Foteinos, Konstantinos, Angelidis, Georgios, Psiris, Aggelos, Argyriou, Vasileios, Sarigiannidis, Panagiotis, Papadopoulos, Georgios Th.
Abstract
The ever-increasing intensity and number of disasters make the work of First Responders (FRs) even more difficult. Artificial intelligence and robotics solutions could facilitate their operations, compensating for these difficulties. To this end, we propose a dataset for gesture-based UGV control by FRs, introducing a set of 12 commands inspired by existing FR gestures and tactical hand signals, refined after incorporating feedback from experienced FRs. We then proceed with the data collection itself, resulting in 3312 RGBD pairs captured from 2 viewpoints and 7 distances. To the best of our knowledge, this is the first dataset specifically intended for gesture-based UGV guidance by FRs. Finally, we define evaluation protocols for our RGBD dataset, termed FR-GESTURE, and perform baseline experiments, which are put forward for improvement. We have made the data publicly available to promote future research in the domain: https://doi.org/10.5281/zenodo.18131333.
Chinese Translation
灾害的强度和数量不断增加,使第一响应者(FRs)的工作变得更加困难。人工智能和机器人解决方案可以促进他们的操作,弥补这些困难。为此,我们提出了一个用于FRs基于手势的无人地面车辆(UGV)控制的数据集,介绍了一套12个命令,这些命令灵感来源于FRs使用的现有手势和战术手势,并在吸收了经验丰富的FRs反馈后进行了完善。接下来,我们进行了数据收集,最终获得了3312对RGBD数据,从2个视点和7个距离捕获。根据我们所知,这是第一个特别针对FRs基于手势的UGV引导的数据集。最后,我们为我们的RGBD数据集(称为FR-GESTURE)定义了评估协议,并进行了基线实验,提出了改进建议。我们已将数据公开,以促进该领域未来的研究:https://doi.org/10.5281/zenodo.18131333。
cs.RO / 26 / 2602.17574

Hybrid System Planning using a Mixed-Integer ADMM Heuristic and Hybrid Zonotopes

使用混合整数ADMM启发式算法和混合区域体的混合系统规划
Robbins, Joshua A., Thompson, Andrew F., Glunt, Jonah J., Pangborn, Herschel C.
Abstract
Embedded optimization-based planning for hybrid systems is challenging due to the use of mixed-integer programming, which is computationally intensive and often sensitive to the specific numerical formulation. To address that challenge, this article proposes a framework for motion planning of hybrid systems that pairs hybrid zonotopes - an advanced set representation - with a new alternating direction method of multipliers (ADMM) mixed-integer programming heuristic. A general treatment of piecewise affine (PWA) system reachability analysis using hybrid zonotopes is presented and extended to formulate optimal planning problems. Sets produced using the proposed identities have lower memory complexity and tighter convex relaxations than equivalent sets produced from preexisting techniques. The proposed ADMM heuristic makes efficient use of the hybrid zonotope structure. For planning problems formulated as hybrid zonotopes, the proposed heuristic achieves improved convergence rates as compared to state-of-the-art mixed-integer programming heuristics. The proposed methods for hybrid system planning on embedded hardware are experimentally applied in a combined behavior and motion planning scenario for autonomous driving.
Chinese Translation
基于嵌入优化的混合系统规划面临挑战,因为混合整数规划计算量大且通常对特定的数值形式敏感。为了解决这一挑战,本文提出了一种混合系统运动规划框架,该框架将混合区域体(hybrid zonotopes)——一种先进的集合表示——与新的交替方向乘子法(ADMM)混合整数规划启发式算法相结合。本文展示了使用混合区域体进行分段仿射(PWA)系统可达性分析的一般处理方法,并扩展到最优规划问题的公式化。使用所提出的标识生成的集合具有比现有技术生成的等效集合更低的内存复杂度和更紧的凸松弛。所提出的ADMM启发式算法有效利用了混合区域体的结构。对于以混合区域体形式公式化的规划问题,所提出的启发式算法相比于最先进的混合整数规划启发式算法实现了更好的收敛速度。所提出的混合系统规划方法在嵌入式硬件上被实验应用于自动驾驶的联合行为和运动规划场景。
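The flavor of an ADMM heuristic for mixed-integer programs can be shown on a scalar toy problem: minimize (x - 0.8)^2 subject to x in {0, 1}, splitting into a continuous x-update and an integer projection z-update. This is a generic textbook-style sketch, not the paper's hybrid-zonotope-structured algorithm:

```python
def admm_mip_toy(rho=1.0, iters=60):
    """ADMM heuristic for: min (x - 0.8)^2  s.t.  x in {0, 1}.
    x-update: exact minimizer of the augmented Lagrangian in x;
    z-update: projection (rounding) of x + u onto the integer set;
    u accumulates the scaled consensus residual x - z."""
    x = z = u = 0.0
    for _ in range(iters):
        # argmin_x (x - 0.8)^2 + (rho/2) * (x - z + u)^2
        x = (2 * 0.8 + rho * (z - u)) / (2 + rho)
        # project x + u onto {0, 1}
        z = min(1.0, max(0.0, round(x + u)))
        u += x - z
    return x, z
```

On this instance the iteration settles on the integer point z = 1 nearest the unconstrained optimum 0.8; the heuristic offers no global guarantee, which is why structural properties such as tight convex relaxations matter for convergence in practice.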
cs.RO / 27 / 2602.17586

Conditional Flow Matching for Continuous Anomaly Detection in Autonomous Driving on a Manifold-Aware Spectral Space

在流形感知谱空间中用于自主驾驶的连续异常检测的条件流匹配
Guillen-Perez, Antonio
Abstract
Safety validation for Level 4 autonomous vehicles (AVs) is currently bottlenecked by the inability to scale the detection of rare, high-risk long-tail scenarios using traditional rule-based heuristics. We present Deep-Flow, an unsupervised framework for safety-critical anomaly detection that utilizes Optimal Transport Conditional Flow Matching (OT-CFM) to characterize the continuous probability density of expert human driving behavior. Unlike standard generative approaches that operate in unstable, high-dimensional coordinate spaces, Deep-Flow constrains the generative process to a low-rank spectral manifold via a Principal Component Analysis (PCA) bottleneck. This ensures kinematic smoothness by design and enables the computation of the exact Jacobian trace for numerically stable, deterministic log-likelihood estimation. To resolve multi-modal ambiguity at complex junctions, we utilize an Early Fusion Transformer encoder with lane-aware goal conditioning, featuring a direct skip-connection to the flow head to maintain intent-integrity throughout the network. We introduce a kinematic complexity weighting scheme that prioritizes high-energy maneuvers (quantified via path tortuosity and jerk) during the simulation-free training process. Evaluated on the Waymo Open Motion Dataset (WOMD), our framework achieves an AUC-ROC of 0.766 against a heuristic golden set of safety-critical events. More significantly, our analysis reveals a fundamental distinction between kinematic danger and semantic non-compliance. Deep-Flow identifies a critical predictability gap by surfacing out-of-distribution behaviors, such as lane-boundary violations and non-normative junction maneuvers, that traditional safety filters overlook. This work provides a mathematically rigorous foundation for defining statistical safety gates, enabling objective, data-driven validation for the safe deployment of autonomous fleets.
Chinese Translation
目前,4级自主驾驶汽车(AV)的安全验证受到传统基于规则的启发式方法无法扩展检测稀有、高风险长尾场景的瓶颈。我们提出了Deep-Flow,这是一种用于安全关键异常检测的无监督框架,利用最优传输条件流匹配(Optimal Transport Conditional Flow Matching, OT-CFM)来表征专家人类驾驶行为的连续概率密度。与在不稳定的高维坐标空间中操作的标准生成方法不同,Deep-Flow通过主成分分析(Principal Component Analysis, PCA)瓶颈将生成过程限制在低秩谱流形上。这确保了运动学的平滑性,并使得能够计算精确的雅可比迹,以实现数值稳定的确定性对数似然估计。为了解决复杂交叉口的多模态歧义,我们利用了具有车道感知目标条件的早期融合Transformer编码器,并通过直接跳跃连接到流头,以在整个网络中保持意图完整性。我们引入了一种运动学复杂性加权方案,在无仿真训练过程中优先考虑高能量的机动(通过路径迂曲度和加加速度量化)。在Waymo开放运动数据集(Waymo Open Motion Dataset, WOMD)上的评估表明,我们的框架在安全关键事件的启发式黄金集上实现了0.766的AUC-ROC。更重要的是,我们的分析揭示了运动学危险与语义不合规之间的根本区别。Deep-Flow通过揭示分布外行为(如车道边界违规和非规范交叉口机动)识别了一个关键的可预测性差距,这些行为是传统安全过滤器所忽视的。这项工作为定义统计安全门提供了数学严谨的基础,使得能够对自主车队的安全部署进行客观、数据驱动的验证。
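The core conditional flow-matching regression target can be illustrated independently of the architecture: given a data point in the reduced (e.g. PCA) space, sample a Gaussian source point and a time, and regress the model's velocity toward x1 - x0. This sketch shows only that target construction in its simplest independent-coupling form; the paper's OT variant additionally couples x0 and x1 via optimal transport, which this illustration omits:

```python
import random

def cfm_training_pair(x1):
    """One conditional flow-matching training pair for a data point x1
    (a list of floats in the reduced spectral space): sample a Gaussian
    source point x0 and a time t, and return the interpolant
    x_t = (1 - t) * x0 + t * x1 together with the regression target
    velocity u_t = x1 - x0 that the flow network must predict."""
    x0 = [random.gauss(0.0, 1.0) for _ in x1]
    t = random.random()
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    ut = [b - a for a, b in zip(x0, x1)]
    return t, xt, ut
```

Training minimizes the squared error between the network's output at (xt, t) and ut; because the construction is simulation-free, no ODE is integrated during training.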
cs.RO / 28 / 2602.17601

Graph Neural Model Predictive Control for High-Dimensional Systems

高维系统的图神经模型预测控制
Eberhard, Patrick Benito, Pabon, Luis, Gammelli, Daniele, Buurmeijer, Hugo, Lahr, Amon, Leone, Mark, Carron, Andrea, Pavone, Marco
Abstract
The control of high-dimensional systems, such as soft robots, requires models that faithfully capture complex dynamics while remaining computationally tractable. This work presents a framework that integrates Graph Neural Network (GNN)-based dynamics models with structure-exploiting Model Predictive Control to enable real-time control of high-dimensional systems. By representing the system as a graph with localized interactions, the GNN preserves sparsity, while a tailored condensing algorithm eliminates state variables from the control problem, ensuring efficient computation. The complexity of our condensing algorithm scales linearly with the number of system nodes, and leverages Graphics Processing Unit (GPU) parallelization to achieve real-time performance. The proposed approach is validated in simulation and experimentally on a physical soft robotic trunk. Results show that our method scales to systems with up to 1,000 nodes at 100 Hz in closed-loop, and demonstrates real-time reference tracking on hardware with sub-centimeter accuracy, outperforming baselines by 63.6%. Finally, we show the capability of our method to achieve effective full-body obstacle avoidance.
Chinese Translation
高维系统的控制,如软机器人,需要能够真实捕捉复杂动态的模型,同时保持计算上的可行性。本研究提出了一个框架,将基于图神经网络(Graph Neural Network, GNN)的动态模型与结构利用的模型预测控制(Model Predictive Control, MPC)相结合,以实现高维系统的实时控制。通过将系统表示为具有局部交互的图,GNN 保持了稀疏性,而定制的压缩算法则从控制问题中消除了状态变量,确保了计算的高效性。我们的压缩算法的复杂度与系统节点的数量呈线性关系,并利用图形处理单元(Graphics Processing Unit, GPU)的并行化实现实时性能。所提出的方法在仿真和物理软机器人躯干的实验中得到了验证。结果表明,我们的方法能够以 100 Hz 的闭环频率扩展到多达 1,000 个节点,并在硬件上实现亚厘米精度的实时参考跟踪,超越基线性能 63.6%。最后,我们展示了我们的方法在有效的全身障碍物规避方面的能力。
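The graph locality that keeps the dynamics (and hence the condensing step) sparse can be illustrated with a hand-written one-step update on a small spring graph. This stands in for the learned GNN, which the paper trains rather than hard-codes; the spring law and parameters are illustrative assumptions:

```python
def gnn_step(x, edges, dt=0.02, k=1.0):
    """One explicit dynamics step on a graph of nodes, each with state
    (position, velocity). Neighbors exchange spring-like messages along
    edges, so each node's update depends only on its neighborhood --
    the locality that keeps the system's Jacobian sparse."""
    n = len(x)
    force = [0.0] * n
    for i, j in edges:
        f = k * (x[j][0] - x[i][0])   # message along edge (i, j)
        force[i] += f
        force[j] -= f                 # equal and opposite on the neighbor
    return [(p + dt * v, v + dt * force[idx])
            for idx, (p, v) in enumerate(x)]
```

Because each node touches only its incident edges, the cost of one step grows linearly in the number of nodes, mirroring the linear scaling the abstract claims for the condensing algorithm.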
计算机视觉 (Computer Vision)
50
cs.CV / 1 / 2602.16713

Three-dimensional Damage Visualization of Civil Structures via Gaussian Splatting-enabled Digital Twins

通过高斯喷溅技术实现土木结构的三维损伤可视化
Wang, Shuo, Wang, Shuo, Nie, Xin, Narazaki, Yasutaka, Matiki, Thomas, Spencer Jr, Billie F.
Abstract
Recent advancements in civil infrastructure inspections underscore the need for precise three-dimensional (3D) damage visualization on digital twins, transcending traditional 2D image-based damage identifications. Compared to conventional photogrammetric 3D reconstruction techniques, modern approaches such as Neural Radiance Field (NeRF) and Gaussian Splatting (GS) excel in scene representation, rendering quality, and handling featureless regions. Among them, GS stands out for its efficiency, leveraging discrete anisotropic 3D Gaussians to represent radiance fields, unlike NeRF's continuous implicit model. This study introduces a GS-enabled digital twin method tailored for effective 3D damage visualization. The method's key contributions include: 1) utilizing GS-based 3D reconstruction to visualize 2D damage segmentation results while reducing segmentation errors; 2) developing a multi-scale reconstruction strategy to balance efficiency and damage detail; 3) enabling digital twin updates as damage evolves over time. Demonstrated on an open-source synthetic dataset for post-earthquake inspections, the proposed approach offers a promising solution for comprehensive 3D damage visualization in civil infrastructure digital twins.
Chinese Translation
近年来,土木基础设施检查的进展凸显了在数字孪生上进行精确三维(3D)损伤可视化的必要性,这超越了传统的基于二维图像的损伤识别。与传统的摄影测量3D重建技术相比,现代方法如神经辐射场(Neural Radiance Field, NeRF)和高斯喷溅(Gaussian Splatting, GS)在场景表示、渲染质量和处理无特征区域方面表现优越。其中,GS因其高效性而脱颖而出,利用离散各向异性的3D高斯函数来表示辐射场,而不是NeRF的连续隐式模型。本研究提出了一种基于GS的数字孪生方法,旨在有效实现3D损伤可视化。该方法的主要贡献包括:1)利用基于GS的3D重建可视化二维损伤分割结果,同时减少分割错误;2)开发多尺度重建策略以平衡效率与损伤细节;3)随着损伤随时间演变,支持数字孪生的更新。该方法在一个开源合成数据集上进行了展示,适用于地震后检查,提供了一种有前景的解决方案,以实现土木基础设施数字孪生中的全面3D损伤可视化。
cs.CV / 2 / 2602.16856

Analytic Score Optimization for Multi Dimension Video Quality Assessment

多维视频质量评估的解析评分优化
Lin, Boda, Zhu, Yongjie, Qin, Wenyu, Wang, Meng, Wan, Pengfei
Abstract
Video Quality Assessment (VQA) is evolving beyond single-number mean opinion score toward richer, multi-faceted evaluations of video content. In this paper, we present a large-scale multi-dimensional VQA dataset UltraVQA that encompasses diverse User-Generated Content (UGC) annotated across five key quality dimensions: Motion Quality, Motion Amplitude, Aesthetic Quality, Content Quality, and Clarity Quality. Each video in our dataset is scored by over 3 human raters on these dimensions, with fine-grained sub-attribute labels, and accompanied by an explanatory rationale generated by GPT based on the collective human judgments. To better leverage these rich annotations and improve discrete quality score assessment, we introduce Analytic Score Optimization (ASO), a theoretically grounded post-training objective derived for multi-dimensional VQA. By reframing quality assessment as a regularized decision-making process, we obtain a closed-form solution that naturally captures the ordinal nature of human ratings, ensuring alignment with human ranking preferences. In experiments, our method outperforms most baselines including closed-source APIs and open-source models, while also reducing mean absolute error (MAE) in quality prediction. Our work highlights the importance of multi-dimensional, interpretable annotations and reinforcement-based alignment in advancing video quality assessment.
Chinese Translation
视频质量评估(VQA)正从单一的平均意见分数向更丰富、多维度的视频内容评估发展。本文提出了一个大规模的多维VQA数据集UltraVQA,该数据集涵盖了多样的用户生成内容(UGC),并在五个关键质量维度上进行了注释:运动质量、运动幅度、美学质量、内容质量和清晰度质量。我们数据集中的每个视频均由超过3名人工评审者在这些维度上进行评分,并附有细致的子属性标签,以及基于集体人类判断生成的解释性理由。为了更好地利用这些丰富的注释并改善离散质量评分评估,我们引入了解析评分优化(Analytic Score Optimization, ASO),这是一种基于理论的后训练目标,旨在多维VQA中应用。通过将质量评估重新框定为一个正则化的决策过程,我们获得了一个封闭形式的解决方案,自然捕捉了人类评分的序数特性,确保与人类排名偏好的对齐。在实验中,我们的方法在质量预测中超越了大多数基线,包括封闭源API和开源模型,同时减少了平均绝对误差(MAE)。我们的工作强调了多维度、可解释注释和基于强化的对齐在推动视频质量评估中的重要性。
cs.CV / 3 / 2602.16872

DODO: Discrete OCR Diffusion Models

DODO:离散OCR扩散模型
Man, Sean, Ganz, Roy, Ronen, Roi, Tsiper, Shahar, Mazor, Shai, Nayman, Niv
Abstract
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; those introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
Chinese Translation
光学字符识别(OCR)是信息数字化的基础任务,是视觉数据与文本理解之间的重要桥梁。尽管现代视觉-语言模型(VLM)在这一领域取得了高准确率,但它们主要依赖自回归解码,这在处理长文档时变得计算成本高昂且速度缓慢,因为每生成一个标记都需要进行顺序前向传播。我们识别出一个关键机会来克服这一瓶颈:与开放式生成不同,OCR是一项高度确定性的任务,其中视觉输入严格决定了唯一的输出序列,理论上可以通过扩散模型实现高效的并行解码。然而,我们展示了现有的掩蔽扩散模型未能充分利用这一潜力;这些模型引入的结构不稳定性在灵活任务(如图像描述)中是无害的,但对于OCR的严格精确匹配要求则是灾难性的。为了解决这一问题,我们提出了DODO,这是第一个利用块离散扩散并释放其在OCR中加速潜力的VLM。通过将生成过程分解为块,DODO减轻了全局扩散的同步误差。实证结果表明,我们的方法在实现接近最先进的准确率的同时,相较于自回归基线实现了高达3倍的推理速度提升。
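Block-wise parallel unmasking, the decoding pattern DODO builds on, can be sketched with a stand-in scorer. Here `score_fn` abstracts the model's per-position prediction and is an assumption of this illustration, not DODO's actual interface:

```python
def block_decode(score_fn, length, block=4, steps=2):
    """Decode a sequence block by block (left to right). Within each
    block, each round commits the highest-confidence half of the
    still-masked positions, so several tokens land per model call
    instead of one token per autoregressive step.

    score_fn(tokens, i) -> (token, confidence) for masked position i,
    given the partially filled sequence (None marks a mask)."""
    tokens = [None] * length
    for start in range(0, length, block):
        idxs = list(range(start, min(start + block, length)))
        for _ in range(steps):
            masked = [i for i in idxs if tokens[i] is None]
            if not masked:
                break
            preds = {i: score_fn(tokens, i) for i in masked}
            by_conf = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            for i in by_conf[: max(1, len(masked) // 2)]:
                tokens[i] = preds[i][0]
        for i in idxs:  # fill any position the rounds left masked
            if tokens[i] is None:
                tokens[i] = score_fn(tokens, i)[0]
    return tokens
```

Restricting each round's commitments to one block at a time is what limits the synchronization errors that the abstract attributes to fully global diffusion decoding.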
cs.CV / 4 / 2602.16915

StereoAdapter-2: Globally Structure-Consistent Underwater Stereo Depth Estimation

StereoAdapter-2:全球结构一致的水下立体深度估计
Ren, Zeyu, Li, Xiang, Wang, Yiran, Zhang, Zeyu, Tang, Hao
Abstract
Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks with 17% improvement on TartanAir-UW and 7.2% improvement on SQUID, and real-world validation on the BlueROV2 platform demonstrates the robustness of our approach. Code: https://github.com/AIGeeksGroup/StereoAdapter-2. Website: https://aigeeksgroup.github.io/StereoAdapter-2.
Chinese Translation
立体深度估计是水下机器人感知的基础,但由于波长依赖的光衰减、散射和折射,面临严重的领域转移问题。近期的方法利用基于单目基础模型的GRU(门控循环单元)迭代精炼进行水下适应;然而,GRU中的顺序门控和局部卷积核需要多次迭代才能实现远程视差传播,这限制了在大视差和无纹理水下区域的性能。在本文中,我们提出了StereoAdapter-2,它用基于选择性状态空间模型的新型ConvSS2D操作符替代了传统的ConvGRU更新器。所提出的操作符采用四向扫描策略,自然与极线几何对齐,同时捕捉垂直结构一致性,使得在单次更新步骤中以线性计算复杂度实现高效的远程空间传播。此外,我们构建了UW-StereoDepth-80K,这是一个大规模合成水下立体数据集,具有多样的基线、衰减系数和散射参数,通过结合语义感知风格迁移和几何一致的新视图合成的两阶段生成管道而成。结合从StereoAdapter继承的动态LoRA适应,我们的框架在水下基准测试中实现了最先进的零样本性能,在TartanAir-UW上提高了17%,在SQUID上提高了7.2%,并在BlueROV2平台上的实际验证展示了我们方法的鲁棒性。代码:https://github.com/AIGeeksGroup/StereoAdapter-2。网站:https://aigeeksgroup.github.io/StereoAdapter-2。
cs.CV / 5 / 2602.16917

SemCovNet: Towards Fair and Semantic Coverage-Aware Learning for Underrepresented Visual Concepts

SemCovNet:面向代表性不足视觉概念的公平与语义覆盖感知学习
Ahammed, Sakib, Cui, Xia, Fan, Xinqi, Lu, Wenqi, Yap, Moi Hoon
Abstract
Modern vision models increasingly rely on rich semantic representations that extend beyond class labels to include descriptive concepts and contextual attributes. However, existing datasets exhibit Semantic Coverage Imbalance (SCI), a previously overlooked bias arising from the long-tailed semantic representations. Unlike class imbalance, SCI occurs at the semantic level, affecting how models learn and reason about rare yet meaningful semantics. To mitigate SCI, we propose Semantic Coverage-Aware Network (SemCovNet), a novel model that explicitly learns to correct semantic coverage disparities. SemCovNet integrates a Semantic Descriptor Map (SDM) for learning semantic representations, a Descriptor Attention Modulation (DAM) module that dynamically weights visual and concept features, and a Descriptor-Visual Alignment (DVA) loss that aligns visual features with descriptor semantics. We quantify semantic fairness using a Coverage Disparity Index (CDI), which measures the alignment between coverage and error. Extensive experiments across multiple datasets demonstrate that SemCovNet enhances model reliability and substantially reduces CDI, achieving fairer and more equitable performance. This work establishes SCI as a measurable and correctable bias, providing a foundation for advancing semantic fairness and interpretable vision learning.
Chinese Translation
现代视觉模型越来越依赖于丰富的语义表示,这些表示超越了类别标签,包含描述性概念和上下文属性。然而,现有数据集表现出语义覆盖不平衡(Semantic Coverage Imbalance, SCI),这是一种因长尾语义表示而产生的先前被忽视的偏差。与类别不平衡不同,SCI发生在语义层面,影响模型对稀有但重要语义的学习和推理。为了缓解SCI,我们提出了语义覆盖感知网络(Semantic Coverage-Aware Network, SemCovNet),这是一个新颖的模型,明确学习纠正语义覆盖差异。SemCovNet集成了语义描述图(Semantic Descriptor Map, SDM)用于学习语义表示,一个描述符注意力调制(Descriptor Attention Modulation, DAM)模块动态加权视觉和概念特征,以及一个描述符-视觉对齐(Descriptor-Visual Alignment, DVA)损失,用于将视觉特征与描述符语义对齐。我们使用覆盖差异指数(Coverage Disparity Index, CDI)量化语义公平性,该指数测量覆盖与错误之间的对齐。通过在多个数据集上的广泛实验,SemCovNet提高了模型的可靠性,并显著降低了CDI,实现了更公平和更平等的性能。这项工作将SCI确立为一种可测量和可纠正的偏差,为推动语义公平性和可解释的视觉学习奠定了基础。
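The abstract does not give the CDI formula, so as a clearly hypothetical stand-in, coverage-error alignment could be measured as the Pearson correlation between per-concept coverage and per-concept error; a strongly negative value (rare concepts erring more) would indicate semantic coverage imbalance:

```python
import math

def coverage_disparity_index(coverage, error):
    """Hypothetical CDI: Pearson correlation between per-concept coverage
    and per-concept error. The paper's exact definition is not given in
    the abstract; this is an illustrative instantiation only."""
    n = len(coverage)
    mc = sum(coverage) / n
    me = sum(error) / n
    cov = sum((c - mc) * (e - me) for c, e in zip(coverage, error))
    sc = math.sqrt(sum((c - mc) ** 2 for c in coverage))
    se = math.sqrt(sum((e - me) ** 2 for e in error))
    return cov / (sc * se)
```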
cs.CV / 6 / 2602.16918

Xray-Visual Models: Scaling Vision models on Industry Scale Data

Xray-Visual模型:在工业规模数据上扩展视觉模型
Mishra, Shlok, Lin, Tsung-Yu, Wang, Linda, Xu, Hongli, Liu, Yimin, Hsu, Michael, Ahuja, Chaitanya, Yuan, Hao, Cheng, Jianpeng, Chen, Hong-You, Xu, Haoyuan, Li, Chao, Awasthi, Abhijeet, Moon, Jihye, Husa, Don, Ge, Michael, Singla, Sumedha, Chowdhury, Arkabandhu, Dingh, Phong, Shukla, Satya Narayan, Yang, Yonghuan, Jacobs, David, Guo, Qi, Xiao, Jun, Fan, Xiangjun, Singh, Aashu
Abstract
We present Xray-Visual, a unified vision model architecture for large-scale image and video understanding trained on industry-scale social media data. Our model leverages over 15 billion curated image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, employing robust data curation pipelines that incorporate balancing and noise suppression strategies to maximize semantic diversity while minimizing label noise. We introduce a three-stage training pipeline that combines self-supervised MAE, semi-supervised hashtag classification, and CLIP-style contrastive learning to jointly optimize image and video modalities. Our architecture builds on a Vision Transformer backbone enhanced with efficient token reorganization (EViT) for improved computational efficiency. Extensive experiments demonstrate that Xray-Visual achieves state-of-the-art performance across diverse benchmarks, including ImageNet for image classification, Kinetics and HMDB51 for video understanding, and MSCOCO for cross-modal retrieval. The model exhibits strong robustness to domain shift and adversarial perturbations. We further demonstrate that integrating large language models as text encoders (LLM2CLIP) significantly enhances retrieval performance and generalization capabilities, particularly in real-world environments. Xray-Visual establishes new benchmarks for scalable, multimodal vision models, while maintaining superior accuracy and computational efficiency.
Chinese Translation
我们提出了Xray-Visual,一种统一的视觉模型架构,用于在工业规模的社交媒体数据上进行大规模图像和视频理解。我们的模型利用了来自Facebook和Instagram的超过150亿对精心策划的图像-文本对和100亿对视频-标签对,采用了强大的数据策划流程,结合了平衡和噪声抑制策略,以最大化语义多样性,同时最小化标签噪声。我们引入了一个三阶段的训练流程,结合了自监督的MAE、半监督的标签分类和CLIP风格的对比学习,以共同优化图像和视频模态。我们的架构基于视觉变换器(Vision Transformer)主干,并增强了高效的令牌重组(EViT),以提高计算效率。大量实验表明,Xray-Visual在多个基准测试中实现了最先进的性能,包括用于图像分类的ImageNet、用于视频理解的Kinetics和HMDB51,以及用于跨模态检索的MSCOCO。该模型对领域转移和对抗扰动表现出强大的鲁棒性。我们进一步证明,将大型语言模型作为文本编码器(LLM2CLIP)集成显著提高了检索性能和泛化能力,特别是在现实世界环境中。Xray-Visual为可扩展的多模态视觉模型建立了新的基准,同时保持了卓越的准确性和计算效率。
cs.CV / 7 / 2602.16950

HS-3D-NeRF: 3D Surface and Hyperspectral Reconstruction From Stationary Hyperspectral Images Using Multi-Channel NeRFs

HS-3D-NeRF:使用多通道NeRF从静态高光谱图像进行3D表面与高光谱重建
Ku, Kibon, Jubery, Talukder Z., Krishnamurthy, Adarsh, Ganapathysubramanian, Baskar
Abstract
Advances in hyperspectral imaging (HSI) and 3D reconstruction have enabled accurate, high-throughput characterization of agricultural produce quality and plant phenotypes, both essential for advancing agricultural sustainability and breeding programs. HSI captures detailed biochemical features of produce, while 3D geometric data substantially improves morphological analysis. However, integrating these two modalities at scale remains challenging, as conventional approaches involve complex hardware setups incompatible with automated phenotyping systems. Recent advances in neural radiance fields (NeRF) offer computationally efficient 3D reconstruction but typically require moving-camera setups, limiting throughput and reproducibility in standard indoor agricultural environments. To address these challenges, we introduce HSI-SC-NeRF, a stationary-camera multi-channel NeRF framework for high-throughput hyperspectral 3D reconstruction targeting postharvest inspection of agricultural produce. Multi-view hyperspectral data is captured using a stationary camera while the object rotates within a custom-built Teflon imaging chamber providing diffuse, uniform illumination. Object poses are estimated via ArUco calibration markers and transformed to the camera frame of reference through simulated pose transformations, enabling standard NeRF training on stationary-camera data. A multi-channel NeRF formulation optimizes reconstruction across all hyperspectral bands jointly using a composite spectral loss, supported by a two-stage training protocol that decouples geometric initialization from radiometric refinement. Experiments on three agricultural produce samples demonstrate high spatial reconstruction accuracy and strong spectral fidelity across the visible and near-infrared spectrum, confirming the suitability of HSI-SC-NeRF for integration into automated agricultural workflows.
Chinese Translation
高光谱成像(HSI)和3D重建的进展使得农业产品质量和植物表型的准确、高通量表征成为可能,这对于推动农业可持续性和育种项目至关重要。HSI捕捉了产品的详细生化特征,而3D几何数据显著改善了形态分析。然而,大规模集成这两种模态仍然具有挑战性,因为传统方法涉及复杂的硬件设置,与自动表型系统不兼容。最近在神经辐射场(NeRF)方面的进展提供了计算上高效的3D重建,但通常需要移动相机设置,限制了在标准室内农业环境中的通量和可重复性。为了解决这些挑战,我们提出了HSI-SC-NeRF,一个静态相机多通道NeRF框架,旨在实现高通量的高光谱3D重建,针对农业产品的采后检测。使用静态相机捕获多视角高光谱数据,同时物体在定制的特氟龙成像室内旋转,提供均匀的漫射照明。通过ArUco校准标记估计物体姿态,并通过模拟姿态变换将其转换到相机参考框架,使得在静态相机数据上进行标准NeRF训练成为可能。多通道NeRF公式通过复合光谱损失联合优化所有高光谱波段的重建,支持一个两阶段的训练协议,将几何初始化与辐射精化解耦。对三种农业产品样本的实验表明,在可见光和近红外光谱范围内具有高空间重建精度和强光谱保真度,确认了HSI-SC-NeRF适合集成到自动化农业工作流程中。
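The "simulated pose transformation" idea rests on a standard equivalence: a fixed camera observing a rotating object is the same as a moving camera observing a fixed object. A minimal sketch with plain 4x4 rigid transforms, where the function names and conventions are assumptions rather than the paper's code:

```python
import math

def rotz(theta):
    """4x4 rotation about the z axis (e.g. a turntable step)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Inverse of a rigid transform: rotation transposed, translation -R^T p."""
    r = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    ip = [-sum(r[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r[0] + [ip[0]], r[1] + [ip[1]], r[2] + [ip[2]], [0, 0, 0, 1]]

def virtual_camera_pose(cam_pose_world, obj_rot_world):
    """If the object rotates by obj_rot_world while the camera stays fixed,
    the equivalent virtual camera pose in the object's frame applies the
    inverse rotation to the fixed camera pose."""
    return matmul(invert_rigid(obj_rot_world), cam_pose_world)
```

Applying this per turntable step yields the moving-camera trajectory that standard NeRF training expects.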
cs.CV / 8 / 2602.16968

DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers

DDiT:用于高效扩散变换器的动态补丁调度
Kim, Dahye, Ghadiyaram, Deepti, Gadde, Raghudeep
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan $2.1$, respectively, without compromising the generation quality and prompt adherence.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)在图像和视频生成方面取得了最先进的性能,但其成功的代价是巨大的计算开销。这种低效主要源于固定的标记化过程,该过程在整个去噪阶段使用恒定大小的补丁,而不考虑内容的复杂性。我们提出了动态标记化,这是一种高效的测试时间策略,根据内容复杂性和去噪时间步变化补丁大小。我们的关键见解是,早期时间步只需要较粗的补丁来建模全局结构,而后期迭代则需要更细(小尺寸)的补丁来细化局部细节。在推理过程中,我们的方法在图像和视频生成的去噪步骤中动态重新分配补丁大小,显著降低了成本,同时保持感知生成质量。大量实验表明我们的方法的有效性:在FLUX-1.Dev和Wan $2.1$上分别实现了高达$3.52\times$和$3.2\times$的加速,而不影响生成质量和提示遵循性。
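The coarse-to-fine intuition can be sketched as a timestep-to-patch-size schedule. The breakpoints and sizes below are illustrative assumptions; the paper's scheduling is also content-adaptive, which this sketch omits:

```python
def patch_size_schedule(t, num_steps, sizes=(4, 2, 1)):
    """Map a denoising step t (0 = noisiest) to a patch size: coarse
    patches early to model global structure, fine patches late to refine
    local detail. Breakpoints and sizes are illustrative, not the paper's."""
    frac = t / max(num_steps - 1, 1)
    if frac < 1 / 3:
        return sizes[0]
    if frac < 2 / 3:
        return sizes[1]
    return sizes[2]
```

Larger patches mean fewer tokens per image, so early steps run much cheaper than late ones.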
cs.CV / 9 / 2602.16979

Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

通过监督潜变量建模表征模态的预测影响
Madaan, Divyam, Chopra, Sumit, Cho, Kyunghyun
Abstract
Despite the recent success of Multimodal Large Language Models (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.
Chinese Translation
尽管多模态大型语言模型(MLLMs)最近取得了成功,但现有方法主要假设在训练和推理过程中可以获得多种模态。实际上,多模态数据往往是不完整的,因为模态可能缺失、异步收集,或者仅对一部分示例可用。在本研究中,我们提出了PRIMO,一种监督潜变量插补模型,用于量化在多模态学习环境中任何缺失模态的预测影响。PRIMO使得可以使用所有可用的训练示例,无论模态是完整还是部分。具体而言,它通过一个潜变量来建模缺失模态,该潜变量捕捉其与观察到的模态在预测上下文中的关系。在推理过程中,我们从学习到的缺失模态分布中抽取多个样本,以获得边际预测分布(用于预测目的)并分析缺失模态对每个实例预测的影响。我们在合成XOR数据集、音频-视觉MNIST和MIMIC-III(用于死亡率和ICD-9预测)上评估了PRIMO。在所有数据集中,当某一模态完全缺失时,PRIMO的性能与单模态基线相当,而当所有模态可用时,其性能与多模态基线相当。PRIMO使用基于方差的度量量化模态在实例级别的预测影响,该度量是通过对潜在补全的预测计算得出的。我们直观地展示了缺失模态的不同补全如何导致一组合理的标签。
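The variance-based, instance-level impact metric can be sketched as follows, assuming a hypothetical `predict` callable that maps a sampled latent completion of the missing modality to a class probability:

```python
def modality_impact(predict, latent_samples):
    """Instance-level impact of a missing modality: draw latent completions,
    score each, and return the mean prediction (for the marginal predictive
    distribution) and the variance across completions (high variance means
    the missing modality matters for this instance). The `predict` callable
    is a hypothetical stand-in for PRIMO's model internals."""
    probs = [predict(z) for z in latent_samples]
    n = len(probs)
    mean = sum(probs) / n
    var = sum((p - mean) ** 2 for p in probs) / n
    return mean, var
```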
cs.CV / 10 / 2602.17030

Patch-Based Spatial Authorship Attribution in Human-Robot Collaborative Paintings

基于补丁的空间作者归属在人机协作绘画中的应用
Chen, Eric, Alves-Oliveira, Patricia
Abstract
As agentic AI becomes increasingly involved in creative production, documenting authorship has become critical for artists, collectors, and legal contexts. We present a patch-based framework for spatial authorship attribution within human-robot collaborative painting practice, demonstrated through a forensic case study of one human artist and one robotic system across 15 abstract paintings. Using commodity flatbed scanners and leave-one-painting-out cross-validation, the approach achieves 88.8% patch-level accuracy (86.7% painting-level via majority vote), outperforming texture-based and pretrained-feature baselines (68.0%-84.7%). For collaborative artworks, where ground truth is inherently ambiguous, we use conditional Shannon entropy to quantify stylistic overlap; manually annotated hybrid regions exhibit 64% higher uncertainty than pure paintings (p=0.003), suggesting the model detects mixed authorship rather than classification failure. The trained model is specific to this human-robot pair but provides a methodological grounding for sample-efficient attribution in data-scarce human-AI creative workflows that, in the future, has the potential to extend authorship attribution to any human-robot collaborative painting.
Chinese Translation
随着具备代理能力的人工智能在创作生产中日益参与,记录作者身份对艺术家、收藏家和法律环境变得至关重要。我们提出了一种基于补丁的框架,用于在人机协作绘画实践中进行空间作者归属,通过对一位人类艺术家和一个机器人系统在15幅抽象画作中的法医案例研究进行演示。该方法利用普通平板扫描仪和留一画作(leave-one-painting-out)交叉验证,达到了88.8%的补丁级准确率(通过多数投票实现86.7%的画作级准确率),优于基于纹理和预训练特征的基线(68.0%-84.7%)。对于协作艺术作品,由于真实情况本质上模糊,我们使用条件香农熵来量化风格重叠;手动标注的混合区域的不确定性比纯画作高出64%(p=0.003),这表明模型检测到的是混合作者身份,而非分类失败。训练后的模型特定于这对人类与机器人,但为在数据稀缺的人机创作工作流程中实现样本高效的作者归属提供了方法论基础,未来有潜力将作者归属扩展到任何人机协作绘画中。
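The painting-level majority vote and the entropy-based uncertainty measure from the abstract can be sketched directly. This uses plain Shannon entropy over per-patch class distributions; the paper uses a conditional variant:

```python
import math
from collections import Counter

def majority_vote(patch_labels):
    """Painting-level attribution from patch-level predictions."""
    return Counter(patch_labels).most_common(1)[0][0]

def mean_patch_entropy(patch_probs):
    """Mean Shannon entropy (bits) of per-patch class distributions.
    Higher values over a region suggest mixed authorship rather than a
    confident single-author prediction."""
    def h(p):
        return -sum(q * math.log2(q) for q in p if q > 0)
    return sum(h(p) for p in patch_probs) / len(patch_probs)
```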
cs.CV / 11 / 2602.17033

PartRAG: Retrieval-Augmented Part-Level 3D Generation and Editing

PartRAG:检索增强的部件级3D生成与编辑
Li, Peize, Zhang, Zeyu, Tang, Hao
Abstract
Single-image 3D generation with part-level structure remains challenging: learned priors struggle to cover the long tail of part geometries and maintain multi-view consistency, and existing systems provide limited support for precise, localized edits. We present PartRAG, a retrieval-augmented framework that integrates an external part database with a diffusion transformer to couple generation with an editable representation. To overcome the first challenge, we introduce a Hierarchical Contrastive Retrieval module that aligns dense image patches with 3D part latents at both part and object granularity, retrieving from a curated bank of 1,236 part-annotated assets to inject diverse, physically plausible exemplars into denoising. To overcome the second challenge, we add a masked, part-level editor that operates in a shared canonical space, enabling swaps, attribute refinements, and compositional updates without regenerating the whole object while preserving non-target parts and multi-view consistency. PartRAG achieves competitive results on Objaverse, ShapeNet, and ABO-reducing Chamfer Distance from 0.1726 to 0.1528 and raising F-Score from 0.7472 to 0.844 on Objaverse-with inference of 38s and interactive edits in 5-8s. Qualitatively, PartRAG produces sharper part boundaries, better thin-structure fidelity, and robust behavior on articulated objects. Code: https://github.com/AIGeeksGroup/PartRAG. Website: https://aigeeksgroup.github.io/PartRAG.
Chinese Translation
单图像的部件级结构3D生成仍然面临挑战:学习的先验难以覆盖部件几何形状的长尾,并保持多视图一致性,现有系统对精确的局部编辑支持有限。我们提出了PartRAG,一个检索增强的框架,它将外部部件数据库与扩散变换器相结合,以将生成与可编辑表示相结合。为了解决第一个挑战,我们引入了一个层次对比检索模块,该模块在部件和对象粒度上对齐密集图像块与3D部件潜变量,从1,236个部件注释资产的策划库中检索,以将多样的、物理上合理的示例注入去噪过程。为了解决第二个挑战,我们添加了一个在共享规范空间中操作的掩蔽部件级编辑器,使得在不重新生成整个对象的情况下能够进行部件交换、属性细化和组合更新,同时保持非目标部件和多视图一致性。PartRAG在Objaverse、ShapeNet和ABO上取得了有竞争力的结果:在Objaverse上将Chamfer距离从0.1726降低到0.1528,并将F-Score从0.7472提高到0.844;推理时间为38秒,交互编辑时间为5-8秒。从定性上看,PartRAG产生了更清晰的部件边界、更好的细结构保真度以及在关节物体上的稳健表现。代码:https://github.com/AIGeeksGroup/PartRAG。网站:https://aigeeksgroup.github.io/PartRAG。
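Retrieval from the part bank can be sketched as generic top-k cosine-similarity lookup. The paper's hierarchical contrastive scoring is not specified in the abstract, so this is an illustrative stand-in:

```python
import math

def retrieve_parts(query, bank, k=3):
    """Top-k part retrieval by cosine similarity between an image-patch
    embedding and bank entries -- a generic stand-in for the paper's
    hierarchical contrastive retrieval."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    # rank bank indices by similarity, highest first
    order = sorted(range(len(bank)), key=lambda i: cos(query, bank[i]), reverse=True)
    return order[:k]
```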
cs.CV / 12 / 2602.17047

Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers

Amber-Image:大规模扩散变换器的高效压缩
Yang, Chaojie, Li, Tian, Zhang, Yue, Gao, Jun
Abstract
Diffusion Transformer (DiT) architectures have significantly advanced Text-to-Image (T2I) generation but suffer from prohibitive computational costs and deployment barriers. To address these challenges, we propose an efficient compression framework that transforms the 60-layer dual-stream MMDiT-based Qwen-Image into lightweight models without training from scratch. Leveraging this framework, we introduce Amber-Image, a series of streamlined T2I models. We first derive Amber-Image-10B using a timestep-sensitive depth pruning strategy, where retained layers are reinitialized via local weight averaging and optimized through layer-wise distillation and full-parameter fine-tuning. Building on this, we develop Amber-Image-6B by introducing a hybrid-stream architecture that converts deep-layer dual streams into a single stream initialized from the image branch, further refined via progressive distillation and lightweight fine-tuning. Our approach reduces parameters by 70% and eliminates the need for large-scale data engineering. Notably, the entire compression and training pipeline-from the 10B to the 6B variant-requires fewer than 2,000 GPU hours, demonstrating exceptional cost-efficiency compared to training from scratch. Extensive evaluations on benchmarks like DPG-Bench and LongText-Bench show that Amber-Image achieves high-fidelity synthesis and superior text rendering, matching much larger models.
Chinese Translation
扩散变换器(Diffusion Transformer,DiT)架构在文本到图像(Text-to-Image,T2I)生成方面取得了显著进展,但面临着高昂的计算成本和部署障碍。为了解决这些挑战,我们提出了一种高效的压缩框架,将基于60层双流MMDiT的Qwen-Image转化为轻量级模型,而无需从头开始训练。利用该框架,我们引入了Amber-Image,一系列精简的T2I模型。我们首先采用时间步敏感的深度剪枝策略推导出Amber-Image-10B,其中保留的层通过局部权重平均重新初始化,并通过逐层蒸馏和全参数微调进行优化。在此基础上,我们通过引入混合流架构开发了Amber-Image-6B,该架构将深层双流转换为从图像分支初始化的单流,并通过渐进蒸馏和轻量级微调进一步优化。我们的方法将参数减少了70%,并消除了对大规模数据工程的需求。值得注意的是,从10B到6B变体的整个压缩和训练流程所需的GPU小时数不到2,000,显示出与从头训练相比的卓越成本效率。在DPG-Bench和LongText-Bench等基准上的广泛评估表明,Amber-Image实现了高保真合成和优越的文本渲染,能够与更大模型相媲美。
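The "local weight averaging" reinitialization can be sketched on toy per-layer weight vectors, with an assumed contiguous grouping of the pruned layers (real DiT layers hold many tensors, not one vector):

```python
def prune_and_reinit(layer_weights, keep):
    """Depth-prune a stack of per-layer weight vectors down to `keep`
    layers, reinitializing each retained layer as the element-wise mean
    of its contiguous group of original layers (local weight averaging).
    The contiguous grouping is an assumption for illustration."""
    n = len(layer_weights)
    groups = [layer_weights[i * n // keep:(i + 1) * n // keep]
              for i in range(keep)]
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]
```

In the paper, layer-wise distillation and fine-tuning then recover quality from this averaged starting point.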
cs.CV / 13 / 2602.17048

StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection

StructCore:一种结构感知的图像级评分方法,用于无训练的无监督异常检测
Chae, Joongwon, Luo, Lihui, Liu, Yang, Wang, Runming, Yu, Dongmei, Liang, Zeming, Yuan, Xi, Zhang, Dayan, Chen, Zhenglin, Qin, Peiwu, Chae, Ilmoon
Abstract
Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor $\phi(S)$ that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
Chinese Translation
最大池化是基于记忆库的无监督异常检测(UAD)中将异常得分图转换为图像级决策的事实标准。然而,由于它依赖于单一的极端响应,它丢弃了关于异常证据在图像中如何分布和结构化的大部分信息,常常导致正常和异常得分重叠。我们提出了StructCore,这是一种无训练的、结构感知的图像级评分方法,超越了最大池化。给定一个异常得分图,StructCore计算一个低维结构描述符$\phi(S)$,捕捉分布和空间特征,并通过从训练良好样本中估计的对角马哈拉诺比斯校准来细化图像级评分,而不修改像素级定位。StructCore在MVTec AD上实现了99.6%的图像级AUROC得分,在VisA上实现了98.4%的得分,展示了通过利用最大池化所忽略的结构特征实现的稳健图像级异常检测。
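The two pieces the abstract names, a structural descriptor phi(S) and a diagonal Mahalanobis calibration against train-good statistics, can be sketched as follows; the descriptor components (max, mean, std) are illustrative stand-ins for the paper's choices:

```python
import math

def structural_descriptor(score_map):
    """phi(S): a low-dimensional structural summary of an anomaly score
    map. The components here (max, mean, std) are illustrative stand-ins
    for the paper's descriptor."""
    n = len(score_map)
    mean = sum(score_map) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in score_map) / n)
    return [max(score_map), mean, std]

def diagonal_mahalanobis(phi, mu, var, eps=1e-8):
    """Image-level score: Mahalanobis distance of phi(S) from train-good
    statistics (per-dimension mean mu and variance var)."""
    return math.sqrt(sum((p - m) ** 2 / (v + eps)
                         for p, m, v in zip(phi, mu, var)))
```

Unlike max pooling, the calibrated distance reacts to how anomaly evidence is distributed, not just to its single largest value.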
cs.CV / 14 / 2602.17060

Cholec80-port: A Geometrically Consistent Trocar Port Segmentation Dataset for Robust Surgical Scene Understanding

Cholec80-port:一个几何一致的 trocar 端口分割数据集,用于稳健的外科场景理解
Kikuchi, Shunsuke, Kouno, Atsushi, Matsuzaki, Hiroki
Abstract
Trocar ports are camera-fixed, pseudo-static structures that can persistently occlude laparoscopic views and attract disproportionate feature points due to specular, textured surfaces. This makes ports particularly detrimental to geometry-based downstream pipelines such as image stitching, 3D reconstruction, and visual SLAM, where dynamic or non-anatomical outliers degrade alignment and tracking stability. Despite this practical importance, explicit port labels are rare in public surgical datasets, and existing annotations often violate geometric consistency by masking the central lumen (opening), even when anatomical regions are visible through it. We present Cholec80-port, a high-fidelity trocar port segmentation dataset derived from Cholec80, together with a rigorous standard operating procedure (SOP) that defines a port-sleeve mask excluding the central opening. We additionally cleanse and unify existing public datasets under the same SOP. Experiments demonstrate that geometrically consistent annotations substantially improve cross-dataset robustness beyond what dataset size alone provides.
Chinese Translation
Trocar 端口是固定于摄像机的伪静态结构,能够持续遮挡腹腔镜视图,并由于其镜面和纹理表面吸引不成比例的特征点。这使得端口对基于几何的下游处理流程(如图像拼接、3D 重建和视觉 SLAM)尤其有害,因为动态或非解剖学的异常值会降低对齐和跟踪的稳定性。尽管这一问题具有重要的实际意义,但在公共外科数据集中,明确的端口标签却很少,现有的注释往往通过遮蔽中央腔道(开口)来违反几何一致性,即使解剖区域透过该开口仍然可见。我们提出了 Cholec80-port,一个高保真度的 trocar 端口分割数据集,源自 Cholec80,并提供了一套严格的标准操作程序(SOP),定义了一个排除中央开口的端口套管掩膜。此外,我们还对现有公共数据集进行了清理和统一,使其符合相同的 SOP。实验表明,几何一致的注释显著提高了跨数据集的鲁棒性,超出了仅依赖数据集规模所能提供的效果。
cs.CV / 15 / 2602.17077

Cross Pseudo Labeling For Weakly Supervised Video Anomaly Detection

跨伪标签技术用于弱监督视频异常检测
Lee, Dayeon, Kim, Dongheyong, Park, Chaewon, Woo, Sungmin, Lee, Sangyoun
Abstract
Weakly supervised video anomaly detection aims to detect anomalies and identify abnormal categories with only video-level labels. We propose CPL-VAD, a dual-branch framework with cross pseudo labeling. The binary anomaly detection branch focuses on snippet-level anomaly localization, while the category classification branch leverages vision-language alignment to recognize abnormal event categories. By exchanging pseudo labels, the two branches transfer complementary strengths, combining temporal precision with semantic discrimination. Experiments on XD-Violence and UCF-Crime demonstrate that CPL-VAD achieves state-of-the-art performance in both anomaly detection and abnormal category classification.
Chinese Translation
弱监督视频异常检测旨在仅通过视频级标签检测异常并识别异常类别。我们提出了CPL-VAD,一个具有跨伪标签的双分支框架。二元异常检测分支专注于片段级异常定位,而类别分类分支则利用视觉-语言对齐来识别异常事件类别。通过交换伪标签,这两个分支互相传递互补优势,将时间精度与语义区分相结合。在XD-Violence和UCF-Crime数据集上的实验表明,CPL-VAD在异常检测和异常类别分类方面均实现了最先进的性能。
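One plausible (assumed) mechanism for the pseudo-label exchange described above: each branch thresholds its snippet scores into hard labels that supervise the other branch. The threshold and hard-label form are assumptions; the paper's exact label construction is not given in the abstract:

```python
def cross_pseudo_labels(scores_a, scores_b, thresh=0.5):
    """Exchange pseudo labels between two branches: branch A's snippet
    scores become hard labels supervising branch B, and vice versa, so
    each branch transfers its complementary strength to the other."""
    labels_for_b = [1 if s >= thresh else 0 for s in scores_a]
    labels_for_a = [1 if s >= thresh else 0 for s in scores_b]
    return labels_for_a, labels_for_b
```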
cs.CV / 16 / 2602.17085

ComptonUNet: A Deep Learning Model for GRB Localization with Compton Cameras under Noisy and Low-Statistic Conditions

ComptonUNet:一种用于在噪声和低统计条件下利用康普顿相机进行伽马射线暴定位的深度学习模型
Sato, Shogo, Tanaka, Kazuo, Ogasawara, Shojun, Yamamoto, Kazuki, Murasaki, Kazuhiko, Tanida, Ryuichi, Kataoka, Jun
Abstract
Gamma-ray bursts (GRBs) are among the most energetic transient phenomena in the universe and serve as powerful probes for high-energy astrophysical processes. In particular, faint GRBs originating from a distant universe may provide unique insights into the early stages of star formation. However, detecting and localizing such weak sources remains challenging owing to low photon statistics and substantial background noise. Although recent machine learning models address individual aspects of these challenges, they often struggle to balance the trade-off between statistical robustness and noise suppression. Consequently, we propose ComptonUNet, a hybrid deep learning framework that jointly processes raw data and reconstructs images for robust GRB localization. ComptonUNet was designed to operate effectively under conditions of limited photon statistics and strong background contamination by combining the statistical efficiency of direct reconstruction models with the denoising capabilities of image-based architectures. We perform realistic simulations of GRB-like events embedded in background environments representative of low-Earth orbit missions to evaluate the performance of ComptonUNet. Our results demonstrate that ComptonUNet significantly outperforms existing approaches, achieving improved localization accuracy across a wide range of low-statistic and high-background scenarios.
Chinese Translation
伽马射线暴(GRBs)是宇宙中最具能量的瞬态现象之一,是高能天体物理过程的重要探测工具。尤其是来自遥远宇宙的微弱GRBs,可能为恒星形成的早期阶段提供独特的见解。然而,由于光子统计低和背景噪声显著,检测和定位这些微弱源仍然具有挑战性。尽管近期的机器学习模型解决了这些挑战的个别方面,但它们往往难以平衡统计稳健性与噪声抑制之间的权衡。因此,我们提出了ComptonUNet,一种混合深度学习框架,能够联合处理原始数据并重建图像,以实现稳健的GRB定位。ComptonUNet旨在有效应对有限光子统计和强背景污染的条件,通过结合直接重建模型的统计效率与基于图像的架构的去噪能力。我们对嵌入在代表低地轨道任务的背景环境中的GRB类事件进行了现实模拟,以评估ComptonUNet的性能。我们的结果表明,ComptonUNet显著优于现有方法,在广泛的低统计和高背景场景中实现了更好的定位精度。
cs.CV / 17 / 2602.17124

3D Scene Rendering with Multimodal Gaussian Splatting

基于多模态高斯溅射的3D场景渲染
Gau, Chi-Shiang, Polyzos, Konstantinos D., Bacharis, Athanasios, Madhuvarasu, Saketh, Javidi, Tara
Abstract
3D scene reconstruction and rendering are core tasks in computer vision, with applications spanning industrial monitoring, robotics, and autonomous driving. Recent advances in 3D Gaussian Splatting (GS) and its variants have achieved impressive rendering fidelity while maintaining high computational and memory efficiency. However, conventional vision-based GS pipelines typically rely on a sufficient number of camera views to initialize the Gaussian primitives and train their parameters, typically incurring additional processing cost during initialization while falling short in conditions where visual cues are unreliable, such as adverse weather, low illumination, or partial occlusions. To cope with these challenges, and motivated by the robustness of radio-frequency (RF) signals to weather, lighting, and occlusions, we introduce a multimodal framework that integrates RF sensing, such as automotive radar, with GS-based rendering as a more efficient and robust alternative to vision-only GS rendering. The proposed approach enables efficient depth prediction from only sparse RF-based depth measurements, yielding a high-quality 3D point cloud for initializing Gaussian functions across diverse GS architectures. Numerical tests demonstrate the merits of judiciously incorporating RF sensing into GS pipelines, achieving high-fidelity 3D scene rendering driven by RF-informed structural accuracy.
Chinese Translation
3D场景重建和渲染是计算机视觉中的核心任务,应用范围涵盖工业监测、机器人技术和自动驾驶。最近在3D高斯溅射(Gaussian Splatting,GS)及其变体方面的进展,实现了令人印象深刻的渲染保真度,同时保持了高效的计算和内存使用。然而,传统的基于视觉的GS流程通常依赖于足够数量的相机视角来初始化高斯原语并训练其参数,这通常在初始化过程中会产生额外的处理成本,并且在视觉线索不可靠的条件下(如恶劣天气、低光照或部分遮挡)表现不佳。为应对这些挑战,并受到无线电频率(Radio-Frequency,RF)信号在天气、光照和遮挡条件下的鲁棒性启发,我们提出了一种多模态框架,将RF传感(如汽车雷达)与基于GS的渲染相结合,作为一种比仅依赖视觉的GS渲染更高效和稳健的替代方案。所提出的方法能够仅通过稀疏的RF深度测量高效预测深度,从而为在多种GS架构中初始化高斯函数生成高质量的3D点云。数值测试证明了将RF传感巧妙地融入GS流程的优点,实现了由RF信息驱动的结构精度的高保真3D场景渲染。
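Initializing Gaussians from sparse RF depth reduces, at its core, to pinhole back-projection of (u, v, depth) measurements into camera-frame 3D points. The RF-to-depth prediction model itself is not described in the abstract; this sketch covers only the standard back-projection step:

```python
def backproject(points_uv_depth, fx, fy, cx, cy):
    """Back-project sparse (u, v, depth) measurements into 3D camera-frame
    points via the pinhole model:
        X = d * (u - cx) / fx,  Y = d * (v - cy) / fy,  Z = d.
    A generic point-cloud initialization sketch."""
    return [((u - cx) * d / fx, (v - cy) * d / fy, d)
            for u, v, d in points_uv_depth]
```

The resulting point cloud can seed Gaussian means across the GS architectures the abstract mentions.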
cs.CV / 18 / 2602.17134

B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates

B$^3$-Seg:无摄像头、无训练的基于解析EIG和Beta-Bernoulli贝叶斯更新的3DGS分割
Kamata, Hiromichi, Munro, Samuel Arthur, Homma, Fuminori
Abstract
Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves competitive results to high-cost supervised methods while operating end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
Chinese Translation
交互式3D高斯溅射(3DGS)分割对于电影和游戏制作中预重建资产的实时编辑至关重要。然而,现有方法依赖于预定义的摄像头视角、真实标签或昂贵的再训练,使其在低延迟使用中不够实用。我们提出了B$^3$-Seg(用于3DGS的Beta-Bernoulli贝叶斯分割),这是一种在无摄像头和无训练条件下进行开放词汇3DGS分割的快速且理论基础扎实的方法。我们的方法将分割重新表述为顺序的Beta-Bernoulli贝叶斯更新,并通过解析的期望信息增益(EIG)主动选择下一个视角。这种贝叶斯表述保证了EIG的自适应单调性和子模性,从而产生对最优视角采样策略的贪婪$(1{-}1/e)$近似。在多个数据集上的实验表明,B$^3$-Seg在几秒钟内实现端到端的分割,且其结果与高成本的监督方法相当。结果表明,B$^3$-Seg使得实用的交互式3DGS分割成为可能,并具备可证明的信息效率。
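The analytic EIG of one more Bernoulli observation under a Beta posterior is the mutual information I(y; theta) = H(E[theta]) - E[H(theta)]. The sketch below evaluates E[H(theta)] by midpoint quadrature on the Beta pdf (a closed form via digamma also exists) and pairs it with the conjugate update; this is a generic Beta-Bernoulli sketch, not the paper's code:

```python
import math

def bernoulli_entropy(p):
    """Entropy (nats) of a Bernoulli(p) variable."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def beta_bernoulli_eig(a, b, n=50000):
    """Expected information gain of one more Bernoulli observation under
    a Beta(a, b) posterior: I(y; theta) = H(E[theta]) - E[H(theta)]."""
    p = a / (a + b)                      # posterior predictive P(y = 1)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    exp_h = 0.0
    for i in range(n):                   # midpoint quadrature over theta
        t = (i + 0.5) / n
        pdf = math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))
        exp_h += bernoulli_entropy(t) * pdf / n
    return bernoulli_entropy(p) - exp_h

def beta_update(a, b, y):
    """Conjugate Beta-Bernoulli posterior update for one observation y."""
    return (a + 1, b) if y else (a, b + 1)
```

As evidence accumulates (larger a + b), the EIG shrinks, which is what lets a greedy view-selection policy stop early.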
cs.CV / 19 / 2602.17168

BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning

BadCLIP++:多模态对比学习中的隐蔽和持久后门
Liang, Siyuan, Jing, Yongcheng, Wang, Yingjie, Huang, Jiaxing, Chang, Ee-chien, Tao, Dacheng
Abstract
Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, maintaining solutions within a low-curvature wide basin resistant to fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 points. Across nineteen defenses, ASR remains above 99.90% with less than 0.8% drop in clean accuracy. The method further attains 65.03% success in physical attacks and shows robustness against watermark removal defenses.
Chinese Translation
针对多模态对比学习模型的后门攻击研究面临两个关键挑战:隐蔽性和持久性。现有方法在强检测或持续微调下往往失败,主要由于(1)跨模态不一致性暴露了触发模式,以及(2)在低中毒率下梯度稀释加速了后门遗忘。这些耦合原因尚未得到充分建模和解决。我们提出了BadCLIP++,一个统一框架,旨在解决这两个挑战。为了实现隐蔽性,我们引入了一种语义融合的QR微触发器,该触发器在任务相关区域附近嵌入不可感知的模式,保持干净数据的统计特性,同时产生紧凑的触发分布。我们进一步应用目标对齐的子集选择,以增强低注入率下的信号。为了实现持久性,我们通过半径收缩和质心对齐来稳定触发嵌入,并通过曲率控制和弹性权重整合来稳定模型参数,保持解决方案在一个低曲率的宽盆地内,抵抗微调。我们还提供了首个理论分析,表明在信任区域内,来自干净微调和后门目标的梯度是同向的,从而产生攻击成功率下降的非递增上界。实验表明,仅用0.3%的中毒率,BadCLIP++在数字环境中实现了99.99%的攻击成功率(ASR),超越基线11.4个百分点。在十九种防御方法中,ASR保持在99.90%以上,干净准确率下降不到0.8%。该方法在物理攻击中进一步实现了65.03%的成功率,并显示出对水印去除防御的鲁棒性。
cs.CV / 20 / 2602.17182

NRGS-SLAM: Monocular Non-Rigid SLAM for Endoscopy via Deformation-Aware 3D Gaussian Splatting

NRGS-SLAM:基于变形感知的3D高斯点云的单目非刚性SLAM用于内窥镜检查
Shan, Jiwei, Cai, Zeyu, Li, Yirui, Chen, Yongbo, Han, Lijun, Liu, Yun-hui, Wang, Hesheng, Cheng, Shing Shin
Abstract
Visual simultaneous localization and mapping (V-SLAM) is a fundamental capability for autonomous perception and navigation. However, endoscopic scenes violate the rigidity assumption due to persistent soft-tissue deformations, creating a strong coupling ambiguity between camera ego-motion and intrinsic deformation. Although recent monocular non-rigid SLAM methods have made notable progress, they often lack effective decoupling mechanisms and rely on sparse or low-fidelity scene representations, which leads to tracking drift and limited reconstruction quality. To address these limitations, we propose NRGS-SLAM, a monocular non-rigid SLAM system for endoscopy based on 3D Gaussian Splatting. To resolve the coupling ambiguity, we introduce a deformation-aware 3D Gaussian map that augments each Gaussian primitive with a learnable deformation probability, optimized via a Bayesian self-supervision strategy without requiring external non-rigidity labels. Building on this representation, we design a deformable tracking module that performs robust coarse-to-fine pose estimation by prioritizing low-deformation regions, followed by efficient per-frame deformation updates. A carefully designed deformable mapping module progressively expands and refines the map, balancing representational capacity and computational efficiency. In addition, a unified robust geometric loss incorporates external geometric priors to mitigate the inherent ill-posedness of monocular non-rigid SLAM. Extensive experiments on multiple public endoscopic datasets demonstrate that NRGS-SLAM achieves more accurate camera pose estimation (up to 50\% reduction in RMSE) and higher-quality photo-realistic reconstructions than state-of-the-art methods. Comprehensive ablation studies further validate the effectiveness of our key design choices. Source code will be publicly available upon paper acceptance.
Chinese Translation
视觉同步定位与地图构建(V-SLAM)是自主感知和导航的基本能力。然而,由于持续的软组织变形,内窥镜场景违反了刚性假设,导致相机自我运动与内在变形之间存在强耦合模糊性。尽管最近的单目非刚性SLAM方法取得了显著进展,但它们通常缺乏有效的解耦机制,并依赖稀疏或低保真度的场景表示,这导致了跟踪漂移和有限的重建质量。为了解决这些局限性,我们提出了NRGS-SLAM,一种基于3D高斯点云的单目非刚性SLAM系统,专为内窥镜检查设计。为了消除耦合模糊性,我们引入了一种变形感知的3D高斯地图,为每个高斯原语增加了可学习的变形概率,通过贝叶斯自监督策略进行优化,而无需外部非刚性标签。在此表示的基础上,我们设计了一个可变形跟踪模块,通过优先考虑低变形区域,进行稳健的粗到细姿态估计,随后进行高效的逐帧变形更新。一个精心设计的可变形映射模块逐步扩展和细化地图,平衡表示能力和计算效率。此外,一个统一的鲁棒几何损失结合了外部几何先验,以减轻单目非刚性SLAM固有的不适定性。在多个公共内窥镜数据集上的大量实验表明,NRGS-SLAM在相机姿态估计上实现了更高的准确性(RMSE减少高达50%)和更高质量的照片真实感重建,优于最先进的方法。全面的消融研究进一步验证了我们关键设计选择的有效性。源代码将在论文接受后公开。
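The "prioritize low-deformation regions" idea can be sketched as deformation-weighted residuals in the pose objective, with per-Gaussian weights 1 - p_deform. This is an illustrative form, not the paper's exact tracking loss:

```python
def weighted_pose_residual(residuals, deform_prob):
    """Down-weight per-Gaussian tracking residuals by their learned
    deformation probability, so that near-rigid regions dominate camera
    pose estimation and intrinsic tissue motion is discounted."""
    weights = [1.0 - p for p in deform_prob]
    return sum(w * r * r for w, r in zip(weights, residuals)) / sum(weights)
```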
cs.CV / 21 / 2602.17186

Selective Training for Large Vision Language Models via Visual Information Gain

通过视觉信息增益对大型视觉语言模型进行选择性训练
Lee, Seulbi, Hwang, Sangheum
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
Chinese Translation
大型视觉语言模型(LVLMs)取得了显著进展,但它们往往受到语言偏见的影响,生成的答案未能依赖于视觉证据。尽管之前的研究尝试通过解码策略、架构修改或精心策划的指令数据来缓解这一问题,但通常缺乏对单个训练样本或标记实际受益于图像的量化衡量。在本研究中,我们引入了视觉信息增益(Visual Information Gain, VIG),这是一种基于困惑度的指标,用于衡量视觉输入所提供的预测不确定性降低。VIG能够在样本和标记层面进行细粒度分析,有效突出视觉基础的元素,如颜色、空间关系和属性。基于此,我们提出了一种VIG引导的选择性训练方案,优先考虑高VIG样本和标记。这种方法改善了视觉基础性,并减轻了语言偏见,通过专注于视觉信息丰富的样本和标记,显著减少监督的同时实现了更优的性能。
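Since VIG is described as a perplexity-based measure of how much the image reduces prediction uncertainty, a plausible token-level formulation (our reading, not necessarily the paper's exact definition) is the difference in negative log-likelihood of the answer tokens with and without the visual input:

```python
import numpy as np

def token_nll(probs, target_ids):
    """Negative log-likelihood of each target token under a model's distribution."""
    return -np.log(probs[np.arange(len(target_ids)), target_ids])

def visual_information_gain(probs_with_image, probs_text_only, target_ids):
    """Per-token VIG: uncertainty reduction provided by the image.
    Positive values mark visually grounded tokens."""
    return token_nll(probs_text_only, target_ids) - token_nll(probs_with_image, target_ids)

# toy example: 3 answer tokens over a 4-word vocabulary
vocab = ["red", "blue", "left", "the"]
target_ids = np.array([0, 2, 3])  # answer: "red left the"
probs_with_image = np.array([[0.90, 0.05, 0.03, 0.02],
                             [0.05, 0.05, 0.85, 0.05],
                             [0.10, 0.10, 0.10, 0.70]])
probs_text_only = np.array([[0.30, 0.30, 0.20, 0.20],
                            [0.25, 0.25, 0.25, 0.25],
                            [0.10, 0.10, 0.10, 0.70]])
vig = visual_information_gain(probs_with_image, probs_text_only, target_ids)
# color and spatial tokens gain from the image; the function word does not
```

Selective training would then keep only samples/tokens whose VIG exceeds a threshold.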
cs.CV / 22 / 2602.17196

EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

EntropyPrune:基于矩阵熵引导的多模态大型语言模型视觉标记剪枝
Wang, Yahong, Wu, Juncheng, Ni, Zhangkai, Yang, Chengmei, Liu, Yihang, Yang, Longzhen, Zhou, Yuyin, Wen, Ying, He, Lianghua
Abstract
Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
Chinese Translation
多模态大型语言模型(MLLMs)由于每张图像处理数百个视觉标记而产生了巨大的推理成本。尽管标记剪枝已被证明对加速推理有效,但确定何时何地进行剪枝仍主要依赖于启发式方法。现有方法通常依赖于静态的、经验选择的层,这限制了模型间的可解释性和可迁移性。在本研究中,我们引入了矩阵熵的视角,并识别出一个“熵崩溃层”(Entropy Collapse Layer, ECL),在该层中,视觉表示的信息内容表现出急剧且一致的下降,这为选择剪枝阶段提供了一个原则性标准。基于这一观察,我们提出了EntropyPrune,一种新颖的基于矩阵熵引导的标记剪枝框架,它量化了单个视觉标记的信息价值,并在不依赖注意力图的情况下剪除冗余标记。此外,为了实现高效计算,我们利用了双Gram矩阵的谱等价性,降低了熵计算的复杂性,并实现了最高64倍的理论加速。在多样化的多模态基准测试中,广泛实验表明,EntropyPrune在准确性和效率上始终优于最先进的剪枝方法。在LLaVA-1.5-7B上,我们的方法实现了68.2%的FLOPs减少,同时保持了96.0%的原始性能。此外,EntropyPrune在高分辨率和基于视频的模型中有效泛化,突显了其在实际MLLM加速中的强大鲁棒性和可扩展性。代码将公开发布在https://github.com/YahongWang1/EntropyPrune。
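The matrix entropy used here can be sketched as the von Neumann entropy of a trace-normalized Gram matrix of token features, and the "spectral equivalence of dual Gram matrices" refers to the fact that XXᵀ and XᵀX share the same nonzero eigenvalues, so the entropy can be computed on whichever matrix is smaller. A rough illustration under those assumed definitions (not the paper's exact formulation):

```python
import numpy as np

def matrix_entropy(gram):
    """Von Neumann entropy of a trace-normalized Gram matrix."""
    eigvals = np.linalg.eigvalsh(gram)
    p = eigvals / eigvals.sum()
    p = p[p > 1e-12]  # drop (numerically) zero eigenvalues
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(576, 64))  # e.g. 576 visual tokens, 64-dim features

# dual Gram matrices: (576, 576) vs (64, 64) share the nonzero spectrum,
# so the entropy is identical but the eigendecomposition is far cheaper
H_big = matrix_entropy(X @ X.T)
H_small = matrix_entropy(X.T @ X)
```

Computing on the 64×64 dual Gram matrix instead of the 576×576 one is the source of the claimed theoretical speedup.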
cs.CV / 23 / 2602.17200

GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

GASS:基于几何的球面采样用于文本到图像生成中的解耦多样性增强
Zhu, Ye, Newman, Kaleb S., Lutzeyer, Johannes F., Romero-Soriano, Adriana, Drozdzal, Michal, Russakovsky, Olga
Abstract
Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
Chinese Translation
尽管现代文本到图像(T2I)生成模型在语义对齐方面表现出色,但仍然难以从给定提示合成多样化的图像。这种多样性的缺乏不仅限制了用户选择,还可能加剧社会偏见。在本研究中,我们通过几何视角增强T2I的多样性。与大多数现有方法主要依赖基于熵的指导来增加样本差异性不同,我们引入了几何感知球面采样(Geometry-Aware Spherical Sampling, GASS),通过明确控制与提示相关和与提示无关的变异来源来增强多样性。具体而言,我们使用两个正交方向分解CLIP嵌入中的多样性度量:文本嵌入捕捉与提示相关的语义变异,而识别出的正交方向则捕捉与提示无关的变异(例如背景)。基于这种分解,GASS增加了生成图像嵌入在两个轴上的几何投影扩展,并通过沿生成轨迹扩展预测来引导T2I采样过程。我们在不同的冻结T2I骨干网络(U-Net和DiT,扩散和流)及基准上的实验表明,解耦多样性增强的有效性对图像保真度和语义对齐的影响最小。
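The decomposition described above — measuring spread along the prompt (text-embedding) direction versus a prompt-orthogonal direction — can be sketched in a few lines. This is a simplified illustration of the geometry (the orthogonal direction here is just the leading PCA axis of the residuals; GASS's identified direction and guidance mechanism are not reproduced):

```python
import numpy as np

def projection_spreads(image_embs, text_emb):
    """Spread of image embeddings along the prompt direction and along
    the dominant prompt-orthogonal direction."""
    t = text_emb / np.linalg.norm(text_emb)
    coef = image_embs @ t                       # prompt-dependent coordinates
    residual = image_embs - np.outer(coef, t)   # prompt-independent part
    residual_c = residual - residual.mean(axis=0)
    _, _, Vt = np.linalg.svd(residual_c, full_matrices=False)
    ortho_coef = residual @ Vt[0]               # leading orthogonal direction
    return coef.std(), ortho_coef.std()

rng = np.random.default_rng(1)
t = rng.normal(size=16)
# batch of "generated" embeddings: varied along t, almost none orthogonally
base = rng.normal(size=16)
embs = base + np.outer(rng.normal(scale=2.0, size=8), t / np.linalg.norm(t))
embs += rng.normal(scale=0.01, size=embs.shape)
sem_spread, ortho_spread = projection_spreads(embs, t)
```

A batch like this one has semantic variety but no background variety; GASS would push to enlarge both spreads.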
cs.CV / 24 / 2602.17231

HiMAP: History-aware Map-occupancy Prediction with Fallback

HiMAP:具备历史感知的地图占用预测与回退机制
Xu, Yiming, Yang, Yi, Cheng, Hao, Sester, Monika
Abstract
Accurate motion forecasting is critical for autonomous driving, yet most predictors rely on multi-object tracking (MOT) with identity association, assuming that objects are correctly and continuously tracked. When tracking fails due to, e.g., occlusion, identity switches, or missed detections, prediction quality degrades and safety risks increase. We present HiMAP, a tracking-free trajectory prediction framework that remains reliable under MOT failures. HiMAP converts past detections into spatiotemporally invariant historical occupancy maps and introduces a historical query module that conditions on the current agent state to iteratively retrieve agent-specific history from unlabeled occupancy representations. The retrieved history is summarized by a temporal map embedding and, together with the final query and map context, drives a DETR-style decoder to produce multi-modal future trajectories. This design lifts identity reliance, supports streaming inference via reusable encodings, and serves as a robust fallback when tracking is unavailable. On Argoverse 2, HiMAP achieves performance comparable to tracking-based methods while operating without IDs, and it substantially outperforms strong baselines in the no-tracking setting, yielding relative gains of 11% in FDE, 12% in ADE, and a 4% reduction in MR over a fine-tuned QCNet. Beyond aggregate metrics, HiMAP delivers stable forecasts for all agents simultaneously without waiting for tracking to recover, highlighting its practical value for safety-critical autonomy. The code is available under: https://github.com/XuYiMing83/HiMAP.
Chinese Translation
准确的运动预测对于自动驾驶至关重要,但大多数预测器依赖于多目标跟踪(MOT)及身份关联,假设对象能够被正确且持续地跟踪。当由于遮挡、身份切换或漏检等原因导致跟踪失败时,预测质量会下降,安全风险会增加。我们提出了HiMAP,一个无跟踪的轨迹预测框架,在MOT失败的情况下仍然保持可靠。HiMAP将过去的检测转换为时空不变的历史占用图,并引入一个历史查询模块,该模块根据当前代理状态迭代检索无标签占用表示中的代理特定历史。检索到的历史通过时间图嵌入进行汇总,并与最终查询和地图上下文一起驱动DETR风格的解码器生成多模态未来轨迹。该设计消除了对身份的依赖,通过可重用编码支持流式推理,并在跟踪不可用时作为强大的回退机制。在Argoverse 2上,HiMAP的性能与基于跟踪的方法相当,且无需身份信息,并且在无跟踪设置中显著超越强基线,相对于经过微调的QCNet,FDE相对提升11%,ADE提升12%,MR减少4%。除了聚合指标外,HiMAP为所有代理同时提供稳定的预测,而无需等待跟踪恢复,突显了其在安全关键自主系统中的实际价值。代码可在以下链接获取:https://github.com/XuYiMing83/HiMAP。
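The core input transformation — turning identity-free past detections into per-frame occupancy maps — can be sketched as a simple 2D rasterization. Grid size, resolution, and origin below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def rasterize_occupancy(detections, grid_size=64, resolution=1.0, origin=(0.0, 0.0)):
    """Rasterize identity-free (x, y) detections of one past frame into a
    binary occupancy grid; no tracking IDs are required."""
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    for x, y in detections:
        col = int((x - origin[0]) / resolution)
        row = int((y - origin[1]) / resolution)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row, col] = 1.0
    return grid

# stack per-frame grids into a history tensor of shape (T, H, W)
history = [[(3.2, 4.7), (10.0, 10.0)],   # detections at t-2
           [(3.9, 5.1)]]                 # detections at t-1
occ = np.stack([rasterize_occupancy(frame) for frame in history])
```

Because the grids carry no identities, a downstream query module (as in HiMAP) must retrieve each agent's history by conditioning on its current state rather than on track IDs.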
cs.CV / 25 / 2602.17250

Inferring Height from Earth Embeddings: First insights using Google AlphaEarth

从地球嵌入推断高度:使用谷歌AlphaEarth的初步见解
Hamoudzadeh, Alireza, Belloni, Valeria, Ravanelli, Roberta
Abstract
This study investigates whether the geospatial and multimodal features encoded in Earth Embeddings can effectively guide deep learning (DL) regression models for regional surface height mapping. In particular, we focused on AlphaEarth Embeddings at 10 m spatial resolution and evaluated their capability to support terrain height inference using a high-quality Digital Surface Model (DSM) as reference. U-Net and U-Net++ architectures were thus employed as lightweight convolutional decoders to assess how well the geospatial information distilled in the embeddings can be translated into accurate surface height estimates. Both architectures achieved strong training performance (both with R² = 0.97), confirming that the embeddings encode informative and decodable height-related signals. On the test set, performance decreased due to distribution shifts in height frequency between training and testing areas. Nevertheless, U-Net++ shows better generalization (R² = 0.84, median difference = -2.62 m) compared with the standard U-Net (R² = 0.78, median difference = -7.22 m), suggesting enhanced robustness to distribution mismatch. While the testing RMSE (approximately 16 m for U-Net++) and residual bias highlight remaining challenges in generalization, strong correlations indicate that the embeddings capture transferable topographic patterns. Overall, the results demonstrate the promising potential of AlphaEarth Embeddings to guide DL-based height mapping workflows, particularly when combined with spatially aware convolutional architectures, while emphasizing the need to address bias for improved regional transferability.
Chinese Translation
本研究探讨了编码在Earth Embeddings中的地理空间和多模态特征是否能够有效指导深度学习(DL)回归模型进行区域表面高度映射。特别地,我们关注于10米空间分辨率的AlphaEarth Embeddings,并评估其在使用高质量数字表面模型(DSM)作为参考时支持地形高度推断的能力。因此,采用了U-Net和U-Net++架构作为轻量级卷积解码器,以评估嵌入中提炼的地理空间信息如何转化为准确的表面高度估计。这两种架构均表现出强大的训练性能(均为R² = 0.97),确认了嵌入编码了信息丰富且可解码的高度相关信号。在测试集上,由于训练和测试区域之间高度频率的分布变化,性能有所下降。然而,与标准U-Net(R² = 0.78,中位数差异 = -7.22 m)相比,U-Net++显示出更好的泛化能力(R² = 0.84,中位数差异 = -2.62 m),这表明其对分布不匹配的鲁棒性增强。尽管测试均方根误差(RMSE,U-Net++约为16米)和残差偏差突显了泛化中的挑战,但强相关性表明嵌入捕捉到了可转移的地形模式。总体而言,结果展示了AlphaEarth Embeddings在指导基于DL的高度映射工作流程中的良好潜力,特别是在与空间感知卷积架构结合时,同时强调了需要解决偏差以提高区域可转移性的必要性。
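The reported evaluation metrics (R², RMSE, median difference) are straightforward to compute against a reference DSM; a minimal sketch with synthetic heights (the numbers below are illustrative, not the paper's data):

```python
import numpy as np

def height_metrics(pred, ref):
    """R^2, RMSE and median difference between predicted and reference heights."""
    diff = pred - ref
    ss_res = (diff ** 2).sum()
    ss_tot = ((ref - ref.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt((diff ** 2).mean())
    return r2, rmse, np.median(diff)

rng = np.random.default_rng(0)
ref = rng.uniform(0.0, 200.0, size=10_000)                 # reference DSM heights (m)
pred = ref + rng.normal(scale=5.0, size=ref.shape) - 2.0   # noisy, biased prediction
r2, rmse, med = height_metrics(pred, ref)
```

Note how a systematic bias (here -2 m) shows up in the median difference while barely moving R² — the same pattern the abstract reports for the test areas.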
cs.CV / 26 / 2602.17252

A Multi-modal Detection System for Infrastructure-based Freight Signal Priority

基于基础设施的货运信号优先多模态检测系统
Zhang, Ziyan, Wei, Chuheng, Zhao, Xuanpeng, Li, Siyan, Snyder, Will, Stas, Mike, Hao, Peng, Boriboonsomsin, Kanok, Wu, Guoyuan
Abstract
Freight vehicles approaching signalized intersections require reliable detection and motion estimation to support infrastructure-based Freight Signal Priority (FSP). Accurate and timely perception of vehicle type, position, and speed is essential for enabling effective priority control strategies. This paper presents the design, deployment, and evaluation of an infrastructure-based multi-modal freight vehicle detection system integrating LiDAR and camera sensors. A hybrid sensing architecture is adopted, consisting of an intersection-mounted subsystem and a midblock subsystem, connected via wireless communication for synchronized data transmission. The perception pipeline incorporates both clustering-based and deep learning-based detection methods with Kalman filter tracking to achieve stable real-time performance. LiDAR measurements are registered into geodetic reference frames to support lane-level localization and consistent vehicle tracking. Field evaluations demonstrate that the system can reliably monitor freight vehicle movements at high spatio-temporal resolution. The design and deployment provide practical insights for developing infrastructure-based sensing systems to support FSP applications.
Chinese Translation
接近信号控制交叉口的货运车辆需要可靠的检测和运动估计,以支持基于基础设施的货运信号优先(FSP)。准确及时地感知车辆类型、位置和速度对于实现有效的优先控制策略至关重要。本文介绍了一种基于基础设施的多模态货运车辆检测系统的设计、部署和评估,该系统集成了激光雷达(LiDAR)和摄像头传感器。采用了一种混合传感架构,包括一个安装在交叉口的子系统和一个中段子系统,通过无线通信连接以实现同步数据传输。感知管道结合了基于聚类和基于深度学习的检测方法,并使用卡尔曼滤波跟踪,以实现稳定的实时性能。激光雷达测量结果被注册到大地参考框架中,以支持车道级定位和一致的车辆跟踪。现场评估表明,该系统能够以高时空分辨率可靠地监测货运车辆的运动。设计和部署为开发支持FSP应用的基于基础设施的传感系统提供了实际的见解。
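The tracking stage pairs detections with Kalman filtering; a minimal constant-velocity filter over 2D positions illustrates the predict/update cycle (process and measurement noise values are illustrative, not those of the deployed system):

```python
import numpy as np

def make_cv_kalman(dt=0.1):
    """Constant-velocity Kalman matrices for state [x, y, vx, vy]."""
    F = np.eye(4); F[0, 2] = F[1, 3] = dt           # state transition
    H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0   # only position is measured
    Q = np.eye(4) * 0.01                            # process noise
    R = np.eye(2) * 0.25                            # measurement noise
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update with measurement z
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

F, H, Q, R = make_cv_kalman()
rng = np.random.default_rng(0)
x, P = np.zeros(4), np.eye(4) * 10.0
truth_v = np.array([8.0, 0.0])                      # freight vehicle at 8 m/s along x
for k in range(1, 101):
    true_pos = truth_v * (k * 0.1)
    z = true_pos + rng.normal(scale=0.5, size=2)    # noisy LiDAR/camera detection
    x, P = kalman_step(x, P, z, F, H, Q, R)
```

The filtered velocity estimate is what an FSP controller would use to predict arrival time at the stop bar.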
cs.CV / 27 / 2602.17260

EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection

EA-Swin:一种嵌入无关的Swin Transformer用于AI生成视频检测
Mai, Hung, Dinh, Loi, Nguyen, Duc Hai, Do, Dat, Doan, Luong, Quoc, Khanh Nguyen, Vu, Huan, Ho, Phong, Islam, Naeem Ul, Do, Tuan
Abstract
Recent advances in foundation video generators such as Sora2, Veo3, and other commercial systems have produced highly realistic synthetic videos, exposing the limitations of existing detection methods that rely on shallow embedding trajectories, image-based adaptation, or computationally heavy MLLMs. We propose EA-Swin, an Embedding-Agnostic Swin Transformer that models spatiotemporal dependencies directly on pretrained video embeddings via a factorized windowed attention design, making it compatible with generic ViT-style patch-based encoders. Alongside the model, we construct the EA-Video dataset, a benchmark dataset comprising 130K videos that integrates newly collected samples with curated existing datasets, covering diverse commercial and open-source generators and including unseen-generator splits for rigorous cross-distribution evaluation. Extensive experiments show that EA-Swin achieves 0.97-0.99 accuracy across major generators, outperforming prior SoTA methods (typically 0.8-0.9) by a margin of 5-20%, while maintaining strong generalization to unseen distributions, establishing a scalable and robust solution for modern AI-generated video detection.
Chinese Translation
近期,基础视频生成器如Sora2、Veo3及其他商业系统的进展,产生了高度逼真的合成视频,暴露了现有检测方法的局限性,这些方法依赖于浅层嵌入轨迹、基于图像的适应或计算量大的多模态大模型(MLLMs)。我们提出了EA-Swin,一种嵌入无关的Swin Transformer,通过分解窗口注意力设计直接在预训练的视频嵌入上建模时空依赖性,使其与通用的ViT风格的基于补丁的编码器兼容。与此同时,我们构建了EA-Video数据集,这是一个基准数据集,包含130K个视频,整合了新收集的样本与策划的现有数据集,涵盖了多种商业和开源生成器,并包括未见生成器的划分,以进行严格的跨分布评估。大量实验表明,EA-Swin在主要生成器上实现了0.97-0.99的准确率,超越了先前的最先进方法(通常为0.8-0.9),提升幅度为5-20%,同时对未见分布保持强大的泛化能力,为现代AI生成视频检测建立了一个可扩展且稳健的解决方案。
cs.CV / 28 / 2602.17277

Physics Encoded Spatial and Temporal Generative Adversarial Network for Tropical Cyclone Image Super-resolution

物理编码的空间与时间生成对抗网络用于热带气旋图像超分辨率
Zhang, Ruoyi, Yuan, Jiawei, Ye, Lujia, Yu, Runling, Zhao, Liling
Abstract
High-resolution satellite imagery is indispensable for tracking the genesis, intensification, and trajectory of tropical cyclones (TCs). However, existing deep learning-based super-resolution (SR) methods often treat satellite image sequences as generic videos, neglecting the underlying atmospheric physical laws governing cloud motion. To address this, we propose a Physics Encoded Spatial and Temporal Generative Adversarial Network (PESTGAN) for TC image super-resolution. Specifically, we design a disentangled generator architecture incorporating a PhyCell module, which approximates the vorticity equation via constrained convolutions and encodes the resulting approximate physical dynamics as implicit latent representations to separate physical dynamics from visual textures. Furthermore, a dual-discriminator framework is introduced, employing a temporal discriminator to enforce motion consistency alongside spatial realism. Experiments on the Digital Typhoon dataset for 4× upscaling demonstrate that PESTGAN establishes a better performance in structural fidelity and perceptual quality. While maintaining competitive pixel-wise accuracy compared to existing approaches, our method significantly excels in reconstructing meteorologically plausible cloud structures with superior physical fidelity.
Chinese Translation
高分辨率卫星图像对于追踪热带气旋(TC)的生成、增强和轨迹至关重要。然而,现有的基于深度学习的超分辨率(SR)方法往往将卫星图像序列视为通用视频,忽视了支配云运动的潜在大气物理法则。为了解决这个问题,我们提出了一种物理编码的空间与时间生成对抗网络(PESTGAN)用于热带气旋图像超分辨率。具体而言,我们设计了一种解耦生成器架构,结合了PhyCell模块,该模块通过约束卷积近似涡度方程,并将得到的近似物理动态编码为隐式潜在表示,以将物理动态与视觉纹理分离。此外,我们引入了双判别器框架,采用时间判别器来强制执行运动一致性以及空间真实感。在数字台风数据集上进行4倍放大实验表明,PESTGAN在结构保真度和感知质量方面表现更佳。在与现有方法相比保持竞争性像素级准确度的同时,我们的方法在重建气象上合理的云结构方面显著优于其他方法,具备更高的物理保真度。
cs.CV / 29 / 2602.17310

Attachment Anchors: A Novel Framework for Laparoscopic Grasping Point Prediction in Colorectal Surgery

附着锚点:一种用于结直肠手术中腹腔镜抓取点预测的新框架
Schneider, Dennis N., Wagner, Lars, Rueckert, Daniel, Wilhelm, Dirk
Abstract
Accurate grasping point prediction is a key challenge for autonomous tissue manipulation in minimally invasive surgery, particularly in complex and variable procedures such as colorectal interventions. Due to their complexity and prolonged duration, colorectal procedures have been underrepresented in current research. At the same time, they pose a particularly interesting learning environment due to repetitive tissue manipulation, making them a promising entry point for autonomous, machine learning-driven support. Therefore, in this work, we introduce attachment anchors, a structured representation that encodes the local geometric and mechanical relationships between tissue and its anatomical attachments in colorectal surgery. This representation reduces uncertainty in grasping point prediction by normalizing surgical scenes into a consistent local reference frame. We demonstrate that attachment anchors can be predicted from laparoscopic images and incorporated into a grasping framework based on machine learning. Experiments on a dataset of 90 colorectal surgeries demonstrate that attachment anchors improve grasping point prediction compared to image-only baselines. There are particularly strong gains in out-of-distribution settings, including unseen procedures and operating surgeons. These results suggest that attachment anchors are an effective intermediate representation for learning-based tissue manipulation in colorectal surgery.
Chinese Translation
准确的抓取点预测是微创手术中自主组织操作面临的一个关键挑战,尤其是在结直肠手术等复杂且多变的干预中。由于其复杂性和较长的持续时间,结直肠手术在当前研究中代表性不足。同时,由于重复性的组织操作,它们构成了一个特别有趣的学习环境,使其成为自主的、机器学习驱动的支持的有希望的切入点。因此,在本研究中,我们引入了附着锚点,这是一种结构化表示,编码了结直肠手术中组织及其解剖附着结构之间的局部几何和机械关系。这种表示通过将手术场景标准化为一致的局部参考框架,减少了抓取点预测的不确定性。我们证明了可以从腹腔镜图像中预测附着锚点,并将其纳入基于机器学习的抓取框架。对90例结直肠手术数据集的实验表明,与仅使用图像的基线相比,附着锚点改善了抓取点预测。在分布外设置(包括未见过的手术流程和主刀医生)中,提升尤为显著。这些结果表明,附着锚点是结直肠手术中基于学习的组织操作的有效中间表示。
cs.CV / 30 / 2602.17322

Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline

利用对比学习构建相似性引导的篡改文档数据生成管道
Dhouib, Mohamed, Buscaldi, Davide, Vanier, Sonia, Shabou, Aymen
Abstract
Detecting tampered text in document images is a challenging task due to data scarcity. To address this, previous work has attempted to generate tampered documents using rule-based methods. However, the resulting documents often suffer from limited variety and poor visual quality, typically leaving highly visible artifacts that are rarely observed in real-world manipulations. This undermines the model's ability to learn robust, generalizable features and results in poor performance on real-world data. Motivated by this discrepancy, we propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. We also train a second auxiliary network to evaluate whether a crop tightly encloses the intended characters, without cutting off parts of characters or including parts of adjacent ones. Using a carefully designed generation pipeline that leverages both networks, we introduce a framework capable of producing diverse, high-quality tampered document images. We assess the effectiveness of our data generation pipeline by training multiple models on datasets derived from the same source images, generated using our method and existing approaches, under identical training protocols. Evaluating these models on various open-source datasets shows that our pipeline yields consistent performance improvements across architectures and datasets.
Chinese Translation
在文档图像中检测篡改文本是一项具有挑战性的任务,主要由于数据稀缺。为了解决这一问题,以往的研究尝试使用基于规则的方法生成篡改文档。然而,生成的文档往往缺乏多样性且视觉质量较差,通常留下在现实世界操作中很少见到的明显伪影。这削弱了模型学习稳健、可推广特征的能力,并导致在真实数据上的表现不佳。基于这一差异,我们提出了一种新颖的方法来生成高质量的篡改文档图像。我们首先训练一个辅助网络来比较文本裁剪,利用对比学习并采用一种新策略来定义正样本对及其对应的负样本。我们还训练第二个辅助网络来评估一个裁剪是否紧密包围了目标字符,而不切割字符的部分或包含相邻字符的部分。通过精心设计的生成管道,结合这两个网络,我们引入了一个框架,能够生成多样化、高质量的篡改文档图像。我们通过在相同源图像生成的数据集上训练多个模型,使用我们的方法和现有方法在相同的训练协议下进行评估,从而评估我们的数据生成管道的有效性。在各种开源数据集上评估这些模型显示,我们的管道在不同架构和数据集上都能带来一致的性能提升。
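The contrastive training of the crop-comparison network presumably relies on an InfoNCE-style objective over positive and negative crop pairs; the sketch below shows that generic loss (the paper's actual contribution — its strategy for *defining* positive pairs — is not reproduced here):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: pull the positive crop embedding close,
    push negative crop embeddings away (embeddings are L2-normalized first)."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=32)
pos_easy = a + rng.normal(scale=0.1, size=32)   # near-duplicate text crop
pos_hard = a + rng.normal(scale=1.0, size=32)   # weakly related crop
negs = rng.normal(size=(16, 32))                # unrelated crops
loss_easy = info_nce(a, pos_easy, negs)
loss_hard = info_nce(a, pos_hard, negs)
```

The loss decreases as anchor and positive grow more similar relative to the negatives, which is exactly the signal used to rank candidate tampered crops by visual plausibility.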
cs.CV / 31 / 2602.17337

Polaffini: A feature-based approach for robust affine and polyaffine image registration

Polaffini:一种基于特征的稳健仿射和多仿射图像配准方法
Legouhy, Antoine, Campo, Cosimo, Callaghan, Ross, Azadbakht, Hojjat, Zhang, Hui
Abstract
In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
Chinese Translation
在本研究中,我们提出了Polaffini,这是一个稳健且多功能的解剖基础配准框架。医学图像配准主要依赖于基于强度的配准方法,这些方法依赖于对齐质量的替代度量。相比之下,基于特征的方法通过识别明确的解剖对应关系来操作,尽管理论上更为理想,但由于可靠提取特征的挑战,这些方法在很大程度上已不再受到青睐。然而,得益于深度学习的最新进展,这些挑战现在已被显著克服,深度学习提供的预训练分割模型能够即时提供可靠的、细粒度的解剖轮廓。我们旨在展示这些进展可以被利用来创建新的解剖基础图像配准算法。为此,我们提出了Polaffini,该方法通过提取分割区域的质心,以一种特别简单的方式获得具有一对一对应关系的解剖基础特征点。这些特征点使得通过封闭形式解实现高效的全局和局部仿射匹配成为可能。所得到的整体变换范围从仿射到多仿射,并具有可调的平滑性。多仿射变换比仿射变换具有更多的自由度,从而实现更精细的对齐,并且其嵌入在对数欧几里得框架中确保了可微性。Polaffini既可以用于独立配准,也可以作为后续非线性配准的预对齐,我们将其与流行的基于强度的配准技术进行了评估。结果表明,Polaffini在结构对齐方面优于竞争方法,并为下游非线性配准提供了更好的初始化。Polaffini快速、稳健且准确,特别适合集成到医学图像处理管道中。
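The closed-form affine step has a standard linear least-squares solution: given 1-to-1 matched centroids of segmented regions, the affine transform minimizing the squared alignment error is obtained in one solve. This sketch covers only that global-affine step; Polaffini's polyaffine fusion and log-Euclidean blending are not reproduced.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 3D affine (A, b) minimizing sum_i ||A @ src_i + b - dst_i||^2,
    from 1-to-1 matched centroids."""
    n = len(src)
    X = np.hstack([src, np.ones((n, 1))])        # homogeneous coordinates
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (4, 3) stacked solution
    return M[:3].T, M[3]                         # A is (3, 3), b is (3,)

rng = np.random.default_rng(0)
centroids = rng.normal(size=(30, 3))             # e.g. 30 segmented-region centroids
A_true = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
b_true = np.array([2.0, -1.0, 0.5])
moved = centroids @ A_true.T + b_true            # centroids in the target image
A_est, b_est = fit_affine(centroids, moved)
```

Local affines fitted on subsets of centroids can then be blended into a polyaffine field, which is where the tunable smoothness and diffeomorphic guarantees come in.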
cs.CV / 32 / 2602.17372

Tree crop mapping of South America reveals links to deforestation and conservation

南美树木作物地图揭示了与森林砍伐和保护的联系
Jiang, Yuchang, Raichuk, Anton, Tong, Xiaoye, Garnot, Vivien Sainte Fare, Ortiz-Gonzalo, Daniel, Morris, Dan, Schindler, Konrad, Wegner, Jan Dirk, Neumann, Maxim
Abstract
Monitoring tree crop expansion is vital for zero-deforestation policies like the European Union's Regulation on Deforestation-free Products (EUDR). However, these efforts are hindered by a lack of high-resolution data distinguishing diverse agricultural systems from forests. Here, we present the first 10m-resolution tree crop map for South America, generated using a multi-modal, spatio-temporal deep learning model trained on Sentinel-1 and Sentinel-2 satellite imagery time series. The map identifies approximately 11 million hectares of tree crops, 23% of which is linked to 2000-2020 forest cover loss. Critically, our analysis reveals that existing regulatory maps supporting the EUDR often classify established agriculture, particularly smallholder agroforestry, as "forest". This discrepancy risks false deforestation alerts and unfair penalties for small-scale farmers. Our work mitigates this risk by providing a high-resolution baseline, supporting conservation policies that are effective, inclusive, and equitable.
Chinese Translation
监测树木作物的扩展对于零砍伐政策至关重要,例如欧盟的无砍伐产品法规(EUDR)。然而,这些努力受到缺乏高分辨率数据的阻碍,这类数据可用于区分多样化的农业系统与森林。在此,我们展示了南美首个10米分辨率的树木作物地图,该地图是利用基于Sentinel-1和Sentinel-2卫星影像时间序列训练的多模态时空深度学习模型生成的。该地图识别出约1100万公顷的树木作物,其中23%与2000-2020年的森林覆盖损失相关。关键的是,我们的分析揭示,现有支持EUDR的监管地图常常将已建立的农业用地,特别是小农户农林复合系统,错误地归类为“森林”。这种差异可能导致虚假的森林砍伐警报和对小规模农民的不公平惩罚。我们的研究通过提供高分辨率基线降低了这一风险,支持有效、包容和公平的保护政策。
cs.CV / 33 / 2602.17387

DRetHTR: Linear-Time Decoder-Only Retentive Network for Handwritten Text Recognition

DRetHTR:线性时间解码器专用的保留网络用于手写文本识别
Kim, Changhun, Mayr, Martin, Gorges, Thomas, Wu, Fei, Seuret, Mathias, Maier, Andreas, Christlein, Vincent
Abstract
State-of-the-art handwritten text recognition (HTR) systems commonly use Transformers, whose growing key-value (KV) cache makes decoding slow and memory-intensive. We introduce DRetHTR, a decoder-only model built on Retentive Networks (RetNet). Compared to an equally sized decoder-only Transformer baseline, DRetHTR delivers 1.6-1.9x faster inference with 38-42% less memory usage, without loss of accuracy. By replacing softmax attention with softmax-free retention and injecting multi-scale sequential priors, DRetHTR avoids a growing KV cache: decoding is linear in output length in both time and memory. To recover the local-to-global inductive bias of attention, we propose layer-wise gamma scaling, which progressively enlarges the effective retention horizon in deeper layers. This encourages early layers to model short-range dependencies and later layers to capture broader context, mitigating the flexibility gap introduced by removing softmax. Consequently, DRetHTR achieves best reported test character error rates of 2.26% (IAM-A, en), 1.81% (RIMES, fr), and 3.46% (Bentham, en), and is competitive on READ-2016 (de) with 4.21%. This demonstrates that decoder-only RetNet enables Transformer-level HTR accuracy with substantially improved decoding speed and memory efficiency.
Chinese Translation
最先进的手写文本识别(HTR)系统通常使用变换器(Transformers),其不断增长的键值(KV)缓存使得解码变得缓慢且占用大量内存。我们提出了DRetHTR,这是一种基于保留网络(Retentive Networks, RetNet)的解码器专用模型。与同等规模的解码器专用变换器基线相比,DRetHTR在推理速度上提高了1.6-1.9倍,同时内存使用减少了38-42%,且没有损失准确性。通过用无softmax的保留机制替代softmax注意力,并注入多尺度序列先验,DRetHTR避免了不断增长的KV缓存:在时间和内存上,解码与输出长度呈线性关系。为了恢复注意力的局部到全局归纳偏置,我们提出了逐层伽马缩放,这逐步扩大了深层中的有效保留视野。这鼓励早期层建模短期依赖,而后期层捕捉更广泛的上下文,从而减轻了去除softmax所引入的灵活性差距。因此,DRetHTR在测试字符错误率上取得了最佳报告结果:2.26%(IAM-A,英语),1.81%(RIMES,法语)和3.46%(Bentham,英语),在READ-2016(德语)上也表现出竞争力,达到了4.21%。这表明解码器专用的RetNet能够实现与变换器水平相当的HTR准确性,同时显著提高了解码速度和内存效率。
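The linear-time decoding claim follows from retention's recurrent form: the per-head state S_t = γ·S_{t-1} + k_tᵀv_t is a fixed-size matrix, so each decoding step costs O(1) instead of growing a KV cache. A single-head sketch verifying that the recurrent form matches the quadratic parallel form (layer-wise gamma scaling amounts to choosing a larger γ, i.e. a longer effective horizon, in deeper layers):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 12, 8
gamma = 0.9
q = rng.normal(size=(T, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))

# parallel (quadratic) form: o_t = sum_{s<=t} gamma^(t-s) * (q_t . k_s) * v_s
D = np.array([[gamma ** (t - s) if s <= t else 0.0 for s in range(T)]
              for t in range(T)])
out_parallel = ((q @ k.T) * D) @ v

# recurrent (linear-time) form: constant-size state instead of a growing cache
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(k[t], v[t])   # state update, O(d^2) per step
    out_recurrent[t] = q[t] @ S
```

Because the state never grows with output length, decoding time and memory stay flat as the recognized text line gets longer — the source of the reported speed and memory gains.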
cs.CV / 34 / 2602.17395

SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery

SpectralGCD:用于广义类别发现的谱概念选择与跨模态表示学习
Caselli, Lorenzo, Mistretta, Marco, Magistri, Simone, Bagdanov, Andrew D.
Abstract
Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering which exploits a cross-modal covariance matrix over the softmaxed similarities measured by a strong teacher model to automatically retain only relevant concepts from the dictionary. Forward and reverse knowledge distillation from the same teacher ensures that the cross-modal representations of the student remain both semantically sufficient and well-aligned. Across six benchmarks, SpectralGCD delivers accuracy comparable to or significantly superior to state-of-the-art methods at a fraction of the computational cost. The code is publicly available at: https://github.com/miccunifi/SpectralGCD.
Chinese Translation
广义类别发现(GCD)旨在在未标记数据中识别新类别,同时利用少量已知类别的标记子集。仅基于图像特征训练参数分类器往往会导致对旧类别的过拟合,而最近的多模态方法通过结合文本信息来提高性能。然而,这些方法独立处理各模态,且计算成本较高。我们提出了SpectralGCD,一种高效且有效的多模态GCD方法,利用CLIP跨模态图像-概念相似性作为统一的跨模态表示。每个图像被表示为来自大型任务无关字典的语义概念的混合,这将学习锚定到明确的语义上,并减少对虚假视觉线索的依赖。为了保持高效的学生模型所学表示的语义质量,我们引入了谱过滤(Spectral Filtering),该方法利用强教师模型测得的softmax相似性上的跨模态协方差矩阵,自动从字典中仅保留相关概念。来自同一教师的前向和反向知识蒸馏确保学生的跨模态表示在语义上既充分又良好对齐。在六个基准测试中,SpectralGCD以极低的计算成本提供了与最先进方法相当或显著更优的准确性。代码可在以下网址公开获取:https://github.com/miccunifi/SpectralGCD。
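One plausible reading of the spectral-filtering step (an assumption on our part; the paper's exact criterion may differ): form the concept-concept covariance of the teacher's softmaxed image-concept similarities and keep the concepts carrying the most spectral energy, discarding dictionary entries that never vary across images.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spectral_concept_filter(similarities, n_keep):
    """Keep the concepts that carry most of the spectral energy of the
    concept-concept covariance of softmaxed similarities."""
    p = softmax(similarities, axis=1)           # (n_images, n_concepts)
    cov = np.cov(p, rowvar=False)               # (n_concepts, n_concepts)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # per-concept energy across eigenvectors, weighted by eigenvalues
    energy = (eigvecs ** 2) @ np.clip(eigvals, 0.0, None)
    return np.argsort(energy)[::-1][:n_keep]

rng = np.random.default_rng(0)
n_images, n_concepts = 200, 10
sims = rng.normal(scale=0.05, size=(n_images, n_concepts))
sims[:, 3] += rng.normal(scale=2.0, size=n_images)   # concept 3 is discriminative
sims[:, 7] += rng.normal(scale=2.0, size=n_images)   # concept 7 is discriminative
kept = spectral_concept_filter(sims, n_keep=2)
```

Concepts whose softmaxed similarity barely varies across the dataset contribute no discriminative signal and are the natural candidates to drop.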
cs.CV / 35 / 2602.17397

A High-Level Survey of Optical Remote Sensing

光学遥感的高级综述
Koletsis, Panagiotis, Efthymiou, Vasilis, Vakalopoulou, Maria, Komodakis, Nikos, Doulamis, Anastasios, Papadopoulos, Georgios Th.
Abstract
In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.
Chinese Translation
近年来,计算机视觉的显著进展也推动了遥感技术的发展。同时,无人机的使用不断扩大,许多组织将其纳入日常操作中。大多数无人机默认配备RGB相机,这些相机既坚固又是最易于使用和解读的传感器之一。关于光学遥感的文献浩如烟海,涵盖了多种任务、能力和方法论。每个任务或方法论都可能值得进行专门的综述。本研究提供了该领域能力的全面概述,同时呈现了关键的信息,如数据集和见解。其目的是为进入该领域的研究人员提供指导,提供高层次的见解,并帮助他们专注于与其兴趣最相关的领域。据我们所知,目前没有现有的综述能够涵盖这一整体视角。
cs.CV / 36 / 2602.17419

EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

EAGLE:用于多模态大型语言模型的无调优工业异常检测的专家增强注意力引导
Peng, Xiaomeng, Huang, Xilang, Choi, Seon Han
Abstract
Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from expert models to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLM internals by examining the attention distribution of MLLMs to the anomalous image regions in the intermediate layers. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning based methods. Code is available at https://github.com/shengtun/Eagle
Chinese Translation
工业异常检测对智能制造至关重要,但许多深度学习方法仅产生二元决策,并提供有限的语义解释。多模态大型语言模型(MLLMs)有潜力生成细粒度的基于语言的分析,然而现有方法通常需要昂贵的微调,并且在异常检测准确性方面并未持续优于轻量级专业检测器。我们提出了一种用于MLLMs工业异常检测的专家增强注意力引导(EAGLE),这是一种无调优框架,整合专家模型的输出,引导MLLMs实现准确检测和可解释的异常描述。我们进一步研究EAGLE如何影响MLLMs的内部机制,通过检查MLLMs在中间层对异常图像区域的注意力分布。我们观察到,成功的异常检测与对异常区域的注意力集中度增加相关,EAGLE倾向于促进这种对齐。在MVTec-AD和VisA上的实验表明,EAGLE在多个MLLMs上提高了异常检测性能,而无需任何参数更新,取得了与基于微调的方法相当的结果。代码可在 https://github.com/shengtun/Eagle 获取。
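The attention analysis described above — how much intermediate-layer attention mass falls on anomalous regions — reduces to a simple ratio over a binary anomaly mask. A sketch of that metric only (not the EAGLE guidance pipeline itself):

```python
import numpy as np

def attention_concentration(attn, anomaly_mask):
    """Fraction of attention mass (from a query token to image patches)
    that lands inside the anomalous region."""
    attn = attn / attn.sum()
    return float(attn[anomaly_mask.astype(bool)].sum())

# 24x24 patch grid with a small anomalous region
mask = np.zeros((24, 24))
mask[10:13, 10:13] = 1.0
uniform = np.ones((24, 24))                 # attention spread evenly
focused = np.ones((24, 24))
focused[10:13, 10:13] = 50.0                # attention concentrated on the defect
c_uniform = attention_concentration(uniform, mask)
c_focused = attention_concentration(focused, mask)
```

A detection attempt with concentration near the uniform baseline (region area / total patches) is using little localized visual evidence, which is the failure mode the expert guidance is meant to correct.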
cs.CV / 37 / 2602.17473

4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

任意相机运动下的4D单目外科重建
Shan, Jiwei, Cai, Zeyu, Hsieh, Cheng-Tai, Li, Yirui, Liu, Hao, Han, Lijun, Wang, Hesheng, Cheng, Shing Shin
Abstract
Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
Chinese Translation
从内窥镜视频中重建可变形的外科场景既具有挑战性又在临床上至关重要。基于隐式神经表示或3D高斯泼溅(3D Gaussian splatting)的最新技术已取得显著进展。然而,大多数方法是为固定内窥镜视角的可变形场景设计的,并依赖于立体深度先验或准确的运动恢复结构(SfM)进行初始化和优化,这限制了它们在真实临床环境中处理大相机运动的单目序列的能力。为了解决这一问题,我们提出了Local-EndoGS,一种针对任意相机运动的单目内窥镜序列的高质量4D重建框架。Local-EndoGS引入了一种渐进的基于窗口的全局表示,将局部可变形场景模型分配给每个观察窗口,从而实现对具有大量运动的长序列的可扩展性。为了克服在没有立体深度或准确运动恢复结构的情况下不可靠的初始化,我们设计了一种粗到精的策略,结合了多视图几何、跨窗口信息和单目深度先验,为优化提供了稳健的基础。我们进一步结合了长距离2D像素轨迹约束和物理运动先验,以提高变形的合理性。在三个具有可变形场景和不同相机运动的公共内窥镜数据集上的实验表明,Local-EndoGS在外观质量和几何结构上始终优于最新的方法。消融研究验证了我们关键设计的有效性。代码将在接受后发布于:https://github.com/IRMVLab/Local-EndoGS。
cs.CV / 38 / 2602.17478

QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery

QuPAINT:一种物理感知的指令调优方法用于量子材料发现
Nguyen, Xuan-Bac, Nguyen, Hoang-Quan, Pandey, Sankalp, Faltermeier, Tim, Borys, Nicholas, Churchill, Hugh, Luu, Khoa
Abstract
Characterizing two-dimensional quantum materials from optical microscopy images is challenging due to the subtle layer-dependent contrast, limited labeled data, and significant variation across laboratories and imaging setups. Existing vision models struggle in this domain since they lack physical priors and cannot generalize to new materials or hardware conditions. This work presents a new physics-aware multimodal framework that addresses these limitations from both the data and model perspectives. We first present Synthia, a physics-based synthetic data generator that simulates realistic optical responses of quantum material flakes under thin-film interference. Synthia produces diverse and high-quality samples, helping reduce the dependence on expert manual annotation. We introduce QMat-Instruct, the first large-scale instruction dataset for quantum materials, comprising multimodal, physics-informed question-answer pairs designed to teach Multimodal Large Language Models (MLLMs) to understand the appearance and thickness of flakes. Then, we propose Physics-Aware Instruction Tuning (QuPAINT), a multimodal architecture that incorporates a Physics-Informed Attention module to fuse visual embeddings with optical priors, enabling more robust and discriminative flake representations. Finally, we establish QF-Bench, a comprehensive benchmark spanning multiple materials, substrates, and imaging settings, offering standardized protocols for fair and reproducible evaluation.
Chinese Translation
从光学显微镜图像中表征二维量子材料具有挑战性,因为存在微妙的层依赖对比、有限的标注数据以及实验室和成像设置之间的显著变化。现有的视觉模型在这一领域表现不佳,因为它们缺乏物理先验,无法推广到新材料或硬件条件。本文提出了一种新的物理感知多模态框架,从数据和模型两个角度解决这些局限性。我们首先介绍了Synthia,一个基于物理的合成数据生成器,能够模拟量子材料薄片在薄膜干涉下的真实光学响应。Synthia生成多样且高质量的样本,有助于减少对专家手动标注的依赖。我们引入了QMat-Instruct,这是首个针对量子材料的大规模指令数据集,包括多模态、物理信息的问答对,旨在教会多模态大型语言模型(MLLMs)理解薄片的外观和厚度。接着,我们提出了物理感知指令调优(QuPAINT),一种多模态架构,结合了物理信息注意力模块,将视觉嵌入与光学先验融合,从而实现更强健和具有辨别力的薄片表示。最后,我们建立了QF-Bench,这是一个涵盖多种材料、基底和成像设置的综合基准,提供标准化的协议以实现公平和可重复的评估。
cs.CV / 39 / 2602.17484

Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

追踪复制像素与规范化补丁相似度在复制检测中的应用
Lu, Yichen, Nie, Siwei, Lu, Minlong, Yang, Xudong, Zhang, Xiaobo, Zhang, Peng
Abstract
Image Copy Detection (ICD) aims to identify manipulated content between image pairs through robust feature representation learning. While self-supervised learning (SSL) has advanced ICD systems, existing view-level contrastive methods struggle with sophisticated edits due to insufficient fine-grained correspondence learning. We address this limitation by exploiting the inherent geometric traceability in edited content through two key innovations. First, we propose PixTrace - a pixel coordinate tracking module that maintains explicit spatial mappings across editing transformations. Second, we introduce CopyNCE, a geometrically-guided contrastive loss that regularizes patch affinity using overlap ratios derived from PixTrace's verified mappings. Our method bridges pixel-level traceability with patch-level similarity learning, suppressing supervision noise in SSL training. Extensive experiments demonstrate not only state-of-the-art performance (88.7% uAP / 83.9% RP90 for matcher, 72.6% uAP / 68.4% RP90 for descriptor on DISC21 dataset) but also better interpretability over existing methods.
Chinese Translation
图像复制检测(Image Copy Detection, ICD)旨在通过稳健的特征表示学习识别图像对之间的操纵内容。尽管自监督学习(Self-Supervised Learning, SSL)推动了ICD系统的发展,但现有的视图级对比方法在面对复杂编辑时,由于缺乏细粒度的对应学习而表现不佳。我们通过两个关键创新来解决这一局限性,利用编辑内容中固有的几何可追溯性。首先,我们提出了PixTrace——一个像素坐标追踪模块,能够在编辑变换中保持明确的空间映射。其次,我们引入了CopyNCE,这是一种几何引导的对比损失,通过利用PixTrace验证的映射中得出的重叠比率来规范化补丁相似度。我们的方法将像素级可追溯性与补丁级相似度学习相结合,抑制了SSL训练中的监督噪声。大量实验表明,我们的方法不仅在性能上达到了最先进的水平(在DISC21数据集上,匹配器的uAP为88.7%,RP90为83.9%;描述符的uAP为72.6%,RP90为68.4%),而且在可解释性上也优于现有方法。
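The abstract does not give the CopyNCE formula, so the sketch below shows one plausible reading of "regularizing patch affinity using overlap ratios": treat the traced overlap ratios as soft targets for a softmax over query-key patch similarities and take the cross-entropy. All names and the exact form are our assumptions.

```python
import numpy as np

def copynce_sketch(sim, overlap, tau=0.1):
    """InfoNCE-style loss where each query patch's positive distribution
    comes from geometric overlap ratios rather than one hard positive.

    sim:     (Nq, Nk) similarities between query and key patches.
    overlap: (Nq, Nk) overlap ratios in [0, 1] from traced pixel mappings.
    """
    # softmax over keys per query (numerically stabilized)
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    # normalize overlap rows into target distributions
    t = overlap / np.clip(overlap.sum(axis=1, keepdims=True), 1e-8, None)
    # cross-entropy between geometric targets and predicted patch affinity
    return float(-(t * np.log(p + 1e-8)).sum(axis=1).mean())

# Deterministic toy case: each query patch fully overlaps one key patch.
sim = np.array([[1.0, 0.0],
                [0.0, 1.0]])
overlap = np.eye(2)
loss = copynce_sketch(sim, overlap, tau=1.0)
```

With similarities already aligned to the overlaps, the loss reduces to the softmax cross-entropy of a correctly ranked pair (about 0.313 here).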
cs.CV / 40 / 2602.17517

FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality

基于FoundationPose初始化的3D-2D肝脏配准用于外科增强现实
Zhang, Hanyuan, He, Lucas, He, Runlong, Kadkhodamohammadi, Abdolrahim, Stoyanov, Danail, Davidson, Brian R., Mazomenos, Evangelos B., Clarkson, Matthew J.
Abstract
Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
Chinese Translation
增强现实可以改善腹腔镜肝脏手术中的肿瘤定位。现有的配准流程通常依赖于器官轮廓;可变形(非刚性)对齐通常通过有限元(FE)模型与降维或机器学习组件相结合来处理。我们将腹腔镜深度图与基础姿态估计器结合,用于相机-肝脏姿态估计,并用非刚性迭代最近点(NICP)替代基于有限元的变形,以降低工程/建模复杂性和专业知识要求。在真实患者数据上,深度增强的基础姿态方法在3个案例中实现了9.91毫米的平均配准误差。结合刚性-NICP的配准优于仅刚性配准,证明了NICP是有限元可变形模型的有效替代方案。该流程在实现临床相关精度的同时,提供了一种轻量级、工程友好的有限元变形替代方案。
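Both the rigid initialization and NICP refinement in this pipeline build on the same closed-form subproblem that ICP-family methods iterate: given tentative correspondences, solve for the best rigid transform via SVD (the Kabsch algorithm). A minimal sketch of that step (this is the textbook building block, not the paper's full deformable method):

```python
import numpy as np

def kabsch(P, Q):
    """Best-fit rotation R and translation t mapping points P onto Q
    (one rigid step of ICP, given correspondences). P, Q: (N, 3) rows."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)            # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# Toy check: recover a known rotation about z plus a translation.
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
P = np.random.default_rng(1).normal(size=(10, 3))
t_true = np.array([0.5, -1.0, 2.0])
Q = P @ Rz.T + t_true
R, t = kabsch(P, Q)
```

NICP extends this by allowing locally varying transforms with a stiffness regularizer, which is what lets it stand in for finite-element deformation here.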
cs.CV / 41 / 2602.17535

LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs

LATA:用于医疗视觉语言模型中符合不确定性的拉普拉斯辅助传导适应
Bozorgtabar, Behzad, Mahapatra, Dwarikanath, Roy, Sudipta, Naseer, Muzammal, Razzak, Imran, Ge, Zongyuan
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage becomes unbalanced, yielding a high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a *training- and label-free* refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a *failure-aware* conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant that uses calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
Chinese Translation
医疗视觉语言模型(VLMs)在医学影像中是强大的零样本识别器,但它们在领域转移下的可靠性依赖于经过校准且具有保证的不确定性。分裂符合预测(SCP)提供有限样本覆盖,但预测集往往变得庞大(低效率),且类别间覆盖不平衡,导致较高的类别条件覆盖差距(CCV),尤其是在少样本、不平衡的情况下;此外,简单地适应校准标签会破坏可交换性并使保证失效。我们提出了LATA(拉普拉斯辅助传导适应),这是一种*无训练且无标签*的精炼方法,作用于联合校准和测试池,通过在图像-图像 k-NN 图上利用少量CCCP均值场更新平滑零样本概率,并通过确定性变换保持SCP的有效性。我们进一步引入了一种*关注失败*的符合评分,嵌入到视觉-语言不确定性(ViLU)框架中,提供实例级的难度和标签合理性,以在固定覆盖下提高预测集的效率和类别平衡。LATA是黑箱的(无VLM更新),计算轻量(窗口传导,无反向传播),并包括一个可选的先验调节器,可以严格无标签运行,或者在需要时以使用一次校准边际的标签信息变体运行。在三个医疗VLM和九个下游任务中,LATA始终减少集合大小和CCV,同时匹配或收紧目标覆盖,超越了先前的传导基线,缩小了与使用标签方法之间的差距,且计算量远少于这些方法。全面的消融实验和定性分析表明,LATA在不破坏可交换性的情况下锐化了零样本预测。
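The core refinement, smoothing zero-shot probabilities over an image-image k-NN graph, can be sketched with a simplified diffusion update in place of the paper's CCCP mean-field iterations (the damping factor, iteration count, and function names below are our assumptions):

```python
import numpy as np

def smooth_probs(P0, W, alpha=0.7, iters=10):
    """Diffuse zero-shot class probabilities P0 (N, C) over a k-NN affinity
    matrix W (N, N); a simplified stand-in for LATA's mean-field updates."""
    W = W / np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)  # row-stochastic
    P = P0.copy()
    for _ in range(iters):
        P = alpha * P0 + (1 - alpha) * (W @ P)   # keep anchor to zero-shot scores
        P = P / P.sum(axis=1, keepdims=True)     # renormalize rows
    return P

# Three samples; the noisy middle one is pulled toward its two neighbors.
P0 = np.array([[0.9, 0.1],
               [0.4, 0.6],   # noisy zero-shot prediction
               [0.8, 0.2]])
W = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
P = smooth_probs(P0, W)
```

Because the map from `P0` to `P` treats every pooled sample identically, it is a deterministic transform of the joint pool, which is the property the abstract invokes to preserve SCP exchangeability.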
cs.CV / 42 / 2602.17555

GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

GraphThinker:通过事件图思维增强视频推理
Cheng, Zixu, Li, Da, Hu, Jian, Liu, Ziquan, Li, Wei, Gong, Shaogang
Abstract
Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during the video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.
Chinese Translation
视频推理需要理解视频中事件之间的因果关系。然而,这些关系通常是隐含的,并且手动标注成本高昂。虽然现有的多模态大型语言模型(MLLMs)通常通过密集的字幕或视频摘要推断事件关系以进行视频推理,但这种建模仍然缺乏因果理解。如果不在视频事件内部和跨事件建模显式的因果结构,这些模型在视频推理中会出现幻觉。在本研究中,我们提出了GraphThinker,一种基于强化微调的方法,构建结构化的事件级场景图,并增强视觉基础,以共同减少视频推理中的幻觉。具体而言,我们首先采用MLLM构建一个基于事件的视频场景图(EVSG),该图明确建模了事件内部和跨事件的关系,并将这些形成的场景图作为中间思维过程纳入MLLM中。我们还在强化微调过程中引入了视觉注意力奖励,这增强了视频基础并进一步减轻了幻觉。我们在两个数据集RexTime和VidHalluc上评估了GraphThinker,结果显示其在捕捉物体和事件关系方面具有更强的能力,事件定位更精确,相较于先前的方法减少了视频推理中的幻觉。
cs.CV / 43 / 2602.17558

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

RetouchIQ:基于指令的图像修饰的多模态大语言模型代理与通用奖励
Wu, Qiucheng, Shi, Jing, Jenni, Simon, Kafle, Kushal, Wang, Tianyu, Chang, Shiyu, Zhao, Handong
Abstract
Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
Chinese Translation
最近,多模态大语言模型(MLLMs)的进展显示出将视觉-语言推理扩展到基于专业工具的图像编辑的巨大潜力,使直观和创造性的编辑成为可能。一个有前景的方向是利用强化学习(RL)使MLLM能够推理并执行在专业图像编辑软件中的最佳工具使用计划。然而,由于缺乏可靠、可验证的奖励信号来反映创造性编辑固有的主观性,训练仍然具有挑战性。在本研究中,我们介绍了RetouchIQ,一个通过由通用奖励模型指导的MLLM代理执行基于指令的可执行图像编辑的框架。RetouchIQ解释用户指定的编辑意图,并生成相应的可执行图像调整,桥接高层次的美学目标与精确的参数控制。为了超越传统的基于规则的奖励,这些奖励通过手工制作的指标计算与固定参考图像的相似性,我们提出了一种通用奖励模型,这是一种经过强化学习微调的MLLM,通过一组生成的指标逐案评估编辑结果。然后,奖励模型通过多模态推理提供标量反馈,使强化学习能够获得高质量、与指令一致的梯度。我们整理了一个扩展数据集,包含190k个指令-推理对,并建立了基于指令的图像编辑的新基准。实验表明,RetouchIQ在语义一致性和感知质量上显著优于之前的基于MLLM和扩散的编辑系统。我们的研究结果展示了通用奖励驱动的MLLM代理作为专业图像编辑的灵活、可解释和可执行助手的潜力。
cs.CV / 44 / 2602.17599

Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment

Art2Mus:通过视觉条件和大规模跨模态对齐实现艺术作品到音乐的生成
Rinaldi, Ivan, Mendula, Matteo, Fanelli, Nicola, Levé, Florence, Testi, Matteo, Castellano, Giovanna, Vessio, Gennaro
Abstract
Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems-as expected given the substantially increased difficulty of removing linguistic supervision-ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.
Chinese Translation
音乐生成通过多模态深度学习取得了显著进展,使得模型能够从文本合成音频,最近也能从图像合成音频。然而,现有的图像条件系统存在两个基本限制:(i)它们通常是在自然照片上训练的,限制了其捕捉艺术作品更丰富的语义、风格和文化内容的能力;(ii)大多数依赖于图像到文本的转换阶段,使用语言作为简化条件的语义捷径,但这阻碍了直接的视觉到音频学习。基于这些不足,我们引入了ArtSound,这是一个包含105,884对艺术作品与音乐的大规模多模态数据集,配有双模态标题,数据集通过扩展ArtGraph和Free Music Archive获得。我们进一步提出了ArtToMus,这是第一个明确设计用于直接艺术作品到音乐生成的框架,它将数字化的艺术作品映射到音乐,而无需图像到文本的翻译或基于语言的语义监督。该框架将视觉嵌入投影到潜在扩散模型的条件空间,使得音乐合成仅由视觉信息引导。实验结果表明,ArtToMus生成的输出在音乐上连贯、风格上一致,并反映了源艺术作品的显著视觉线索。尽管绝对对齐分数仍低于文本条件系统的分数——考虑到去除语言监督的难度显著增加是可以预期的——ArtToMus在感知质量和有意义的跨模态对应性方面表现出竞争力。本研究确立了直接的视觉到音乐生成作为一个独特且具有挑战性的研究方向,并提供了支持多媒体艺术、文化遗产和人工智能辅助创作实践应用的资源。代码和数据集将在接受后公开发布。
cs.CV / 45 / 2602.17605

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

动态适应:基于相关性的在线元学习与潜在概念的地理空间发现
Khan, Jowaria, Sarkar, Anindya, Vorobeychik, Yevgeniy, Bondi-Kelly, Elizabeth
Abstract
In many real-world settings, such as environmental monitoring, disaster response, or public health, where data collection is costly and difficult and environments are dynamic, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
Chinese Translation
在许多现实世界的场景中,例如环境监测、灾害响应或公共卫生,由于数据收集成本高且困难,以及环境动态变化,从未观察区域进行战略性采样对于在资源紧张的情况下有效揭示隐藏目标至关重要。然而,稀疏和偏倚的地理空间真实数据限制了现有基于学习的方法(如强化学习)的适用性。为了解决这个问题,我们提出了一个统一的地理空间发现框架,整合了主动学习、在线元学习和概念引导推理。我们的方法引入了两个基于共享的*概念相关性*的关键创新,该概念捕捉了领域特定因素如何影响目标的存在:*概念加权的不确定性采样策略*,其中不确定性由基于现成可用的领域特定概念(例如,土地覆盖、源接近度)学习的相关性进行调节;以及*关注相关性的元批次形成策略*,该策略在在线元更新过程中促进语义多样性,从而提高在动态环境中的泛化能力。我们的实验包括在真实世界的数据集上测试致癌PFAS(全氟和多氟烷基物质)污染,展示了我们的方法在数据有限和环境变化情况下揭示目标的可靠性。
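The concept-weighted uncertainty sampling strategy can be sketched as a simple acquisition rule: score each candidate cell by its predictive entropy scaled by the learned relevance of the concepts present there. The multiplicative form, function names, and toy numbers below are our assumptions, not the paper's exact rule.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each row of a probability array."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def concept_weighted_scores(probs, concepts, relevance):
    """Acquisition score per candidate cell: predictive entropy modulated
    by learned relevance of the cell's concepts (e.g. land cover).

    probs:     (N, C) predictive distribution per cell
    concepts:  (N, K) binary concept indicators per cell
    relevance: (K,)   learned relevance weight of each concept
    """
    return entropy(probs) * (concepts @ relevance)

probs = np.array([[0.5, 0.5],    # maximally uncertain, relevant concept
                  [0.5, 0.5],    # equally uncertain, low-relevance concept
                  [0.9, 0.1]])   # confident, relevant concept
concepts = np.array([[1, 0],
                     [0, 1],
                     [1, 0]])
relevance = np.array([1.0, 0.2])
scores = concept_weighted_scores(probs, concepts, relevance)
pick = int(np.argmax(scores))
```

Plain uncertainty sampling would tie the first two cells; the relevance weight breaks the tie toward the cell whose concepts actually predict target presence.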
cs.CV / 46 / 2602.17636

CORAL: Correspondence Alignment for Improved Virtual Try-On

CORAL:改进虚拟试穿的对应对齐
Kim, Jiyoung, Shin, Youngjin, Jin, Siyoon, Chung, Dahyun, Nam, Jisu, Kim, Tongmin, Park, Jongjae, Kang, Hyeonwoo, Kim, Seungryong
Abstract
Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
Chinese Translation
现有的虚拟试穿(VTON)方法常常难以保留精细的服装细节,尤其是在需要准确的人物-服装对应关系的无配对设置中。这些方法并未明确强制执行人物-服装对齐,且未能解释在扩散变换器(Diffusion Transformers, DiTs)中对应关系是如何产生的。本文首先分析了基于DiT架构的全3D注意力,揭示了人物-服装对应关系在很大程度上依赖于全3D注意力中精确的人物-服装查询-键匹配。基于这一见解,我们引入了对应对齐(CORrespondence ALignment, CORAL),这是一个基于DiT的框架,明确将查询-键匹配与稳健的外部对应关系对齐。CORAL集成了两个互补组件:一个对应蒸馏损失,用于将可靠的匹配与人物-服装注意力对齐,以及一个熵最小化损失,用于锐化注意力分布。我们进一步提出了一种基于VLM的评估协议,以更好地反映人类偏好。CORAL在基线之上始终表现出改进,增强了全局形状转移和局部细节保留。大量消融实验验证了我们的设计选择。
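Of CORAL's two losses, the entropy minimization term is easy to illustrate: compute the entropy of each person-query's attention distribution over garment keys and drive it down so attention sharpens onto the matched patches. A minimal sketch under our own naming (the distillation loss and the attention extraction are omitted):

```python
import numpy as np

def attention_entropy_loss(attn):
    """Mean entropy of person-to-garment attention rows; minimizing it
    sharpens each query's distribution over garment keys.

    attn: (num_queries, num_keys) rows summing to 1.
    """
    a = np.clip(attn, 1e-12, 1.0)
    return float(-(a * np.log(a)).sum(axis=-1).mean())

# A sharp row (attention concentrated on one garment patch) scores lower
# than a diffuse, uniform row.
sharp = np.array([[0.97, 0.01, 0.01, 0.01]])
diffuse = np.full((1, 4), 0.25)
l_sharp = attention_entropy_loss(sharp)
l_diffuse = attention_entropy_loss(diffuse)
```

The uniform row attains the maximum entropy log(4), so gradient descent on this loss pushes attention away from that diffuse regime.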
cs.CV / 47 / 2602.17639

IntRec: Intent-based Retrieval with Contrastive Refinement

IntRec:基于意图的对比精炼检索
Shamsolmoali, Pourya, Zareapoor, Masoumeh, Granger, Eric, Lu, Yue
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
Chinese Translation
从复杂场景中检索用户指定的对象仍然是一项具有挑战性的任务,尤其是在查询模糊或涉及多个相似对象时。现有的开放词汇检测器以一次性方式运作,缺乏根据用户反馈精炼预测的能力。为了解决这个问题,我们提出了IntRec,一个交互式对象检索框架,能够基于用户反馈精炼预测。其核心是一个意图状态(Intent State, IS),该状态维护正锚点(确认线索)和负约束(拒绝假设)的双重记忆集。对比对齐函数通过最大化与正线索的相似性,同时惩罚被拒绝的对象,对候选对象进行排名,从而在杂乱场景中实现细粒度的消歧。我们的交互式框架在没有额外监督的情况下显著提高了检索准确性。在LVIS数据集上,IntRec达到了35.4的AP,相比于OVMR、CoDet和CAKE分别提高了+2.3、+3.7和+0.5。在具有挑战性的LVIS-Ambiguous基准测试中,经过一次纠正反馈后,其性能比一次性基线提高了+7.9 AP,每次交互增加的延迟不到30毫秒。
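The contrastive alignment over the Intent State's dual memories can be sketched as a scoring rule: reward similarity to the confirmed positive anchors and penalize similarity to rejected hypotheses. The max-over-memory aggregation, the penalty weight `beta`, and all names below are our assumptions, not the paper's exact function.

```python
import numpy as np

def rank_candidates(cand, positives, negatives, beta=1.0):
    """Score candidate embeddings by similarity to confirmed cues minus a
    penalty for similarity to rejected hypotheses (all vectors unit-norm)."""
    def max_sim(x, mem):
        if len(mem) == 0:
            return np.zeros(len(x))
        return (x @ np.asarray(mem, dtype=float).T).max(axis=1)
    scores = max_sim(cand, positives) - beta * max_sim(cand, negatives)
    return np.argsort(-scores), scores

# Toy 2-D embeddings: candidate 1 matches the confirmed cue, while
# candidate 0 matches a hypothesis the user already rejected.
cand = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
order, scores = rank_candidates(cand,
                                positives=[[0.0, 1.0]],
                                negatives=[[1.0, 0.0]])
```

After one corrective feedback the rejected object drops in the ranking, which is the disambiguation behavior the LVIS-Ambiguous results measure.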
cs.CV / 48 / 2602.17650

Human-level 3D shape perception emerges from multi-view learning

人类水平的三维形状感知源于多视角学习
Bonnen, Tyler, Malik, Jitendra, Kanazawa, Angjoo
Abstract
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these `multi-view' models on a well established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
Chinese Translation
人类能够从二维视觉输入中推断物体的三维结构。对这种能力的建模一直是视觉智能科学与工程的长期目标,但数十年的计算方法未能达到人类的表现。在此,我们开发了一种建模框架,可以直接从实验刺激中预测人类对任意物体的三维形状推断。我们通过一类新型神经网络实现了这一目标,该网络使用视觉空间目标在自然感官数据上进行训练;给定一组来自自然场景中不同位置的图像,这些模型学习预测与这些图像相关的空间信息,如相机位置和视觉深度,而不依赖于任何与物体相关的归纳偏见。值得注意的是,这些视觉空间信号类似于人类易于获得的感官线索。我们设计了一种零样本评估方法,以确定这些“多视角”模型在一个已建立的三维感知任务上的表现,然后比较模型与人类的行为。我们的建模框架首次在三维形状推断中匹配人类的准确性,即使没有特定任务的训练或微调。值得注意的是,模型响应的独立读取预测了人类行为的细致度量,包括错误模式和反应时间,揭示了模型动态与人类感知之间的自然对应关系。综合来看,我们的发现表明,人类水平的三维感知可以通过对自然视觉空间数据的简单、可扩展的学习目标而产生。所有代码、人类行为数据和重现我们发现所需的实验刺激均可在我们的项目页面上找到。
cs.CV / 49 / 2602.17659

When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

当视觉超越语言:评估与缓解视觉语言行动模型中的反事实失败
Fang, Yu, Feng, Yuchun, Jing, Dong, Liu, Jiaqi, Yang, Yue, Wei, Zhenyu, Szafir, Daniel, Ding, Mingyu
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
Chinese Translation
视觉语言行动模型(VLAs)承诺将语言指令与机器人控制相结合,但在实际应用中往往未能忠实地遵循语言。当面临缺乏强场景特定监督的指令时,VLAs会遭遇反事实失败:它们基于数据集偏差引发的视觉捷径进行操作,反复执行已学会的行为,并选择在训练中频繁出现的对象,而不考虑语言意图。为了系统地研究这一现象,我们引入了LIBERO-CF,这是第一个针对VLAs的反事实基准,通过在视觉上合理的LIBERO布局下分配替代指令来评估语言遵循能力。我们的评估揭示了反事实失败在最先进的VLAs中普遍存在但尚未得到充分探索。我们提出了反事实行动指导(Counterfactual Action Guidance, CAG),这是一种简单而有效的双分支推理方案,明确地对VLAs中的语言条件进行正则化。CAG将标准VLA策略与无语言条件的视觉-行动(Vision-Action, VA)模块相结合,使得在行动选择过程中能够进行反事实比较。这一设计减少了对视觉捷径的依赖,提高了在观察不足任务上的鲁棒性,并且不需要额外的演示或对现有架构或预训练模型的修改。大量实验表明其在多种VLAs中的即插即用集成及一致性改进。例如,在LIBERO-CF上,采用免训练策略时,CAG将$\pi_{0.5}$的语言遵循准确性提高了9.7%,在观察不足任务上的任务成功率提高了3.6%;与VA模型配对时,两者分别进一步提升15.5%和8.5%。在现实世界评估中,CAG平均减少了9.4%的反事实失败,并提高了17.2%的任务成功率。
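The abstract does not specify how the two branches are combined, but the dual-branch comparison is reminiscent of classifier-free guidance: amplify what the language condition adds over the vision-only branch. The sketch below uses that guidance form as one plausible instantiation; the combination rule, weight `w`, and toy logits are our assumptions.

```python
import numpy as np

def cag_scores(cond_logits, uncond_logits, w=1.0):
    """Counterfactual-guided action scores: boost the evidence the language
    condition contributes beyond the language-unconditioned VA branch
    (classifier-free-guidance-style combination; a sketch, not the paper's
    exact rule)."""
    return cond_logits + w * (cond_logits - uncond_logits)

# Vision shortcut: the unconditioned branch strongly prefers action 0
# (a frequently seen object); the instruction adds evidence for action 1.
uncond = np.array([3.0, 1.0, 0.0])
cond = np.array([3.1, 2.5, 0.0])
plain = int(np.argmax(cond))                       # still picks the shortcut
guided = int(np.argmax(cag_scores(cond, uncond, w=2.0)))
```

The plain policy keeps the shortcut action, while the counterfactual comparison surfaces the instruction-consistent one, which is the failure mode CAG targets.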
cs.CV / 50 / 2602.17665

OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

OpenEarthAgent:一种工具增强地理空间代理的统一框架
Shabbir, Akashah, Sheikh, Muhammad Umer, Munir, Muhammad Akhtar, Debary, Hiyam, Fiaz, Mustansar, Zaheer, Muhammad Zaigham, Fraccaro, Paolo, Khan, Fahad Shahbaz, Khan, Muhammad Haris, Zhu, Xiao Xiang, Khan, Salman
Abstract
Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
Chinese Translation
近期在多模态推理方面的进展使得代理能够解读图像,将其与语言连接,并执行结构化分析任务。然而,将这些能力扩展到遥感领域仍然面临挑战,因为模型必须在保持连贯的多步骤逻辑的同时,对空间尺度、地理结构和多光谱指数进行推理。为了解决这一问题,OpenEarthAgent提出了一种统一框架,用于开发基于卫星图像、自然语言查询和详细推理轨迹训练的工具增强地理空间代理。该训练流程依赖于对结构化推理轨迹的监督微调,使模型与在多样化分析背景下经过验证的多步骤工具交互对齐。相关语料库包含14,538个训练实例和1,169个评估实例,训练集中的推理步骤超过10万,评估集中的推理步骤超过7千。该语料库涵盖城市、环境、灾害和基础设施领域,并结合了基于GIS的操作以及NDVI、NBR和NDBI等指数分析。基于明确的推理轨迹,学习到的代理展示了结构化推理、稳定的空间理解和可解释的行为,通过工具驱动的地理空间交互在多样化条件下表现出色。我们报告了相对于强基线的持续改进,以及与近期开放和闭源模型的竞争性能。
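The spectral indices named in the abstract have standard band-ratio definitions, which such tool-augmented agents would invoke as deterministic operations: NDVI = (NIR − Red)/(NIR + Red), NBR = (NIR − SWIR2)/(NIR + SWIR2), NDBI = (SWIR1 − NIR)/(SWIR1 + NIR). A minimal sketch (the formulas are standard remote-sensing definitions; how OpenEarthAgent wires them as tools is not specified in the abstract):

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / np.clip(nir + red, 1e-12, None)

def nbr(nir, swir2):
    """Normalized Burn Ratio (high values for healthy vegetation)."""
    return (nir - swir2) / np.clip(nir + swir2, 1e-12, None)

def ndbi(swir1, nir):
    """Normalized Difference Built-up Index (high values for built-up areas)."""
    return (swir1 - nir) / np.clip(swir1 + nir, 1e-12, None)

# Toy reflectances for a vegetated pixel: high NIR, low red.
v = ndvi(np.array([0.6]), np.array([0.1]))
```

All three are bounded in [−1, 1], which makes them convenient, hardware-independent inputs for multi-step geospatial reasoning.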
人工智能 (Artificial Intelligence)
67
cs.AI / 1 / 2602.16714

AIdentifyAGE Ontology for Decision Support in Forensic Dental Age Assessment

用于法医牙齿年龄评估决策支持的 AIdentifyAGE 本体
Marcelo, Renato, Rodrigues, Ana, Pereira, Cristiana Palmela, Figueiras, António, Santos, Rui, Figueira, José Rui, Francisco, Alexandre P, Vaz, Cátia
Abstract
Age assessment is crucial in forensic and judicial decision-making, particularly in cases involving undocumented individuals and unaccompanied minors, where legal thresholds determine access to protection, healthcare, and judicial procedures. Dental age assessment is widely recognized as one of the most reliable biological approaches for adolescents and young adults, but current practices are challenged by methodological heterogeneity, fragmented data representation, and limited interoperability between clinical, forensic, and legal information systems. These limitations hinder transparency and reproducibility, and they are amplified by the increasing adoption of AI-based methods. The AIdentifyAGE ontology is domain-specific and provides a standardized, semantically coherent framework, encompassing both manual and AI-assisted forensic dental age assessment workflows, and enabling traceable linkage between observations, methods, reference data, and reported outcomes. It models the complete medico-legal workflow, integrating judicial context, individual-level information, forensic examination data, dental developmental assessment methods, radiographic imaging, statistical reference studies, and AI-based estimation methods. It is being developed together with domain experts, and it builds on upper-level and established biomedical, dental, and machine learning ontologies, ensuring interoperability, extensibility, and compliance with FAIR principles. The AIdentifyAGE ontology is a fundamental step to enhance consistency, transparency, and explainability, establishing a robust foundation for ontology-driven decision support systems in medico-legal and judicial contexts.
Chinese Translation
年龄评估在法医和司法决策中至关重要,尤其是在涉及无证人员和无陪伴未成年人的案件中,法律门槛决定了获得保护、医疗保健和司法程序的资格。牙齿年龄评估被广泛认为是对青少年和年轻成年人最可靠的生物学方法之一,但当前的实践面临方法学异质性、数据表示碎片化以及临床、法医和法律信息系统之间的互操作性有限等挑战。这些局限性阻碍了透明度和可重复性,尤其是在基于人工智能(AI)的方法日益普及的背景下。AIdentifyAGE 本体是特定领域的,提供了一个标准化、语义一致的框架,涵盖了手动和 AI 辅助的法医牙齿年龄评估工作流程,并能够在观察、方法、参考数据和报告结果之间建立可追溯的联系。它建模了完整的医学法律工作流程,整合了司法背景、个体信息、法医检查数据、牙齿发育评估方法、放射影像、统计参考研究和基于 AI 的估计方法。该本体与领域专家共同开发,并基于上层和已建立的生物医学、牙科和机器学习本体,确保互操作性、可扩展性,并符合 FAIR 原则。AIdentifyAGE 本体是增强一致性、透明度和可解释性的基础步骤,为医学法律和司法背景下的本体驱动决策支持系统建立了坚实的基础。
cs.AI / 2 / 2602.16715

Retrieval Augmented (Knowledge Graph), and Large Language Model-Driven Design Structure Matrix (DSM) Generation of Cyber-Physical Systems

基于检索增强(知识图谱)和大型语言模型驱动的网络物理系统设计结构矩阵(DSM)生成
Bank, H. Sinan, Herber, Daniel R.
Abstract
We explore the potential of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Graph-based RAG (GraphRAG) for generating Design Structure Matrices (DSMs). We test these methods on two distinct use cases -- a power screwdriver and a CubeSat with known architectural references -- evaluating their performance on two key tasks: determining relationships between predefined components, and the more complex challenge of identifying components and their subsequent relationships. We measure the performance by assessing each element of the DSM and overall architecture. Despite design and computational challenges, we identify opportunities for automated DSM generation, with all code publicly available for reproducibility and further feedback from the domain experts.
Chinese Translation
我们探讨了大型语言模型(LLMs)、检索增强生成(RAG)和基于图的RAG(GraphRAG)在生成设计结构矩阵(DSMs)方面的潜力。我们在两个不同的用例上测试了这些方法——一个电动螺丝刀和一个具有已知架构参考的CubeSat,评估它们在两个关键任务上的表现:确定预定义组件之间的关系,以及更复杂的挑战,即识别组件及其后续关系。我们通过评估DSM的每个元素和整体架构来衡量性能。尽管面临设计和计算挑战,我们识别出自动化DSM生成的机会,所有代码均已公开以便于重现和进一步获得领域专家的反馈。
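A DSM itself is just a square dependency matrix over system components. The minimal sketch below shows the structure being built and one plausible per-element evaluation; the component names, relationships, and metric are illustrative assumptions, not taken from the paper.

```python
# Hypothetical DSM for a power screwdriver: entry (i, j) = 1 means
# component i depends on (interfaces with) component j.
components = ["motor", "battery", "gearbox", "housing", "trigger"]
index = {name: i for i, name in enumerate(components)}

n = len(components)
dsm = [[0] * n for _ in range(n)]

# Declare directed dependencies (source depends on target).
edges = [("motor", "battery"), ("motor", "gearbox"),
         ("gearbox", "housing"), ("trigger", "motor")]
for src, dst in edges:
    dsm[index[src]][index[dst]] = 1

def element_accuracy(predicted, reference):
    """Fraction of matching off-diagonal cells between two DSMs,
    mirroring the per-element evaluation the abstract describes."""
    total = correct = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += 1
            correct += predicted[i][j] == reference[i][j]
    return correct / total

# A perfect prediction scores 1.0 against itself.
print(element_accuracy(dsm, dsm))  # -> 1.0
```

An LLM- or RAG-generated DSM would be scored against a reference matrix like this, cell by cell, before assessing the overall architecture.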
cs.AI / 3 / 2602.16716

Contextuality from Single-State Representations: An Information-Theoretic Principle for Adaptive Intelligence

来自单状态表征的情境性:自适应智能的信息理论原则
Kim, Song-Ju
Abstract
Adaptive systems often operate across multiple contexts while reusing a fixed internal state space due to constraints on memory, representation, or physical resources. Such single-state reuse is ubiquitous in natural and artificial intelligence, yet its fundamental representational consequences remain poorly understood. We show that contextuality is not a peculiarity of quantum mechanics, but an inevitable consequence of single-state reuse in classical probabilistic representations. Modeling contexts as interventions acting on a shared internal state, we prove that any classical model reproducing contextual outcome statistics must incur an irreducible information-theoretic cost: dependence on context cannot be mediated solely through the internal state. We provide a minimal constructive example that explicitly realizes this cost and clarifies its operational meaning. We further explain how nonclassical probabilistic frameworks avoid this obstruction by relaxing the assumption of a single global joint probability space, without invoking quantum dynamics or Hilbert space structure. Our results identify contextuality as a general representational constraint on adaptive intelligence, independent of physical implementation.
Chinese Translation
自适应系统通常在多个情境中操作,同时由于内存、表征或物理资源的限制而重用固定的内部状态空间。这种单状态重用在自然和人工智能中普遍存在,但其基本的表征后果仍然不够清楚。我们表明,情境性并不是量子力学的特性,而是经典概率表征中单状态重用的必然结果。将情境建模为作用于共享内部状态的干预,我们证明了任何重现情境结果统计的经典模型都必须承担不可约的信息理论成本:对情境的依赖不能仅通过内部状态来中介。我们提供了一个最小的构造性示例,明确实现了这一成本并阐明其操作意义。我们进一步解释了非经典概率框架如何通过放宽单一全局联合概率空间的假设来避免这一障碍,而无需引入量子动力学或希尔伯特空间结构。我们的结果将情境性识别为对自适应智能的一种普遍表征约束,与物理实现无关。
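The obstruction can be made concrete with a toy example (my construction for illustration, not necessarily the paper's minimal one): three binary observables measured pairwise in three contexts, each context demanding perfect anti-correlation. A brute-force check over all joint assignments shows that no single, context-independent joint assignment reproduces the statistics, so any classical model must let responses depend on the context itself, not only on a shared internal state.

```python
from itertools import product

# Three +/-1 observables A, B, C; contexts measure the pairs (A,B),
# (B,C), (A,C) and each demands perfect anti-correlation. With only two
# values, A != B and B != C force A == C, so all three constraints can
# never hold in one global joint assignment.
def satisfies_all_contexts(a, b, c):
    return a != b and b != c and a != c

joint_ok = any(satisfies_all_contexts(a, b, c)
               for a, b, c in product([-1, 1], repeat=3))
print(joint_ok)  # False: no context-independent assignment exists
```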
cs.AI / 4 / 2602.16727

Mobility-Aware Cache Framework for Scalable LLM-Based Human Mobility Simulation

面向移动性的缓存框架用于可扩展的基于大语言模型的人类移动性模拟
Yan, Hua, Tan, Heng, Zhang, Yingxue, Yang, Yu
Abstract
Large-scale human mobility simulation is critical for applications such as urban planning, epidemiology, and transportation analysis. Recent works treat large language models (LLMs) as human agents to simulate realistic mobility behaviors using structured reasoning, but their high computational cost limits scalability. To address this, we design a mobility-aware cache framework named MobCache that leverages reconstructible caches to enable efficient large-scale human mobility simulations. It consists of: (1) a reasoning component that encodes each reasoning step as a latent-space embedding and uses a latent-space evaluator to enable the reuse and recombination of reasoning steps; and (2) a decoding component that employs a lightweight decoder trained with mobility law-constrained distillation to translate latent-space reasoning chains into natural language, thereby improving simulation efficiency while maintaining fidelity. Experiments show that MobCache significantly improves efficiency across multiple dimensions while maintaining performance comparable to state-of-the-art LLM-based methods.
Chinese Translation
大规模人类移动性模拟对于城市规划、流行病学和交通分析等应用至关重要。近期的研究将大语言模型(LLMs)视为人类代理,利用结构化推理来模拟现实的移动行为,但其高计算成本限制了可扩展性。为此,我们设计了一种名为MobCache的面向移动性的缓存框架,利用可重构缓存以实现高效的大规模人类移动性模拟。该框架包括:(1) 一个推理组件,将每个推理步骤编码为潜在空间嵌入,并使用潜在空间评估器来实现推理步骤的重用和重组;(2) 一个解码组件,采用经过移动规律约束的蒸馏训练的轻量级解码器,将潜在空间推理链翻译为自然语言,从而在保持真实度的同时提高模拟效率。实验表明,MobCache在多个维度上显著提高了效率,同时保持了与最先进的基于LLM的方法相当的性能。
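The cache-reuse idea can be sketched generically. The `LatentCache` class, threshold, and toy embeddings below are illustrative assumptions, not the paper's implementation: a cached reasoning step is reused when a new query embedding is close enough in cosine similarity, and a miss falls back to the expensive model call.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

class LatentCache:
    def __init__(self, threshold=0.95):
        self.entries = []          # (embedding, cached reasoning step)
        self.threshold = threshold

    def lookup(self, query):
        hits = [(cosine(k, query), v) for k, v in self.entries]
        if hits and max(hits)[0] >= self.threshold:
            return max(hits)[1]    # reuse: skip the expensive call
        return None                # miss: recompute with the full model

    def insert(self, key, value):
        self.entries.append((key, value))

cache = LatentCache()
cache.insert((1.0, 0.0), "go to the office at 9am")
print(cache.lookup((0.99, 0.05)))  # near-duplicate query: cache hit
print(cache.lookup((0.0, 1.0)))    # unrelated query: miss -> None
```

MobCache's decoding component would then translate a reused latent chain back into natural language; that step is omitted here.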
cs.AI / 5 / 2602.16763

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

当人工智能基准达到饱和:基准饱和的系统研究
Akhtar, Mubashara, Reuel, Anka, Soni, Prajna, Ahuja, Sanchit, Ammanamanchi, Pawan Sasanka, Rawal, Ruchit, Zouhar, Vilém, Yadav, Srishti, Whitehouse, Chenxi, Ki, Dayeon, Mickel, Jennifer, Choshen, Leshem, Šuppa, Marek, Batzner, Jan, Chim, Jenny, Sania, Jeba, Long, Yanan, Rahmani, Hossein A., Knight, Christina, Nan, Yiyang, Raj, Jyoutir, Fan, Yu, Singh, Shubham, Sahoo, Subramanyam, Habba, Eliya, Gohar, Usman, Pawar, Siddhesh, Scholz, Robert, Subramonian, Arjun, Ni, Jingwei, Kochenderfer, Mykel, Koyejo, Sanmi, Sachan, Mrinmaya, Biderman, Stella, Talat, Zeerak, Ghosh, Avijit, Solaiman, Irene
Abstract
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
Chinese Translation
人工智能(AI)基准在衡量模型开发进展和指导部署决策中发挥着核心作用。然而,许多基准很快就会达到饱和状态,这意味着它们无法再区分表现最佳的模型,从而降低了其长期价值。在本研究中,我们分析了从主要模型开发者的技术报告中选取的60个大型语言模型(LLM)基准的饱和情况。为了识别导致饱和的因素,我们从任务设计、数据构建和评估格式等14个属性对基准进行了特征化。我们测试了五个假设,以检验每个属性如何影响饱和率。我们的分析显示,近一半的基准表现出饱和现象,且随着基准的老化,饱和率逐渐增加。值得注意的是,隐藏测试数据(即公共与私有)并未显示出保护作用,而专家策划的基准比众包的基准更能抵御饱和。我们的研究结果强调了哪些设计选择可以延长基准的寿命,并为更持久的评估策略提供了指导。
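One plausible way to operationalise saturation (an illustrative assumption, not the study's exact criterion) is to flag a benchmark once the score gap among its top-k models falls below a small margin, so it can no longer separate them:

```python
def is_saturated(scores, top_k=3, margin=1.0):
    """A benchmark is 'saturated' when its top-k scores cluster within
    `margin` points and thus no longer discriminate between models."""
    top = sorted(scores, reverse=True)[:top_k]
    return (top[0] - top[-1]) < margin

frontier = [98.9, 98.7, 98.4, 91.0]   # clustered top scores
emerging = [88.0, 81.5, 74.2, 60.3]   # still discriminative
print(is_saturated(frontier))  # True
print(is_saturated(emerging))  # False
```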
cs.AI / 6 / 2602.16805

Simple Baselines are Competitive with Code Evolution

简单基线与代码演化具有竞争力
Gideoni, Yonatan, Risi, Sebastian, Gal, Yarin
Abstract
Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existing code. Many proposed code evolution pipelines show impressive performance but are often not compared to simpler baselines. We test how well two simple baselines do over three domains: finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions. We find that simple baselines match or exceed much more sophisticated methods in all three. By analyzing these results we find various shortcomings in how code evolution is both developed and used. For the mathematical bounds, a problem's search space and domain knowledge in the prompt are chiefly what dictate a search's performance ceiling and efficiency, with the code evolution pipeline being secondary. Thus, the primary challenge in finding improved bounds is designing good search spaces, which is done by domain experts, and not the search itself. When designing agentic scaffolds we find that high variance in the scaffolds coupled with small datasets leads to suboptimal scaffolds being selected, resulting in hand-designed majority vote scaffolds performing best. We propose better evaluation methods that reduce evaluation stochasticity while keeping the code evolution economically feasible. We finish with a discussion of avenues and best practices to enable more rigorous code evolution in future work.
Chinese Translation
代码演化是一类依赖于大型语言模型的技术,通过演化或变异现有代码来搜索可能的计算机程序。许多提出的代码演化管道显示出令人印象深刻的性能,但往往没有与更简单的基线进行比较。我们测试了两个简单基线在三个领域的表现:寻找更好的数学界限、设计代理支架和机器学习竞赛。我们发现简单基线在所有三个领域的表现与更复杂的方法相匹配或超越。通过分析这些结果,我们发现代码演化在开发和使用方面存在各种不足。在数学界限方面,问题的搜索空间和提示中的领域知识主要决定了搜索的性能上限和效率,而代码演化管道则是次要的。因此,寻找改进界限的主要挑战在于设计良好的搜索空间,这由领域专家完成,而不是搜索本身。在设计代理支架时,我们发现支架的高方差与小数据集结合,导致选择到次优的支架,最终导致手工设计的多数投票支架表现最佳。我们提出了更好的评估方法,以减少评估的随机性,同时保持代码演化的经济可行性。最后,我们讨论了未来工作中促进更严格的代码演化的途径和最佳实践。
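The hand-designed majority-vote scaffold the abstract mentions can be sketched in a few lines; the sampler below is a deterministic stub standing in for repeated LLM calls, and all names are illustrative:

```python
from collections import Counter

def majority_vote(sample_answer, n_samples=5):
    """Sample several candidate answers and return the most common one."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stub sampler mimicking a noisy model: usually right, sometimes wrong.
replies = iter(["42", "42", "17", "42", "42"])
print(majority_vote(lambda: next(replies)))  # -> "42"
```

The appeal of such a baseline is precisely its low variance under small evaluation sets, which the abstract identifies as where evolved scaffolds go wrong.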
cs.AI / 7 / 2602.16807

Improved Upper Bounds for Slicing the Hypercube

改进的超立方体切割上界
Soiffer, Duncan, Itty, Nathaniel, Rosin, Christopher D., Bruell, Blake, DiCicco, Mason, Sárközy, Gábor N., Offstein, Ryan, Reichman, Daniel
Abstract
A collection of hyperplanes $\mathcal{H}$ slices all edges of the $n$-dimensional hypercube $Q_n$ with vertex set $\{-1,1\}^n$ if, for every edge $e$ in the hypercube, there exists a hyperplane in $\mathcal{H}$ intersecting $e$ in its interior. Let $S(n)$ be the minimum number of hyperplanes needed to slice $Q_n$. We prove that $S(n) \leq \lceil \frac{4n}{5} \rceil$, except when $n$ is an odd multiple of $5$, in which case $S(n) \leq \frac{4n}{5} +1$. This improves upon the previously known upper bound of $S(n) \leq \lceil\frac{5n}{6} \rceil$ due to Paterson reported in 1971. We also obtain new lower bounds on the maximum number of edges in $Q_n$ that can be sliced using $k
Chinese Translation
一组超平面 $\mathcal{H}$ 切割 $n$ 维超立方体 $Q_n$(顶点集为 $\{-1,1\}^n$)的所有边,是指对于超立方体中的每一条边 $e$,$\mathcal{H}$ 中都存在一个超平面与 $e$ 在其内部相交。设 $S(n)$ 为切割 $Q_n$ 所需的最小超平面数量。我们证明 $S(n) \leq \lceil \frac{4n}{5} \rceil$,除非 $n$ 是 $5$ 的奇数倍,此时 $S(n) \leq \frac{4n}{5} + 1$。这改进了 Paterson 于 1971 年给出的已知上界 $S(n) \leq \lceil \frac{5n}{6} \rceil$。我们还获得了关于使用 $k
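The slicing condition is easy to check by brute force for small $n$: a hyperplane $a \cdot x = b$ slices edge $(u, v)$ in its interior iff $a \cdot u - b$ and $a \cdot v - b$ have strictly opposite signs. The sketch below verifies only the trivial coordinate-hyperplane construction giving $S(n) \le n$, not the paper's $\lceil 4n/5 \rceil$ construction:

```python
from itertools import product

def slices_all_edges(hyperplanes, n):
    """Check whether every edge of Q_n is sliced in its interior by some
    hyperplane (a, b), i.e. a.u - b and a.v - b have opposite signs."""
    verts = list(product([-1, 1], repeat=n))
    for u in verts:
        for i in range(n):
            if u[i] == 1:
                continue                       # count each edge once
            v = u[:i] + (1,) + u[i + 1:]       # neighbour in direction i
            if not any((sum(a[j] * u[j] for j in range(n)) - b)
                       * (sum(a[j] * v[j] for j in range(n)) - b) < 0
                       for a, b in hyperplanes):
                return False
    return True

n = 3
# Coordinate hyperplanes x_i = 0: hyperplane i slices exactly the edges
# in direction i, so n of them slice everything.
coord = [(tuple(1 if j == i else 0 for j in range(n)), 0) for i in range(n)]
print(slices_all_edges(coord, n))      # True
print(slices_all_edges(coord[:2], n))  # False: direction-3 edges survive
```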
cs.AI / 8 / 2602.16812

NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron Crystallography

NeuDiff Agent:用于单晶中子晶体学的受控人工智能工作流程
Xiao, Zhongcan, Zhang, Leyi, Zhang, Guannan, Wang, Xiaoping
Abstract
Large-scale facilities increasingly face analysis and reporting latency as the limiting step in scientific throughput, particularly for structurally and magnetically complex samples that require iterative reduction, integration, refinement, and validation. To improve time-to-result and analysis efficiency, NeuDiff Agent is introduced as a governed, tool-using AI workflow for TOPAZ at the Spallation Neutron Source that takes instrument data products through reduction, integration, refinement, and validation to a validated crystal structure and a publication-ready CIF. NeuDiff Agent executes this established pipeline under explicit governance by restricting actions to allowlisted tools, enforcing fail-closed verification gates at key workflow boundaries, and capturing complete provenance for inspection, auditing, and controlled replay. Performance is assessed using a fixed prompt protocol and repeated end-to-end runs with two large language model backends, with user and machine time partitioned and intervention burden and recovery behaviors quantified under gating. In a reference-case benchmark, NeuDiff Agent reduces wall time from 435 minutes (manual) to 86.5(4.7) to 94.4(3.5) minutes (4.6-5.0x faster) while producing a validated CIF with no checkCIF level A or B alerts. These results establish a practical route to deploy agentic AI in facility crystallography while preserving traceability and publication-facing validation requirements.
Chinese Translation
大型设施的科学产出日益受到分析与报告延迟的制约,尤其是对于需要反复进行数据约化、积分、精修和验证的结构与磁性复杂样品。为缩短出结果的时间并提高分析效率,本文介绍了NeuDiff Agent,这是一种受治理约束、会使用工具的人工智能工作流程,应用于散裂中子源(Spallation Neutron Source)的TOPAZ仪器,它将仪器数据产品经过约化、积分、精修和验证处理,最终得到经过验证的晶体结构和可供发表的CIF文件。NeuDiff Agent在明确的治理下执行这一既定流程:将操作限制在白名单工具之内,在关键工作流程边界强制实施"失败即关闭"的验证门,并捕捉完整的来源信息以供检查、审计和受控重放。性能评估采用固定提示协议,并使用两个大型语言模型后端进行重复的端到端运行,区分用户时间与机器时间,并在门控下量化干预负担和恢复行为。在参考案例基准中,NeuDiff Agent将墙钟时间从435分钟(手动)减少到86.5(4.7)至94.4(3.5)分钟(快4.6-5.0倍),同时生成的CIF经验证无checkCIF A级或B级警报。这些结果为在设施晶体学中部署智能体人工智能提供了一条切实可行的路径,同时保持可追溯性和面向出版的验证要求。
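Two of the governance ideas named in the abstract, allowlisted tools and fail-closed verification gates with provenance capture, can be sketched generically. This is illustrative code, not NeuDiff Agent's actual implementation; the tool names and the `GateFailure` exception are assumptions.

```python
# Only these pipeline stages may ever be invoked.
ALLOWLIST = {"reduce", "integrate", "refine", "validate"}

class GateFailure(Exception):
    """Raised whenever governance blocks a step (fail-closed)."""

def run_step(tool, check, provenance):
    if tool not in ALLOWLIST:
        raise GateFailure(f"tool {tool!r} not allowlisted")
    result = f"{tool}: ok"             # stand-in for the real tool call
    if not check(result):              # fail-closed: no pass, no progress
        raise GateFailure(f"gate rejected output of {tool!r}")
    provenance.append((tool, result))  # complete, replayable record
    return result

log = []
run_step("reduce", lambda r: r.endswith("ok"), log)
try:
    run_step("delete_raw_data", lambda r: True, log)
except GateFailure as e:
    print(e)   # blocked before execution
print(log)     # only the allowlisted, verified step was recorded
```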
cs.AI / 9 / 2602.16814

Node Learning: A Framework for Adaptive, Decentralised and Collaborative Network Edge AI

节点学习:一种自适应、去中心化和协作的网络边缘人工智能框架
Kanjo, Eiman, Aslanov, Mustafa
Abstract
The expansion of AI toward the edge increasingly exposes the cost and fragility of centralised intelligence. Data transmission, latency, energy consumption, and dependence on large data centres create bottlenecks that scale poorly across heterogeneous, mobile, and resource-constrained environments. In this paper, we introduce Node Learning, a decentralised learning paradigm in which intelligence resides at individual edge nodes and expands through selective peer interaction. Nodes learn continuously from local data, maintain their own model state, and exchange learned knowledge opportunistically when collaboration is beneficial. Learning propagates through overlap and diffusion rather than global synchronisation or central aggregation. It unifies autonomous and cooperative behaviour within a single abstraction and accommodates heterogeneity in data, hardware, objectives, and connectivity. This concept paper develops the conceptual foundations of this paradigm, contrasts it with existing decentralised approaches, and examines implications for communication, hardware, trust, and governance. Node Learning does not discard existing paradigms, but places them within a broader decentralised perspective.
Chinese Translation
人工智能向边缘的扩展日益暴露出集中式智能的成本和脆弱性。数据传输、延迟、能耗以及对大型数据中心的依赖在异构、移动和资源受限的环境中造成了难以扩展的瓶颈。在本文中,我们介绍了节点学习(Node Learning),一种去中心化的学习范式,其中智能驻留在各个边缘节点,并通过选择性同行互动进行扩展。节点从本地数据中持续学习,维护自己的模型状态,并在合作有利时机会性地交换学习到的知识。学习通过重叠和扩散传播,而不是通过全球同步或中心聚合。它在单一抽象中统一了自主和合作行为,并适应数据、硬件、目标和连接性的异质性。本文概念性地发展了这一范式的概念基础,与现有的去中心化方法进行了对比,并考察了对通信、硬件、信任和治理的影响。节点学习并不抛弃现有范式,而是将其置于更广泛的去中心化视角之中。
cs.AI / 10 / 2602.16827

An order-oriented approach to scoring hesitant fuzzy elements

面向顺序的犹豫模糊元素评分方法
Merino, Luis, Navarro, Gabriel, Salvatierra, Carlos, Santos, Evangelina
Abstract
Traditional scoring approaches on hesitant fuzzy sets often lack a formal basis in order theory. This paper proposes a unified framework, where each score is explicitly defined with respect to a given order. This order-oriented perspective enables more flexible and coherent scoring mechanisms. We examine several classical orders on hesitant fuzzy elements, that is, nonempty subsets of [0,1], and show that, contrary to prior claims, they do not induce lattice structures. In contrast, we prove that the scores defined with respect to the symmetric order satisfy key normative criteria for scoring functions, including strong monotonicity with respect to unions and the Gärdenfors condition. Following this analysis, we introduce a class of functions, called dominance functions, for ranking hesitant fuzzy elements. They aim to compare hesitant fuzzy elements relative to control sets incorporating minimum acceptability thresholds. Two concrete examples of dominance functions for finite sets are provided: the discrete dominance function and the relative dominance function. We show that these can be employed to construct fuzzy preference relations on typical hesitant fuzzy sets and support group decision-making.
Chinese Translation
传统的犹豫模糊集评分方法通常缺乏在序理论中的正式基础。本文提出了一个统一框架,其中每个评分都是相对于给定顺序明确定义的。这种面向顺序的视角使得评分机制更加灵活和一致。我们考察了几种经典的犹豫模糊元素的顺序,即在[0,1]中的非空子集,并表明,与之前的说法相反,它们并不诱导格结构。相反,我们证明了相对于对称顺序定义的评分满足评分函数的关键规范标准,包括相对于并集的强单调性和Gärdenfors条件。在此分析之后,我们引入了一类称为主导函数的函数,用于对犹豫模糊元素进行排序。它们旨在相对于包含最低可接受阈值的控制集比较犹豫模糊元素。我们提供了两个有限集的主导函数的具体例子:离散主导函数和相对主导函数。我们展示了这些函数可以用于构建典型犹豫模糊集上的模糊偏好关系,并支持群体决策。
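A hypothetical illustration of a dominance-style score (the paper's exact definitions may well differ; the threshold semantics here are an assumption): score a hesitant fuzzy element, a finite subset of [0, 1], against a control set by the fraction of admissible membership values that exceed each control value.

```python
def discrete_dominance(element, control, threshold=0.5):
    """Toy dominance score: among memberships meeting the minimum
    acceptability threshold, the fraction of (element, control) pairs
    the element wins strictly."""
    admissible = [x for x in element if x >= threshold]
    if not admissible:
        return 0.0
    wins = sum(1 for x in admissible for c in control if x > c)
    return wins / (len(admissible) * len(control))

h1 = {0.3, 0.6, 0.9}
h2 = {0.4, 0.55}
control = [0.5, 0.7]
print(discrete_dominance(h1, control))  # 0.75: h1 ranks above h2
print(discrete_dominance(h2, control))  # 0.5
```

Pairwise scores of this kind could then populate a fuzzy preference relation over a collection of hesitant fuzzy elements, as the abstract suggests for group decision-making.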
cs.AI / 11 / 2602.16832

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

IndicJR:南亚语言中无评判的越狱鲁棒性基准
Pattnayak, Priyaranjan, Chowdhuri, Sanchari
Abstract
Safety alignment of large language models (LLMs) is mostly evaluated in English and in contract-bound settings, leaving multilingual vulnerabilities understudied. We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach 1.0 as refusals collapse. (2) English-to-Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approx. 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.
Chinese Translation
大型语言模型(LLMs)的安全对齐主要在英语中进行评估,并且受到合同限制,导致多语言的脆弱性研究不足。我们引入了Indic Jailbreak Robustness(IJR),这是一个针对12种印度及南亚语言(21亿说话者)的对抗安全性无评判基准,涵盖45216个提示,分为JSON(合同限制)和Free(自然语言)两个轨道。IJR揭示了三个模式。(1)合同增加了拒绝率,但并未阻止越狱:在JSON中,LLaMA和Sarvam的越狱成功率超过0.92,而在Free中,所有模型的越狱成功率均达到1.0,拒绝率崩溃。(2)英语到印度语言的攻击具有强转移性,格式包装器往往优于指令包装器。(3)正字法很重要:在JSON下,罗马化或混合输入降低了越狱成功率,与罗马化比例和分词(约0.28到0.32)之间的相关性表明了系统性影响。人工审核确认了检测器的可靠性,轻量到全面的比较保留了结论。IJR提供了一个可重复的多语言压力测试,揭示了仅依赖英语和合同集中评估所隐藏的风险,特别是对于经常切换语言和罗马化的南亚用户。
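The judge-free setup can be sketched abstractly: the jailbreak success rate (JSR) is computed by a deterministic detector over model outputs rather than by an LLM judge. The keyword detector below is a toy stand-in, not the benchmark's human-audited, calibrated detector.

```python
def is_refusal(text):
    """Toy refusal detector over a few English refusal phrases."""
    return any(p in text.lower() for p in ("i cannot", "i can't", "refuse"))

def jsr(outputs):
    """Jailbreak success rate: fraction of outputs that are not refusals."""
    return sum(not is_refusal(o) for o in outputs) / len(outputs)

outputs = ["I cannot help with that.",
           "Sure, here is the requested payload...",
           "I refuse to answer."]
print(jsr(outputs))  # 1 of 3 outputs complied
```

Because the detector is deterministic, the same outputs always score the same, which is what makes the benchmark reproducible without judge variance.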
cs.AI / 12 / 2602.16855

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

移动代理-v3.5:多平台基础图形用户界面代理
Xu, Haiyang, Zhang, Xi, Liu, Haowei, Wang, Junyang, Zhu, Zhaozai, Zhou, Shengjie, Hu, Xuhao, Gao, Feiyu, Cao, Junjie, Wang, Zihua, Chen, Zhiyuan, Liao, Jitong, Zheng, Qi, Zeng, Jiahui, Xu, Ze, Bai, Shuai, Lin, Junyang, Zhou, Jingren, Yan, Ming
Abstract
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results among open-source models on 20+ GUI benchmarks: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool-calling tasks, it obtains 47.6 on OSWorld-MCP, and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI-Knowledge Bench. GUI-Owl-1.5 incorporates several key innovations: (1) Hybrid Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud-based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and multi-agent adaptation; (3) Multi-platform Environment RL Scaling: We propose a new environment RL algorithm, MRPO, to address the challenges of multi-platform conflicts and the low training efficiency of long-horizon tasks. The GUI-Owl-1.5 models are open-sourced, and an online cloud-sandbox demo is available at https://github.com/X-PLUG/MobileAgent.
Chinese Translation
本文介绍了最新的原生图形用户界面代理模型GUI-Owl-1.5,该模型具有多种尺寸(2B/4B/8B/32B/235B)的指令/思维变体,并支持多种平台(桌面、移动、浏览器等),以实现云边协作和实时交互。GUI-Owl-1.5在20多个图形用户界面基准测试中取得了开源模型中的最先进结果:(1)在图形用户界面自动化任务中,OSWorld得分56.5,AndroidWorld得分71.6,WebArena得分48.4;(2)在定位(grounding)任务中,ScreenSpotPro得分80.3;(3)在工具调用任务中,OSWorld-MCP得分47.6,MobileWorld得分46.8;(4)在记忆和知识任务中,GUI-Knowledge Bench得分75.5。GUI-Owl-1.5结合了几个关键创新:(1)混合数据飞轮:我们构建了基于模拟环境和云沙盒环境相结合的用户界面理解和轨迹生成的数据管道,以提高数据收集的效率和质量。(2)代理能力的统一增强:我们使用统一的思维合成管道来增强模型的推理能力,同时特别强调改善关键代理能力,包括工具/MCP使用、记忆和多代理适应;(3)多平台环境强化学习扩展:我们提出了一种新的环境强化学习算法MRPO,以应对多平台冲突和长时间任务低训练效率的挑战。GUI-Owl-1.5模型已开源,在线云沙盒演示可在https://github.com/X-PLUG/MobileAgent获取。
cs.AI / 13 / 2602.16891

OpenSage: Self-programming Agent Generation Engine

OpenSage:自编程代理生成引擎
Li, Hongwei, Wang, Zhun, Dai, Qinrun, Nie, Yuzhou, Peng, Jinjun, Liu, Ruitong, Zhang, Jingyang, Zhu, Kaijie, He, Jingxuan, Wang, Lun, Ding, Yangruibo, Chen, Yueqi, Guo, Wenbo, Song, Dawn
Abstract
Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents' performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these components, limiting agents' generalizability and overall performance. We propose OpenSage, the first ADK that enables LLMs to automatically create agents with self-generated topology and toolsets while providing comprehensive and structured memory support. OpenSage offers effective functionality for agents to create and manage their own sub-agents and toolkits. It also features a hierarchical, graph-based memory system for efficient management and a specialized toolkit tailored to software engineering tasks. Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate the advantages of OpenSage over existing ADKs. We also conduct rigorous ablation studies to demonstrate the effectiveness of our design for each component. We believe OpenSage can pave the way for the next generation of agent development, shifting the focus from human-centered to AI-centered paradigms.
Chinese Translation
代理开发工具包(ADKs)为构建代理提供了有效的平台和工具,其设计对构建代理的性能至关重要,尤其是在代理拓扑、工具和记忆功能方面。然而,目前的 ADKs 要么缺乏足够的功能支持,要么依赖人类手动设计这些组件,从而限制了代理的通用性和整体性能。我们提出了 OpenSage,这是第一个能够使大型语言模型(LLMs)自动创建具有自生成拓扑和工具集的代理的 ADK,同时提供全面且结构化的记忆支持。OpenSage 为代理创建和管理自己的子代理和工具包提供了有效的功能。它还具有一个分层的基于图的记忆系统,以实现高效管理,并配备了专门针对软件工程任务的工具包。在三个最先进的基准测试中,针对各种基础模型的广泛实验展示了 OpenSage 相较于现有 ADKs 的优势。我们还进行了严格的消融研究,以证明我们设计中每个组件的有效性。我们相信,OpenSage 可以为下一代代理开发铺平道路,将重点从以人为中心的范式转向以人工智能为中心的范式。
cs.AI / 14 / 2602.16901

AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks

AgentLAB:针对长时间攻击的LLM代理基准测试
Jiang, Tanqiu, Wang, Yuhui, Liang, Jiacheng, Wang, Ting
Abstract
LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horizon attacks that exploit multi-turn user-agent-environment interactions to achieve objectives infeasible in single-turn settings. To measure agent vulnerabilities to such risks, we present AgentLAB, the first benchmark dedicated to evaluating LLM agent susceptibility to adaptive, long-horizon attacks. Currently, AgentLAB supports five novel attack types including intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning, spanning 28 realistic agentic environments, and 644 security test cases. Leveraging AgentLAB, we evaluate representative LLM agents and find that they remain highly susceptible to long-horizon attacks; moreover, defenses designed for single-turn interactions fail to reliably mitigate long-horizon threats. We anticipate that AgentLAB will serve as a valuable benchmark for tracking progress on securing LLM agents in practical settings. The benchmark is publicly available at https://tanqiujiang.github.io/AgentLAB_main.
Chinese Translation
LLM代理越来越多地被部署在长时间、复杂的环境中以解决具有挑战性的问题,但这种扩展使它们暴露于长时间攻击之下,这些攻击利用多轮用户-代理-环境交互来实现单轮设置中不可行的目标。为了衡量代理对这些风险的脆弱性,我们提出了AgentLAB,这是第一个专门用于评估LLM代理对自适应长时间攻击的敏感性的基准。目前,AgentLAB支持五种新型攻击类型,包括意图劫持、工具链、任务注入、目标漂移和记忆中毒,涵盖28个现实的代理环境和644个安全测试用例。利用AgentLAB,我们评估了具有代表性的LLM代理,发现它们对长时间攻击仍然高度敏感;此外,针对单轮交互设计的防御措施无法可靠地减轻长时间威胁。我们预计AgentLAB将成为跟踪在实际环境中保护LLM代理进展的有价值基准。该基准已公开发布,网址为 https://tanqiujiang.github.io/AgentLAB_main。
cs.AI / 15 / 2602.16902

LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs

LLM-WikiRace:基于真实世界知识图谱的长期规划与推理基准评估
Ziomek, Juliusz, Bankes, William, Wolf, Lorenz, Ramesh, Shyam Sundhar, Tang, Xiaohang, Bogunovic, Ilija
Abstract
We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23\% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
Chinese Translation
我们介绍了LLM-WikiRace,这是一个用于评估大型语言模型(LLMs)在规划、推理和世界知识方面的基准。在LLM-WikiRace中,模型必须逐步高效地导航维基百科超链接,从给定的源页面到达目标页面,这要求模型具备前瞻性规划能力以及推理概念在现实世界中如何相互关联的能力。我们评估了一系列开源和闭源模型,包括Gemini-3、GPT-5和Claude Opus 4.5,它们在任务的简单级别上取得了最佳结果,并展示了超人类的表现。尽管如此,在困难级别上,性能急剧下降:表现最佳的模型Gemini-3在仅有23%的困难游戏中成功,突显了前沿模型面临的重大挑战。我们的分析表明,世界知识是成功的必要因素,但仅限于某个程度,超过这一阈值后,规划和长期推理能力成为主导因素。轨迹级别的分析进一步揭示,即使是最强的模型在失败后也难以重新规划,常常陷入循环而无法恢复。LLM-WikiRace是一个简单的基准,揭示了当前推理系统的明显局限性,提供了一个开放的舞台,在这里具备规划能力的LLMs仍有许多待证明的地方。我们的代码和排行榜可在https://llmwikirace.github.io获取。
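A useful yardstick for this task (my sketch, not part of the benchmark itself) is breadth-first search over the hyperlink graph, which gives the shortest source-to-target path an agent could hope to match. The tiny link graph below is illustrative.

```python
from collections import deque

def shortest_path(links, source, target):
    """BFS over a page -> outgoing-links mapping; returns the shortest
    click path from source to target, or None if unreachable."""
    queue, seen = deque([[source]]), {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

links = {"Physics": ["Energy", "Galileo"],
         "Galileo": ["Italy"],
         "Energy": ["Thermodynamics"],
         "Italy": ["Pizza"]}
print(shortest_path(links, "Physics", "Pizza"))
```

Comparing an agent's trajectory length against this BFS optimum is one natural way to quantify planning efficiency on such navigation tasks.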
cs.AI / 16 / 2602.16931

Narrow fine-tuning erodes safety alignment in vision-language agents

狭义微调侵蚀视觉-语言智能体的安全对齐
Gulati, Idhant, Raval, Shivam
Abstract
Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10\% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
Chinese Translation
终身多模态智能体必须通过后训练不断适应新任务,但这在获取能力与保持安全对齐之间造成了根本性的紧张关系。我们证明了在狭域有害数据集上微调对齐的视觉-语言模型会导致严重的突现失调,这种失调在不相关的任务和模态中广泛泛化。通过对Gemma3-4B的实验,我们显示失调随LoRA秩单调增加,并且多模态评估显示出显著高于纯文本评估的失调水平(在 $r=128$ 时为 $70.71 \pm 1.22$,而纯文本评估为 $41.19 \pm 2.51$),这表明单模态安全基准可能低估了视觉-语言模型中的对齐退化。关键是,即使训练混合中仅有10%的有害数据也会导致显著的对齐退化。几何分析表明,有害行为占据了一个显著低维的子空间,大多数失调信息集中在10个主成分中。为了减轻失调,我们评估了两种策略:良性狭义微调和基于激活的引导。虽然这两种方法都显著减少了失调,但都未能完全消除学习到的有害行为。我们的发现强调了建立稳健的持续学习框架的必要性,因为当前的后训练范式可能无法充分保持部署后环境中的对齐。
cs.AI / 17 / 2602.16935

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

DeepContext:大规模语言模型中多轮对抗意图漂移的有状态实时检测
Albrethsen, Justin, Datta, Yash, Kumar, Kunal, Rajasekar, Sharath
Abstract
While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness facilitates a "Safety Gap" where adversarial tactics, like Crescendo and ActorAttack, slowly bleed malicious intent across turn boundaries to bypass stateless filters. We introduce DeepContext, a stateful monitoring framework designed to map the temporal trajectory of user intent. DeepContext discards the isolated evaluation model in favor of a Recurrent Neural Network (RNN) architecture that ingests a sequence of fine-tuned turn-level embeddings. By propagating a hidden state across the conversation, DeepContext captures the incremental accumulation of risk that stateless models overlook. Our evaluation demonstrates that DeepContext significantly outperforms existing baselines in multi-turn jailbreak detection, achieving a state-of-the-art F1 score of 0.84, which represents a substantial improvement over both hyperscaler cloud-provider guardrails and leading open-weight models such as Llama-Prompt-Guard-2 (0.67) and Granite-Guardian (0.67). Furthermore, DeepContext maintains a sub-20ms inference overhead on a T4 GPU, ensuring viability for real-time applications. These results suggest that modeling the sequential evolution of intent is a more effective and computationally efficient alternative to deploying massive, stateless models.
Chinese Translation
尽管大规模语言模型(LLM)的能力不断提升,但安全防护措施仍然主要是无状态的,将多轮对话视为一系列不相连的事件。这种缺乏时间意识的状态造成了一个“安全缺口”,使得对抗性策略(如 Crescendo 和 ActorAttack)能够在轮次边界之间缓慢渗透恶意意图,从而绕过无状态过滤器。我们提出了 DeepContext,这是一种有状态监测框架,旨在映射用户意图的时间轨迹。DeepContext 放弃了孤立的评估模型,采用了递归神经网络(RNN)架构,处理一系列经过微调的轮次级嵌入。通过在对话中传播隐藏状态,DeepContext 捕捉到无状态模型所忽视的风险逐步累积。我们的评估表明,DeepContext 在多轮越狱检测中显著优于现有基线,达到了 0.84 的最新 F1 分数,较超大规模云服务提供商的防护措施以及领先的开放权重模型(如 Llama-Prompt-Guard-2(0.67)和 Granite-Guardian(0.67))有了显著提升。此外,DeepContext 在 T4 GPU 上保持了低于 20 毫秒的推理开销,确保了实时应用的可行性。这些结果表明,建模意图的顺序演变是部署大规模无状态模型的更有效且计算上更高效的替代方案。
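The stateful idea can be reduced to a minimal sketch. DeepContext's actual architecture ingests fine-tuned turn-level embeddings; here, as a simplifying assumption, each turn is a scalar risk and a single recurrent unit with illustrative weights carries a hidden state, so many mildly risky turns can accumulate past a level no single turn would reach.

```python
import math

def conversation_risk(turn_risks, w_in=1.0, w_rec=0.8):
    """Single-unit recurrent scorer: the hidden state h propagates
    across turns, accumulating risk that a stateless, per-turn
    evaluation would never see."""
    h = 0.0
    for r in turn_risks:
        h = math.tanh(w_in * r + w_rec * h)  # hidden state carries history
    return h

benign = [0.1, 0.0, 0.1]
crescendo = [0.3, 0.4, 0.5, 0.6]   # intent drifts upward across turns
print(conversation_risk(benign) < conversation_risk(crescendo))  # True
```

A stateless filter scoring each turn of the crescendo in isolation would see only moderate values; the recurrence is what surfaces the trajectory.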
cs.AI / 18 / 2602.16942

SourceBench: Can AI Answers Reference Quality Web Sources?

SourceBench:人工智能答案能否引用高质量的网络来源?
Jin, Hexi, Liu, Stephen, Li, Yuheng, Malik, Simran, Zhang, Yiying
Abstract
Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
Chinese Translation
大型语言模型(LLMs)越来越多地通过引用网络来源来回答查询,但现有评估强调答案的正确性而非证据的质量。我们引入了SourceBench,这是一个用于衡量在100个真实世界查询中引用的网络来源质量的基准,这些查询涵盖了信息性、事实性、论证性、社交和购物意图。SourceBench使用一个涵盖内容质量(内容相关性、事实准确性、客观性)和页面级信号(例如,新鲜度、权威性/问责性、清晰度)的八项指标框架,并包括一个经过人工标注的数据集,配备了一个与专家判断高度一致的校准LLM评估器。我们使用SourceBench对八个LLM、谷歌搜索和三个AI搜索工具在3996个引用来源上进行了评估,并进行了进一步实验以理解评估结果。总体而言,我们的工作揭示了四个关键的新见解,这些见解可以指导未来在生成人工智能(GenAI)和网络搜索方向的研究。
cs.AI / 19 / 2602.16943

Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents

注意差距:文本安全性并不转移到大型语言模型代理的工具调用安全性
Cartagena, Arnold, Teixeira, Ariane
Abstract
Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level safety and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action--a divergence we formalize as the GAP metric. Even under safety-reinforced system prompts, 219 such cases persist across all six models. System prompt wording exerts substantial influence on tool-call behavior: TC-safe rates span 21 percentage points for the most robust model and 57 for the most prompt-sensitive, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool-call attempts themselves. These results demonstrate that text-only safety evaluations are insufficient for assessing agent behavior and that tool-call safety requires dedicated measurement and mitigation.
Chinese Translation
作为代理部署的大型语言模型越来越多地通过工具调用与外部系统互动——这些行为具有现实世界的后果,而仅仅依靠文本输出无法体现。然而,安全评估主要测量文本级别的拒绝行为,留下了一个关键问题未得到解答:抑制有害文本的对齐是否也抑制有害行为?我们引入了GAP基准,这是一个系统的评估框架,用于测量大型语言模型代理在文本级安全性和工具调用级安全性之间的差异。我们在六个受监管领域(制药、金融、教育、就业、法律和基础设施)测试了六个前沿模型,每个领域七个越狱场景,三种系统提示条件(中性、安全强化和工具鼓励),以及两种提示变体,生成了17,420个可分析的数据点。我们的核心发现是,文本安全性并不转移到工具调用安全性。在所有六个模型中,我们观察到模型的文本输出拒绝了有害请求,而其工具调用同时执行了被禁止的行为——这种差异我们正式定义为GAP指标。即使在安全强化的系统提示下,这种情况在所有六个模型中仍然存在219个。系统提示的措辞对工具调用行为产生了显著影响:对于最强健的模型,工具调用安全率的差异达到21个百分点,而对于最敏感的模型则达到57个百分点,在经过Bonferroni校正后,18个成对消融比较中仍有16个保持显著性。运行时治理合同减少了所有六个模型的信息泄露,但对被禁止的工具调用尝试本身并未产生可检测的威慑效果。这些结果表明,仅依靠文本安全评估不足以评估代理行为,工具调用安全性需要专门的测量和缓解措施。
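The GAP metric formalized above reduces to counting divergent datapoints. A minimal sketch, assuming each datapoint carries two hypothetical boolean labels:

```python
def gap_rate(records):
    """Fraction of datapoints where the text output refuses the request
    but a tool call nevertheless executes the forbidden action."""
    divergent = [r for r in records if r["text_refused"] and r["tool_executed"]]
    return len(divergent) / len(records)
```

Only the refuse-in-text, execute-in-tools combination counts toward the gap; consistently safe or consistently unsafe behavior is captured by other metrics.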
cs.AI / 20 / 2602.16953

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

LLM4Cov:面向执行的代理学习用于高覆盖率测试平台生成
Zhang, Hejia, Yu, Zhongming, Ho, Chia-Tung, Ren, Haoxing, Khailany, Brucek, Zhao, Jishen
Abstract
Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.
Chinese Translation
面向执行的LLM代理为从工具反馈中学习提供了一个有前景的范式,但此类反馈往往获取成本高且速度慢,使得在线强化学习(RL)变得不切实际。高覆盖率硬件验证正是这一挑战的典型例子,因为它依赖于工业模拟器和不可微分的执行信号。我们提出了LLM4Cov,一个离线代理学习框架,将验证建模为由确定性评估器引导的无记忆状态转移。在此基础上,我们引入了执行验证的数据整理、策略感知的代理数据合成以及最坏状态优先采样,以实现可扩展的执行约束下的学习。我们进一步通过修订的评估协议整理了一个与现实对齐的基准,改编自现有的验证套件。使用所提出的管道,一个紧凑的4B参数模型在代理评估下实现了69.2%的覆盖率通过率,超越了其教师模型5.3%,并在与数量级更大的模型的比较中表现出竞争力。
cs.AI / 21 / 2602.16958

Automating Agent Hijacking via Structural Template Injection

通过结构化模板注入自动化代理劫持
Deng, Xinhao, Wu, Jiaqing, Chen, Miao, Xiao, Yue, Xu, Ke, Li, Qi
Abstract
Agent hijacking, highlighted by OWASP as a critical threat to the Large Language Model (LLM) ecosystem, enables adversaries to manipulate execution by injecting malicious instructions into retrieved content. Most existing attacks rely on manually crafted, semantics-driven prompt manipulation, which often yields low attack success rates and limited transferability to closed-source commercial models. In this paper, we propose Phantom, an automated agent hijacking framework built upon Structured Template Injection that targets the fundamental architectural mechanisms of LLM agents. Our key insight is that agents rely on specific chat template tokens to separate system, user, assistant, and tool instructions. By injecting optimized structured templates into the retrieved context, we induce role confusion and cause the agent to misinterpret the injected content as legitimate user instructions or prior tool outputs. To enhance attack transferability against black-box agents, Phantom introduces a novel attack template search framework. We first perform multi-level template augmentation to increase structural diversity and then train a Template Autoencoder (TAE) to embed discrete templates into a continuous, searchable latent space. Subsequently, we apply Bayesian optimization to efficiently identify optimal adversarial vectors that are decoded into high-potency structured templates. Extensive experiments on Qwen, GPT, and Gemini demonstrate that our framework significantly outperforms existing baselines in both Attack Success Rate (ASR) and query efficiency. Moreover, we identified over 70 vulnerabilities in real-world commercial products that have been confirmed by vendors, underscoring the practical severity of structured template-based hijacking and providing an empirical foundation for securing next-generation agentic systems.
Chinese Translation
代理劫持被OWASP强调为对大型语言模型(LLM)生态系统的重大威胁,使对手能够通过将恶意指令注入检索内容来操控执行。现有的大多数攻击依赖于手工制作的、基于语义的提示操控,这往往导致攻击成功率低且对闭源商业模型的迁移性有限。本文提出了Phantom,一个基于结构化模板注入的自动化代理劫持框架,旨在针对LLM代理的基本架构机制。我们的关键见解是,代理依赖特定的聊天模板标记来区分系统、用户、助手和工具指令。通过将优化的结构化模板注入到检索上下文中,我们引发角色混淆,使代理误将注入内容解读为合法的用户指令或先前的工具输出。为了增强对黑箱代理的攻击迁移性,Phantom引入了一种新颖的攻击模板搜索框架。我们首先进行多层次模板增强,以增加结构多样性,然后训练一个模板自编码器(Template Autoencoder, TAE)将离散模板嵌入到一个连续的、可搜索的潜在空间中。随后,我们应用贝叶斯优化有效识别解码为高效结构化模板的最佳对抗向量。在Qwen、GPT和Gemini上的大量实验表明,我们的框架在攻击成功率(Attack Success Rate, ASR)和查询效率方面显著优于现有基线。此外,我们在现实世界的商业产品中识别出超过70个漏洞,并得到了供应商的确认,突显了基于结构化模板的劫持的实际严重性,并为保护下一代代理系统提供了实证基础。
cs.AI / 22 / 2602.16976

HQFS: Hybrid Quantum Classical Financial Security with VQC Forecasting, QUBO Annealing, and Audit-Ready Post-Quantum Signing

HQFS:结合变分量子电路预测、QUBO退火和审计准备的混合量子经典金融安全
Nayak, Srikumar
Abstract
Financial risk systems usually follow a two-step routine: a model predicts return or risk, and then an optimizer makes a decision such as a portfolio rebalance. In practice, this split can break under real constraints. The prediction model may look good, but the final decision can be unstable when the market shifts, when discrete constraints are added (lot sizes, caps), or when the optimization becomes slow for larger asset sets. Also, regulated settings need a clear audit trail that links each decision to the exact model state and inputs. We present HQFS, a practical hybrid pipeline that connects forecasting, discrete risk optimization, and auditability in one flow. First, HQFS learns next-step return and a volatility proxy using a variational quantum circuit (VQC) with a small classical head. Second, HQFS converts the risk-return objective and constraints into a QUBO and solves it with quantum annealing when available, while keeping a compatible classical QUBO solver as a fallback for deployment. Third, HQFS signs each rebalance output using a post-quantum signature so the allocation can be verified later without trusting the runtime environment. On our market dataset study, HQFS reduces return prediction error by 7.8% and volatility prediction error by 6.1% versus a tuned classical baseline. For the decision layer, HQFS improves out-of-sample Sharpe by 9.4% and lowers maximum drawdown by 11.7%. The QUBO solve stage also cuts average solve time by 28% compared to a mixed-integer baseline under the same constraints, while producing fully traceable, signed allocation records.
Chinese Translation
金融风险系统通常遵循两个步骤的常规:一个模型预测收益或风险,然后优化器做出决策,例如投资组合再平衡。在实际操作中,这种分离可能会在真实约束下失效。预测模型可能看起来良好,但当市场发生变化、添加离散约束(如批量大小、上限)或在较大资产集上优化变得缓慢时,最终决策可能会不稳定。此外,受监管的环境需要清晰的审计轨迹,将每个决策与确切的模型状态和输入相链接。我们提出了HQFS,一个实用的混合管道,将预测、离散风险优化和审计能力连接在一个流程中。首先,HQFS使用一个小型经典头的变分量子电路(VQC)学习下一步的收益和波动率代理。其次,HQFS将风险-收益目标和约束转换为QUBO,并在可用时使用量子退火进行求解,同时保留一个兼容的经典QUBO求解器作为部署的后备。第三,HQFS使用后量子签名对每个再平衡输出进行签名,以便在不信任运行时环境的情况下,后续可以验证分配。在我们的市场数据集研究中,HQFS将收益预测误差降低了7.8%,波动率预测误差降低了6.1%,相比于经过调优的经典基线。在决策层面,HQFS将样本外夏普比率提高了9.4%,并将最大回撤降低了11.7%。在相同约束下,QUBO求解阶段的平均求解时间也比混合整数基线减少了28%,同时生成完全可追溯的签名分配记录。
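The "risk-return objective into a QUBO" step can be illustrated with a generic mean-variance selection under a cardinality penalty. This is a standard textbook encoding, not HQFS's exact formulation; the penalty weight and the brute-force solver (standing in for the annealer and the classical fallback) are assumptions:

```python
import numpy as np

def portfolio_qubo(mu, cov, risk_aversion=1.0, k=2, penalty=10.0):
    """Encode 'pick k assets maximizing return minus risk' as x^T Q x over
    x in {0,1}^n, folding penalty*(sum_i x_i - k)^2 into Q up to a constant."""
    n = len(mu)
    Q = risk_aversion * np.asarray(cov, float) + penalty * np.ones((n, n))
    Q[np.diag_indices(n)] += -np.asarray(mu, float) - 2 * penalty * k
    return Q

def solve_qubo_brute(Q):
    """Exhaustive solver: the classical stand-in for the annealing stage."""
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for m in range(2 ** n):
        x = np.array([(m >> i) & 1 for i in range(n)], float)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x
```

With a large enough penalty, the minimizer respects the cardinality constraint and selects the best risk-adjusted subset.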
cs.AI / 23 / 2602.16984

Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning

黑箱安全评估的基本限制:来自潜在上下文条件的信息论和计算障碍
Srivastava, Vishal
Abstract
Black-box safety evaluation of AI systems assumes model behavior on test distributions reliably predicts deployment performance. We formalize and challenge this assumption through latent context-conditioned policies -- models whose outputs depend on unobserved internal variables that are rare under evaluation but prevalent under deployment. We establish fundamental limits showing that no black-box evaluator can reliably estimate deployment risk for such models. (1) Passive evaluation: For evaluators sampling i.i.d. from D_eval, we prove minimax lower bounds via Le Cam's method: any estimator incurs expected absolute error >= (5/24)*delta*L approximately 0.208*delta*L, where delta is trigger probability under deployment and L is the loss gap. (2) Adaptive evaluation: Using a hash-based trigger construction and Yao's minimax principle, worst-case error remains >= delta*L/16 even for fully adaptive querying when D_dep is supported over a sufficiently large domain; detection requires Theta(1/epsilon) queries. (3) Computational separation: Under trapdoor one-way function assumptions, deployment environments possessing privileged information can activate unsafe behaviors that any polynomial-time evaluator without the trapdoor cannot distinguish. For white-box probing, estimating deployment risk to accuracy epsilon_R requires O(1/(gamma^2 * epsilon_R^2)) samples, where gamma = alpha_0 + alpha_1 - 1 measures probe quality, and we provide explicit bias correction under probe error. Our results quantify when black-box testing is statistically underdetermined and provide explicit criteria for when additional safeguards -- architectural constraints, training-time guarantees, interpretability, and deployment monitoring -- are mathematically necessary for worst-case safety assurance.
Chinese Translation
人工智能系统的黑箱安全评估假设模型在测试分布上的行为能够可靠地预测部署性能。我们通过潜在上下文条件策略形式化并挑战这一假设——这些模型的输出依赖于在评估中罕见但在部署中普遍存在的未观察到的内部变量。我们建立了基本限制,表明没有任何黑箱评估器能够可靠地估计此类模型的部署风险。(1) 被动评估:对于从 D_eval 中独立同分布采样的评估器,我们通过 Le Cam 方法证明了最小最大下界:任何估计器的期望绝对误差 >= (5/24)*delta*L,约为 0.208*delta*L,其中 delta 是部署下的触发概率,L 是损失差距。(2) 自适应评估:使用基于哈希的触发构造和姚的最小最大原则,即使采用完全自适应的查询,只要 D_dep 的支撑域足够大,最坏情况下的误差仍然 >= delta*L/16;检测需要 Theta(1/epsilon) 次查询。(3) 计算分离:在陷门单向函数假设下,拥有特权信息的部署环境可以激活不安全行为,而任何不具备该陷门的多项式时间评估器都无法将其区分。对于白箱探测,将部署风险估计到精度 epsilon_R 需要 O(1/(gamma^2 * epsilon_R^2)) 个样本,其中 gamma = alpha_0 + alpha_1 - 1 衡量探测质量,我们提供了在探测误差下的显式偏差修正。我们的结果量化了何时黑箱测试在统计上是欠定的,并提供了明确标准,说明何时额外的安全保障——架构约束、训练时保证、可解释性和部署监控——在数学上是确保最坏情况安全所必需的。
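The stated constants are easy to sanity-check numerically; delta and L below are hypothetical values:

```python
from fractions import Fraction

passive_const = Fraction(5, 24)    # Le Cam constant from the passive bound
adaptive_const = Fraction(1, 16)   # constant from the adaptive bound

delta, L = 0.01, 1.0               # hypothetical trigger probability / loss gap
passive_lower = float(passive_const) * delta * L
adaptive_lower = float(adaptive_const) * delta * L
```

So 5/24 ≈ 0.2083 matches the abstract's 0.208*delta*L figure, and the passive constant strictly exceeds the adaptive one: fully adaptive querying can reduce worst-case error, but never below delta*L/16.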
cs.AI / 24 / 2602.16990

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Conv-FinRe:一个基于对话和纵向的实用性基础金融推荐基准
Wang, Yan, Han, Yi, Qian, Lingfei, He, Yueru, Peng, Xueqing, Feng, Dongji, Xie, Zhuohan, Zhang, Vincent Jim, Guo, Rosie, Mo, Fengran, Huang, Jimin, Chen, Yankai, Liu, Xue, Nie, Jian-Yun
Abstract
Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.
Chinese Translation
大多数推荐基准评估模型模仿用户行为的能力。然而,在金融顾问领域,观察到的行为在市场波动下可能会受到噪声影响或短视,并可能与用户的长期目标相悖。因此,仅将用户选择视为唯一的真实依据,会将行为模仿与决策质量混淆。我们提出了Conv-FinRe,一个用于股票推荐的对话式和纵向基准,评估大型语言模型(LLMs)超越行为匹配。给定一个入职访谈、逐步市场背景和咨询对话,模型必须在固定的投资期限内生成排名。至关重要的是,Conv-FinRe提供了多视角的参考,区分描述性行为与基于投资者特定风险偏好的规范性实用性,从而能够诊断LLM是否遵循理性分析、模仿用户噪声或受到市场动量的驱动。我们从真实市场数据和人类决策轨迹构建了该基准,实例化了受控的咨询对话,并评估了一系列最先进的LLMs。结果揭示了理性决策质量与行为一致性之间的持续紧张关系:在基于实用性的排名中表现良好的模型往往无法匹配用户选择,而行为一致的模型则可能过拟合短期噪声。该数据集已在Hugging Face上公开发布,代码库可在GitHub上获取。
cs.AI / 25 / 2602.17001

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Sonar-TS:针对时间序列数据库的搜索-再验证自然语言查询
Tan, Zhao, Zhao, Yiji, Wang, Shiyu, Xu, Chang, Liang, Yuxuan, Liu, Xiping, Pan, Shirui, Jin, Ming
Abstract
Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.
Chinese Translation
时间序列数据库的自然语言查询(NLQ4TSDB)旨在帮助非专业用户从大量时间记录中检索有意义的事件、区间和摘要。然而,现有的文本到SQL方法并未针对形状或异常等连续形态意图进行设计,而时间序列模型在处理超长历史数据时也面临困难。为了解决这些挑战,我们提出了Sonar-TS,一个通过搜索-再验证管道来处理NLQ4TSDB的神经符号框架。类似于主动声纳,它利用特征索引通过SQL对候选窗口进行探测,随后生成Python程序以锁定并验证候选项与原始信号的匹配。为了实现有效评估,我们引入了NLQTSBench,这是第一个针对TSDB规模历史的NLQ设计的大规模基准。我们的实验突出了该领域内的独特挑战,并证明Sonar-TS能够有效处理传统方法无法应对的复杂时间查询。本研究呈现了NLQ4TSDB的首次系统性研究,提供了一个通用框架和评估标准,以促进未来的研究。
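The Search-Then-Verify pipeline can be sketched in miniature: a cheap per-window summary filter stands in for the SQL feature index (the "ping"), and an exact predicate over the raw segment stands in for the generated verification program. Both predicates below are hypothetical:

```python
def search_then_verify(series, window, ping, verify):
    """Stage 1: score windows using cheap summary stats (the 'index ping').
    Stage 2: run the precise check only on surviving candidates."""
    hits = []
    for start in range(0, len(series) - window + 1, window):
        seg = series[start:start + window]
        stats = {"min": min(seg), "max": max(seg), "mean": sum(seg) / window}
        if ping(stats):                       # coarse, index-level filter
            if verify(seg):                   # exact check on the raw signal
                hits.append((start, seg))
    return hits
```

The point of the split is that the expensive verification only ever touches windows the index already flagged, mirroring how an active sonar locks on after a ping.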
cs.AI / 26 / 2602.17015

Cinder: A fast and fair matchmaking system

Cinder:一个快速且公平的匹配系统
Pal, Saurav
Abstract
A fair and fast matchmaking system is an important component of modern multiplayer online games, directly impacting player retention and satisfaction. However, creating fair matches between lobbies (pre-made teams) of heterogeneous skill levels presents a significant challenge. Matching based simply on average team skill metrics, such as mean or median rating or rank, often results in unbalanced and one-sided games, particularly when skill distributions are wide or skewed. This paper introduces Cinder, a two-stage matchmaking system designed to provide fast and fair matches. Cinder first employs a rapid preliminary filter by comparing the "non-outlier" skill range of lobbies using the Ruzicka similarity index. Lobbies that pass this initial check are then evaluated using a more precise fairness metric. This second stage involves mapping player ranks to a non-linear set of skill buckets, generated from an inverted normal distribution, to provide higher granularity at average skill levels. The fairness of a potential match is then quantified using the Kantorovich distance on the lobbies' sorted bucket indices, producing a "Sanction Score." We demonstrate the system's viability by analyzing the distribution of Sanction Scores from 140 million simulated lobby pairings, providing a robust foundation for fair matchmaking thresholds.
Chinese Translation
一个公平且快速的匹配系统是现代多人在线游戏的重要组成部分,直接影响玩家的留存率和满意度。然而,在具有不同技能水平的团队(预组队)之间创建公平的匹配是一项重大挑战。仅仅基于平均团队技能指标(如平均值或中位数评分或排名)进行匹配,往往会导致不平衡和单方面的游戏,特别是在技能分布广泛或偏斜的情况下。本文介绍了Cinder,一个旨在提供快速和公平匹配的两阶段匹配系统。Cinder首先通过使用Ruzicka相似度指数比较团队的“非异常值”技能范围,进行快速的初步筛选。通过初步检查的团队随后使用更精确的公平性指标进行评估。第二阶段涉及将玩家排名映射到一组非线性的技能桶,这些技能桶是从反正态分布生成的,以在平均技能水平上提供更高的粒度。潜在匹配的公平性随后通过对团队的排序桶索引使用Kantorovich距离进行量化,从而产生“制裁分数”。我们通过分析1.4亿个模拟团队配对的制裁分数分布,展示了该系统的可行性,为公平匹配阈值提供了坚实的基础。
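Both quantities named above are standard. Ruzicka similarity is the generalized Jaccard index for non-negative vectors, and the Kantorovich (1-D Wasserstein) distance between two equal-length sorted lists reduces to the mean absolute pairwise gap; the skill-bucket construction itself is omitted here:

```python
def ruzicka(a, b):
    """Generalized Jaccard similarity between non-negative feature vectors."""
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

def sanction_score(buckets_a, buckets_b):
    """Kantorovich distance between sorted bucket indices: for equal-length
    sorted sequences this is the mean absolute pairwise difference."""
    a, b = sorted(buckets_a), sorted(buckets_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

A score of zero means the two lobbies occupy identical skill buckets; larger scores quantify how much "skill mass" must be moved to make them match.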
cs.AI / 27 / 2602.17016

M2F: Automated Formalization of Mathematical Literature at Scale

M2F:大规模数学文献的自动形式化
Wang, Zichen, Ma, Wanli, Ming, Zhenyu, Zhang, Gong, Yuan, Kun, Wen, Zaiwen
Abstract
Automated formalization of mathematics enables mechanical verification but remains limited to isolated theorems and short snippets. Scaling to textbooks and research papers is largely unaddressed, as it requires managing cross-file dependencies, resolving imports, and ensuring that entire projects compile end-to-end. We present M2F (Math-to-Formal), the first agentic framework for end-to-end, project-scale autoformalization in Lean. The framework operates in two stages. The statement compilation stage splits the document into atomic blocks, orders them via inferred dependencies, and repairs declaration skeletons until the project compiles, allowing placeholders in proofs. The proof repair stage closes these holes under fixed signatures using goal-conditioned local edits. Throughout both stages, M2F keeps the verifier in the loop, committing edits only when toolchain feedback confirms improvement. In approximately three weeks, M2F converts long-form mathematical sources into a project-scale Lean library of 153,853 lines from 479 pages of textbooks on real analysis and convex analysis, fully formalized as Lean declarations with accompanying proofs. This represents textbook-scale formalization at a pace that would typically require months or years of expert effort. On FATE-H, we achieve $96\%$ proof success (vs.\ $80\%$ for a strong baseline). Together, these results demonstrate that practical, large-scale automated formalization of mathematical literature is within reach. The full generated Lean code from our runs is available at https://github.com/optsuite/ReasBook.git.
Chinese Translation
数学的自动形式化使得机械验证成为可能,但仍然局限于孤立的定理和短小片段。将其扩展到教科书和研究论文的工作尚未得到充分解决,因为这需要管理跨文件的依赖关系、解析导入,并确保整个项目能够从头到尾编译。我们提出了M2F(Math-to-Formal),这是第一个用于在Lean中进行端到端项目规模自动形式化的代理框架。该框架分为两个阶段。声明编译阶段将文档拆分为原子块,通过推断的依赖关系对其进行排序,并修复声明框架,直到项目能够编译,允许在证明中使用占位符。证明修复阶段在固定签名下使用目标条件的局部编辑来填补这些空白。在这两个阶段中,M2F始终将验证器纳入循环,仅在工具链反馈确认改进时才提交编辑。在大约三周的时间里,M2F将长篇数学来源转换为一个包含153,853行的项目规模Lean库,涵盖了479页的实分析和凸分析教科书,完全形式化为Lean声明,并附有证明。这代表了教科书规模的形式化,其速度通常需要专家数月或数年的努力。在FATE-H上,我们实现了$96\%$的证明成功率(相比之下,强基线为$80\%$)。这些结果共同表明,数学文献的实用大规模自动形式化已触手可及。我们运行生成的完整Lean代码可在https://github.com/optsuite/ReasBook.git获取。
cs.AI / 28 / 2602.17017

Sales Research Agent and Sales Research Bench

销售研究代理与销售研究基准
Bhol, Deepanjan
Abstract
Enterprises increasingly need AI systems that can answer sales-leader questions over live, customized CRM data, but most available models do not expose transparent, repeatable evidence of quality. This paper describes the Sales Research Agent in Microsoft Dynamics 365 Sales, an AI-first application that connects to live CRM and related data, reasons over complex schemas, and produces decision-ready insights through text and chart outputs. To make quality observable, we introduce the Sales Research Bench, a purpose-built benchmark that scores systems on eight customer-weighted dimensions, including text and chart groundedness, relevance, explainability, schema accuracy, and chart quality. In a 200-question run on a customized enterprise schema on October 19, 2025, the Sales Research Agent outperformed Claude Sonnet 4.5 by 13 points and ChatGPT-5 by 24.1 points on the 100-point composite score, giving customers a repeatable way to compare AI solutions.
Chinese Translation
企业越来越需要能够基于实时定制的客户关系管理(CRM)数据回答销售领导问题的人工智能系统,但大多数现有模型并未提供透明、可重复的质量证据。本文描述了微软Dynamics 365 Sales中的销售研究代理,这是一个以人工智能为核心的应用,能够连接实时CRM及相关数据,推理复杂的模式,并通过文本和图表输出生成决策准备好的洞察。为了使质量可观察,我们引入了销售研究基准,这是一个专门构建的基准,基于八个客户加权维度对系统进行评分,包括文本和图表的基础性、相关性、可解释性、模式准确性和图表质量。在2025年10月19日对一个定制企业模式进行的200个问题的测试中,销售研究代理在100分的综合评分中超越了Claude Sonnet 4.5,领先13分,超越了ChatGPT-5,领先24.1分,为客户提供了一种可重复比较人工智能解决方案的方法。
cs.AI / 29 / 2602.17038

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

面向阶段的专家混合模型在自主强化学习中的应用
Yang, Shengtian, Li, Yu, He, Shuo, Li, Yewen, Cai, Qingpeng, Jiang, Peng, Feng, Lei
Abstract
Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a \emph{single} policy network, causing \emph{simplicity bias} where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose \textbf{Phase-Aware Mixture of Experts (PA-MoE)}. It first features a lightweight \emph{phase router} that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
Chinese Translation
强化学习(RL)使得大规模语言模型(LLM)代理具备了解决复杂任务的强大能力。然而,现有的强化学习方法通常使用\emph{单一}策略网络,导致\emph{简单性偏差},即简单任务占据了大部分参数并主导了梯度更新,从而使复杂任务的容量不足。一种可行的解决方案是在策略网络中采用专家混合(Mixture-of-Experts, MoE)架构,因为MoE允许不同的参数(专家)专注于不同的任务,防止简单任务主导所有参数。然而,传统MoE的一个关键限制是其基于令牌的路由,其中路由器将每个令牌分配给专门的专家,这会将阶段一致的模式分散到不同的专家分配中,从而削弱了专家的专业化。在本文中,我们提出了\textbf{面向阶段的专家混合模型(Phase-Aware Mixture of Experts, PA-MoE)}。该模型首先具有一个轻量级的\emph{阶段路由器},该路由器直接从强化学习目标中学习潜在的阶段边界,而无需预先定义阶段类别。然后,阶段路由器为同一专家分配时间一致的任务,从而使专家能够保持阶段特定的专业知识。实验结果证明了我们提出的PA-MoE的有效性。
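The phase-routing idea can be sketched as follows. This toy version hard-codes the boundary decision (score > 0) rather than learning it from the RL objective, and the expert functions are placeholders:

```python
import numpy as np

def phase_route(token_feats, boundary_scores, experts):
    """Toy phase-level routing: a positive boundary score opens a new latent
    phase, and every token until the next boundary is sent to the same expert,
    so phase-consistent patterns are not fragmented across experts."""
    phase_id = np.cumsum(np.asarray(boundary_scores) > 0)  # phase index per token
    outputs = [experts[phase_id[t] % len(experts)](feat)
               for t, feat in enumerate(token_feats)]
    return outputs, phase_id.tolist()
```

Contrast this with token-level routing, where each token could land on a different expert regardless of which phase it belongs to.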
cs.AI / 30 / 2602.17046

Dynamic System Instructions and Tool Exposure for Efficient Agentic LLMs

高效代理型大型语言模型的动态系统指令与工具曝光
Franko, Uria
Abstract
Large Language Model (LLM) agents often run for many steps while re-ingesting long system instructions and large tool catalogs each turn. This increases cost, agent derailment probability, latency, and tool-selection errors. We propose Instruction-Tool Retrieval (ITR), a RAG variant that retrieves, per step, only the minimal system-prompt fragments and the smallest necessary subset of tools. ITR composes a dynamic runtime system prompt and exposes a narrowed toolset with confidence-gated fallbacks. Using a controlled benchmark with internally consistent numbers, ITR reduces per-step context tokens by 95%, improves correct tool routing by 32% relative, and cuts end-to-end episode cost by 70% versus a monolithic baseline. These savings enable agents to run 2-20x more loops within context limits. Savings compound with the number of agent steps, making ITR particularly valuable for long-running autonomous agents. We detail the method, evaluation protocol, ablations, and operational guidance for practical deployment.
Chinese Translation
大型语言模型(LLM)代理通常会运行许多步骤,并在每个回合重新读入冗长的系统指令和庞大的工具目录。这增加了成本、代理偏离的概率、延迟以及工具选择错误。我们提出了指令-工具检索(Instruction-Tool Retrieval, ITR),这是一种检索增强生成(RAG)变体,每一步仅检索最小的系统提示片段和必要的最小工具子集。ITR 组合出动态的运行时系统提示,并暴露一个缩小的工具集,辅以置信度门控的回退机制。通过使用具有内部一致性数字的受控基准,ITR 将每步的上下文标记减少了 95%,相对提高了 32% 的正确工具路由,并将端到端情节成本相比单一基线降低了 70%。这些节省使得代理能够在上下文限制内运行 2-20 倍更多的循环。随着代理步骤数量的增加,节省效果会累积,使得 ITR 对于长时间运行的自主代理特别有价值。我们详细介绍了该方法、评估协议、消融实验以及实际部署的操作指导。
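Per-step retrieval of a narrowed toolset can be sketched with cosine similarity and a confidence gate. The embedder below is a hypothetical bag-of-words stand-in, and the threshold value is an assumption:

```python
import numpy as np

def retrieve_tools(step_query, tool_descs, embed, k=2, tau=0.35):
    """Per-step retrieval: score each tool description against the current
    step, expose only the top-k above a confidence threshold, and fall back
    to the single best match if nothing clears the gate."""
    q = embed(step_query)
    scored = sorted(
        ((float(q @ embed(d)) /
          (np.linalg.norm(q) * np.linalg.norm(embed(d))), name)
         for name, d in tool_descs.items()),
        reverse=True)
    picked = [name for sim, name in scored[:k] if sim >= tau]
    return picked or [scored[0][1]]      # confidence-gated fallback

# Hypothetical bag-of-words embedder, just for the sketch:
VOCAB = ["file", "read", "write", "email", "send"]
def bow(text):
    return np.array([text.count(w) for w in VOCAB], float)
```

Because only the retrieved fragments and tools enter the prompt each step, the context cost stays roughly constant no matter how large the full catalog grows.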
cs.AI / 31 / 2602.17049

IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents

IntentCUA:用于技能抽象和计算机使用代理的多智能体规划的意图级表示学习
Lee, Seoyoung, Yoon, Seobin, Lee, Seongbeen, Chun, Yoojung, Park, Dayoung, Kim, Doyeon, Sim, Joo Yong
Abstract
Computer-use agents operate over long horizons under noisy perception, multi-window contexts, evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.
Chinese Translation
计算机使用代理在噪声感知、多窗口上下文和不断变化的环境状态下进行长时间的操作。现有方法,从基于强化学习(RL)的规划器到轨迹检索,往往偏离用户意图,并重复解决常规子问题,导致错误积累和效率低下。我们提出了IntentCUA,一个旨在通过与意图对齐的计划记忆来稳定长时间执行的多智能体计算机使用框架。规划器、计划优化器和评论者在共享内存上协调工作,将原始交互轨迹抽象为多视角意图表示和可重用技能。在运行时,意图原型检索与子组对齐的技能,并将其注入部分计划中,从而减少冗余的重新规划,并减轻桌面应用程序中的错误传播。在端到端评估中,IntentCUA实现了74.83%的任务成功率和0.91的步骤效率比,优于基于RL和轨迹中心的基线。消融实验表明,多视角意图抽象和共享计划记忆共同提高了执行稳定性,合作的多智能体循环在长时间任务上提供了最大的收益。这些结果强调了系统级意图抽象和基于记忆的协调对于在大型动态环境中实现可靠和高效的桌面自动化的关键作用。
cs.AI / 32 / 2602.17053

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

RFEval:在大规模推理模型中基于反事实推理干预评估推理可信性
Han, Yunseok, Lee, Yejoon, Do, Jaeyoung
Abstract
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset can be found at project page: $\href{https://aidaslab.github.io/RFEval/}{https://aidaslab.github.io/RFEval/}$
Chinese Translation
大型推理模型(LRMs)表现出强大的性能,但往往产生听起来合理但未能反映其真实决策过程的推理,从而削弱了可靠性和信任度。我们引入了一个关于推理可信性的正式框架,该框架由两个可测试的条件定义:立场一致性(将推理与答案联系起来的连贯立场)和因果影响(所述推理在输出级干预下因果驱动答案),明确与准确性解耦。为了实现这一目标,我们提出了RFEval,这是一个涵盖七个任务的7186个实例的基准,通过受控的输出级反事实干预来探测可信性。在评估十二个开源LRM时,我们发现49.7%的输出存在不可信现象,主要源于立场不一致。失败集中在脆弱的、趋同的领域,如数学和代码,并且与后训练阶段的关系更大,而不是与规模相关:同类消融实验表明,在监督微调的基础上增加当前的强化学习(RL)风格目标可能会降低推理可信性,即使在保持准确性的情况下。至关重要的是,准确性既不是推理可信性的充分条件,也不是可靠的代理:一旦控制了模型和任务,准确性与可信性之间的联系就变得微弱且统计上不显著。我们的工作建立了审计LRM可靠性的严格方法论,并表明可信的人工智能不仅需要优化正确的结果,还需要优化推理过程的结构完整性。我们的代码和数据集可以在项目页面找到:$\href{https://aidaslab.github.io/RFEval/}{https://aidaslab.github.io/RFEval/}$
cs.AI / 33 / 2602.17062

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

在多智能体强化学习中保留次优动作以跟随变化的最优解
Jo, Yonghyeon, Lee, Sunwoo, Han, Seungyul
Abstract
Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.
Chinese Translation
价值分解是合作多智能体强化学习(MARL)的核心方法。然而,现有方法仍依赖于单一的最优动作,并在训练过程中当基础价值函数发生变化时难以适应,常常收敛到次优策略。为了解决这一局限性,我们提出了连续次值 Q 学习(Successive Sub-value Q-learning, S2Q),该方法学习多个次值函数以保留替代的高价值动作。将这些次值函数纳入基于 Softmax 的行为策略中,S2Q 鼓励持续探索,并使 $Q^{ ext{tot}}$ 能够快速调整以适应变化的最优解。在具有挑战性的 MARL 基准测试中的实验确认,S2Q 始终优于各种 MARL 算法,展现出更好的适应性和整体性能。我们的代码可在 https://github.com/hyeon1996/S2Q 获取。
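The behavior policy described above, a Softmax over multiple sub-value functions, can be sketched directly. The max-over-heads aggregation is one plausible reading of how the sub-values are combined, not necessarily the paper's exact rule:

```python
import numpy as np

def behavior_policy(sub_q_values, temperature=1.0):
    """sub_q_values: shape (num_sub_value_heads, num_actions). Taking the max
    over heads retains alternative high-value actions discovered by different
    sub-value functions; the softmax keeps sampling them, so the policy can
    follow the optimum when it shifts during training."""
    q = np.max(np.asarray(sub_q_values, float), axis=0)   # per-action optimism
    z = np.exp((q - q.max()) / temperature)               # stable softmax
    return z / z.sum()
```

With two heads each favoring a different action, the policy keeps both alive instead of collapsing onto a single currently optimal action.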
cs.AI / 34 / 2602.17066

Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization

预测批量调度:通过损失感知样本优先级加速语言模型训练
Rasal, Sumedh
Abstract
We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning approaches that require predefined difficulty metrics or hard example mining methods that demand expensive per-sample loss tracking, PBS employs a lightweight linear predictor trained online to estimate sample difficulty from static token-level features. Our predictor achieves 0.44 correlation with actual loss using only four simple features: token frequency, sequence length, vocabulary diversity, and rare token ratio. Experiments on a 130M parameter transformer demonstrate that PBS achieves 6-13\% faster convergence measured by evaluation loss across training checkpoints, with the predictor's correlation improving from 0.14 to 0.44 over 10,000 training steps. These results validate that token frequency statistics encode meaningful information about sample difficulty, enabling effective curriculum learning with negligible computational overhead.
Chinese Translation
我们提出了一种新的训练优化技术——预测批量调度(Predictive Batch Scheduling, PBS),通过在批量构建过程中动态优先考虑高损失样本,加速语言模型的收敛。与需要预定义难度指标的课程学习方法或要求昂贵的逐样本损失跟踪的困难样本挖掘方法不同,PBS采用了一种在线训练的轻量级线性预测器,根据静态的标记级特征来估计样本的难度。我们的预测器仅使用四个简单特征:标记频率、序列长度、词汇多样性和稀有标记比例,便实现了与实际损失的0.44相关性。在一个130M参数的变换器上进行的实验表明,PBS在训练检查点的评估损失上实现了6-13%的更快收敛,且预测器的相关性在10,000个训练步骤中从0.14提高到0.44。这些结果验证了标记频率统计编码了关于样本难度的有意义信息,从而使得有效的课程学习成为可能,并且几乎没有计算开销。
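The lightweight online predictor plus loss-aware batch construction can be sketched as below. The feature scaling, learning rate, and rare-token threshold are our assumptions; only the four feature names come from the abstract.

```python
def sample_features(tokens, vocab_freq, total):
    """The abstract's four static features: mean token frequency,
    sequence length, vocabulary diversity, rare-token ratio
    (the rare threshold and length scaling are our choices)."""
    n = len(tokens)
    mean_freq = sum(vocab_freq[t] for t in tokens) / (n * total)
    diversity = len(set(tokens)) / n
    rare = sum(1 for t in tokens if vocab_freq[t] <= 2) / n
    return [1.0, mean_freq, n / 100.0, diversity, rare]  # bias + 4 features

class LossPredictor:
    """Online linear regressor estimating per-sample loss via SGD."""
    def __init__(self, lr=0.05):
        self.w = [0.0] * 5
        self.lr = lr
    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))
    def update(self, x, loss):
        err = self.predict(x) - loss
        for i in range(5):
            self.w[i] -= self.lr * err * x[i]

def schedule_batch(samples, predictor, batch_size):
    """Prioritize the batch_size samples with the highest predicted loss."""
    ranked = sorted(samples, key=lambda s: predictor.predict(s["x"]), reverse=True)
    return ranked[:batch_size]
```

The predictor is updated with observed losses after each step, so its correlation with actual loss can improve over training, as the abstract reports.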
cs.AI / 35 / 2602.17084

How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses

AI 编码代理的沟通方式:拉取请求描述特征与人类审查响应的研究
Watanabe, Kan, Tsuchida, Rikuto, Monno, Takahiro, Huang, Bin, Yamasaki, Kazuma, Fan, Youmei, Shimari, Kazumasa, Matsumoto, Kenichi
Abstract
The rapid adoption of large language models has led to the emergence of AI coding agents that autonomously create pull requests on GitHub. However, how these agents differ in their pull request description characteristics, and how human reviewers respond to them, remains underexplored. In this study, we conduct an empirical analysis of pull requests created by five AI coding agents using the AIDev dataset. We analyze agent differences in pull request description characteristics, including structural features, and examine human reviewer response in terms of review activity, response timing, sentiment, and merge outcomes. We find that AI coding agents exhibit distinct PR description styles, which are associated with differences in reviewer engagement, response time, and merge outcomes. We observe notable variation across agents in both reviewer interaction metrics and merge rates. These findings highlight the role of pull request presentation and reviewer interaction dynamics in human-AI collaborative software development.
Chinese Translation
大型语言模型的快速采用导致了 AI 编码代理的出现,这些代理能够自主在 GitHub 上创建拉取请求。然而,这些代理在拉取请求描述特征上的差异,以及人类审查者对其的响应,仍然未得到充分探索。在本研究中,我们使用 AIDev 数据集对五个 AI 编码代理创建的拉取请求进行实证分析。我们分析了代理在拉取请求描述特征上的差异,包括结构特征,并考察了人类审查者在审查活动、响应时机、情感和合并结果等方面的响应。我们发现,AI 编码代理展现出独特的拉取请求描述风格,这与审查者的参与度、响应时间和合并结果存在关联。我们观察到不同代理在审查者互动指标和合并率方面存在显著差异。这些发现突显了拉取请求呈现方式和审查者互动动态在人工智能与人类协作软件开发中的重要性。
cs.AI / 36 / 2602.17096

Agentic Wireless Communication for 6G: Intent-Aware and Continuously Evolving Physical-Layer Intelligence

6G的自主无线通信:意图感知与持续演变的物理层智能
Li, Zhaoyang, Jin, Xingzhi, Pan, Junyu, Yang, Qianqian, Shi, Zhiguo
Abstract
As 6G wireless systems evolve, growing functional complexity and diverse service demands are driving a shift from rule-based control to intent-driven autonomous intelligence. User requirements are no longer captured by a single metric (e.g., throughput or reliability), but by multi-dimensional objectives such as latency sensitivity, energy preference, computational constraints, and service-level requirements. These objectives may also change over time due to environmental dynamics and user-network interactions. Therefore, accurate understanding of both the communication environment and user intent is critical for autonomous and sustainably evolving 6G communications. Large language models (LLMs), with strong contextual understanding and cross-modal reasoning, provide a promising foundation for intent-aware network agents. Compared with rule-driven or centrally optimized designs, LLM-based agents can integrate heterogeneous information and translate natural-language intents into executable control and configuration decisions. Focusing on a closed-loop pipeline of intent perception, autonomous decision making, and network execution, this paper investigates agentic AI for the 6G physical layer and its realization pathways. We review representative physical-layer tasks and their limitations in supporting intent awareness and autonomy, identify application scenarios where agentic AI is advantageous, and discuss key challenges and enabling technologies in multimodal perception, cross-layer decision making, and sustainable optimization. Finally, we present a case study of an intent-driven link decision agent, termed AgenCom, which adaptively constructs communication links under diverse user preferences and channel conditions.
Chinese Translation
随着6G无线系统的发展,日益增长的功能复杂性和多样化的服务需求推动了从基于规则的控制向以意图驱动的自主智能的转变。用户需求不再仅仅通过单一指标(例如吞吐量或可靠性)来捕捉,而是通过多维目标,如延迟敏感性、能量偏好、计算约束和服务水平要求。这些目标也可能因环境动态和用户与网络的交互而随时间变化。因此,准确理解通信环境和用户意图对于自主和可持续演变的6G通信至关重要。大型语言模型(LLMs)具有强大的上下文理解和跨模态推理能力,为意图感知网络代理提供了有前景的基础。与基于规则或集中优化设计相比,基于LLM的代理能够整合异构信息,并将自然语言意图转化为可执行的控制和配置决策。本文聚焦于意图感知、自动决策和网络执行的闭环流程,探讨了6G物理层的自主人工智能及其实现路径。我们回顾了代表性的物理层任务及其在支持意图感知和自主性方面的局限性,识别了自主人工智能具有优势的应用场景,并讨论了多模态感知、跨层决策和可持续优化中的关键挑战和促进技术。最后,我们展示了一个以意图驱动的链路决策代理的案例研究,称为AgenCom,该代理能够在多样的用户偏好和信道条件下自适应地构建通信链路。
cs.AI / 37 / 2602.17106

Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction

迈向可信赖的可持续性评级方法评估:人机协作框架用于基准数据集构建
Cai, Xiaoran, Yang, Wang, Ren, Xiyu, Law, Chekun, Sharma, Rohit, Qi, Peng
Abstract
Sustainability or ESG rating agencies use company disclosures and external data to produce scores or ratings that assess the environmental, social, and governance performance of a company. However, sustainability ratings across agencies for a single company vary widely, limiting their comparability, credibility, and relevance to decision-making. To harmonize the rating results, we propose adopting a universal human-AI collaboration framework to generate trustworthy benchmark datasets for evaluating sustainability rating methodologies. The framework comprises two complementary parts: STRIDE (Sustainability Trust Rating & Integrity Data Equation) provides principled criteria and a scoring system that guide the construction of firm-level benchmark datasets using large language models (LLMs), and SR-Delta, a discrepancy-analysis procedural framework that surfaces insights for potential adjustments. The framework enables scalable and comparable assessment of sustainability rating methodologies. We call on the broader AI community to adopt AI-powered approaches to strengthen and advance sustainability rating methodologies that support and enforce urgent sustainability agendas.
Chinese Translation
可持续性或环境、社会和治理(ESG)评级机构利用公司披露的信息和外部数据来生成评估公司环境、社会和治理表现的分数或评级。然而,不同机构对单一公司的可持续性评级差异很大,限制了其可比性、可信度和决策相关性。为了协调评级结果,我们建议采用一种通用的人机协作框架,以生成可信赖的基准数据集,用于评估可持续性评级方法。该框架由两个互补部分组成:STRIDE(可持续性信任评级与完整性数据方程)提供原则性标准和评分系统,指导利用大型语言模型(LLMs)构建公司级基准数据集;SR-Delta则是一个差异分析程序框架,揭示潜在调整的见解。该框架能够实现可持续性评级方法的可扩展和可比评估。我们呼吁更广泛的人工智能社区采用人工智能驱动的方法,以加强和推进支持和执行紧迫可持续性议程的可持续性评级方法。
cs.AI / 38 / 2602.17107

Owen-based Semantics and Hierarchy-Aware Explanation (O-Shap)

基于Owen的语义和层次感知解释 (O-Shap)
Zhou, Xiangyu, Xiao, Chenhan, Weng, Yang
Abstract
Shapley value-based methods have become foundational in explainable artificial intelligence (XAI), offering theoretically grounded feature attributions through cooperative game theory. However, in practice, particularly in vision tasks, the assumption of feature independence breaks down, as features (i.e., pixels) often exhibit strong spatial and semantic dependencies. To address this, modern SHAP implementations now include the Owen value, a hierarchical generalization of the Shapley value that supports group attributions. While the Owen value preserves the foundations of Shapley values, its effectiveness critically depends on how feature groups are defined. We show that commonly used segmentations (e.g., axis-aligned or SLIC) violate key consistency properties, and propose a new segmentation approach that satisfies the $T$-property to ensure semantic alignment across hierarchy levels. This hierarchy enables computational pruning while improving attribution accuracy and interpretability. Experiments on image and tabular datasets demonstrate that O-Shap outperforms baseline SHAP variants in attribution precision, semantic coherence, and runtime efficiency, especially when structure matters.
Chinese Translation
基于Shapley值的方法已成为可解释人工智能(XAI)的基础,通过合作博弈论提供理论基础的特征归因。然而,在实际应用中,特别是在视觉任务中,特征独立性的假设往往会失效,因为特征(即像素)通常表现出强烈的空间和语义依赖性。为了解决这个问题,现代SHAP实现现在包括Owen值,这是Shapley值的层次化推广,支持群体归因。虽然Owen值保留了Shapley值的基础,但其有效性在很大程度上依赖于特征组的定义。我们展示了常用的分割方法(例如,轴对齐或SLIC)违反了关键的一致性属性,并提出了一种新的分割方法,该方法满足$T$-属性,以确保在层次级别之间的语义对齐。这个层次结构使得计算修剪成为可能,同时提高了归因的准确性和可解释性。在图像和表格数据集上的实验表明,O-Shap在归因精度、语义一致性和运行效率方面优于基线SHAP变体,尤其是在结构重要的情况下。
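The Owen value the abstract builds on is the two-level (grouped) generalization of the Shapley value. A brute-force reference implementation for a tiny game, averaging marginal contributions over all orderings in which each feature group stays contiguous, might look like this (exhaustive enumeration is only feasible for a handful of features):

```python
from itertools import permutations, product

def owen_values(players_by_group, v):
    """Exact Owen value: average marginal contribution over all player
    orderings in which each group's members appear contiguously, with
    group order and within-group order both uniform."""
    totals = {p: 0.0 for g in players_by_group for p in g}
    count = 0
    for g_order in permutations(range(len(players_by_group))):
        within = [permutations(players_by_group[g]) for g in g_order]
        for combo in product(*within):
            order = [p for grp in combo for p in grp]
            seen = frozenset()
            for p in order:
                totals[p] += v(seen | {p}) - v(seen)
                seen = seen | {p}
            count += 1
    return {p: t / count for p, t in totals.items()}
```

For a game where features "a" and "b" only pay off jointly, the Owen value splits the credit between them and gives the irrelevant feature zero, while still satisfying efficiency within the grouping.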
cs.AI / 39 / 2602.17111

Instructor-Aligned Knowledge Graphs for Personalized Learning

与教师对齐的知识图谱用于个性化学习
AlRabah, Abdulrahman, Kargupta, Priyanka, Han, Jiawei, Alawini, Abdussalam
Abstract
Mastering educational concepts requires understanding both their prerequisites (e.g., recursion before merge sort) and sub-concepts (e.g., merge sort as part of sorting algorithms). Capturing these dependencies is critical for identifying students' knowledge gaps and enabling targeted intervention for personalized learning. This is especially challenging in large-scale courses, where instructors cannot feasibly diagnose individual misunderstanding or determine which concepts need reinforcement. While knowledge graphs offer a natural representation for capturing these conceptual relationships at scale, existing approaches are either surface-level (focusing on course-level concepts like "Algorithms" or logistical relationships such as course enrollment), or disregard the rich pedagogical signals embedded in instructional materials. We propose InstructKG, a framework for automatically constructing instructor-aligned knowledge graphs that capture a course's intended learning progression. Given a course's lecture materials (slides, notes, etc.), InstructKG extracts significant concepts as nodes and infers learning dependencies as directed edges (e.g., "part-of" or "depends-on" relationships). The framework synergizes the rich temporal and semantic signals unique to educational materials (e.g., "recursion" is taught before "mergesort"; "recursion" is mentioned in the definition of "merge sort") with the generalizability of large language models. Through experiments on real-world, diverse lecture materials across multiple courses and human-based evaluation, we demonstrate that InstructKG captures rich, instructor-aligned learning progressions.
Chinese Translation
掌握教育概念需要理解其先决条件(例如,在归并排序之前理解递归)和子概念(例如,归并排序作为排序算法的一部分)。捕捉这些依赖关系对于识别学生的知识差距并实现针对性的个性化学习干预至关重要。这在大规模课程中尤其具有挑战性,因为教师无法有效地诊断个体的误解或确定哪些概念需要加强。虽然知识图谱提供了一种自然的方式来捕捉这些概念关系,但现有的方法要么停留在表面(关注课程级别的概念,如“算法”或课程注册等后勤关系),要么忽视了教学材料中蕴含的丰富教育信号。我们提出了InstructKG,一个自动构建与教师对齐的知识图谱的框架,旨在捕捉课程的预期学习进程。给定课程的讲义材料(幻灯片、笔记等),InstructKG提取重要概念作为节点,并推断学习依赖关系作为有向边(例如,“部分”或“依赖于”关系)。该框架将教育材料中独特的丰富时间和语义信号(例如,“递归”在“归并排序”之前教授;“递归”在“归并排序”的定义中提到)与大型语言模型的普适性相结合。通过对多个课程的真实、多样化讲义材料进行实验和基于人类的评估,我们证明了InstructKG能够捕捉丰富的、与教师对齐的学习进程。
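The temporal and semantic signals the abstract mentions ("taught before", "mentioned in the definition of") can be combined into dependency edges as in this highly simplified sketch; the real framework uses LLM-assisted inference, and the substring matching here is purely illustrative.

```python
def build_concept_graph(lectures, definitions):
    """lectures: ordered list of (lecture_id, [concepts introduced]).
    definitions: {concept: definition text}. Adds a 'depends-on' edge when
    an earlier-taught concept is mentioned in a later concept's definition
    (a crude stand-in for the paper's LLM-based relation inference)."""
    first_seen = {}
    for idx, (_, concepts) in enumerate(lectures):
        for c in concepts:
            first_seen.setdefault(c, idx)
    edges = set()
    for c, text in definitions.items():
        low = text.lower()
        for other in first_seen:
            if (other != c and other.lower() in low
                    and first_seen[other] <= first_seen.get(c, len(lectures))):
                edges.add((c, "depends-on", other))
    return edges
```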
cs.AI / 40 / 2602.17116

Epistemology of Generative AI: The Geometry of Knowing

生成性人工智能的认识论:知识的几何学
Levin, Ilya
Abstract
Generative AI presents an unprecedented challenge to our understanding of knowledge and its production. Unlike previous technological transformations, where engineering understanding preceded or accompanied deployment, generative AI operates through mechanisms whose epistemic character remains obscure, and without such understanding, its responsible integration into science, education, and institutional life cannot proceed on a principled basis. This paper argues that the missing account must begin with a paradigmatic break that has not yet received adequate philosophical attention. In the Turing-Shannon-von Neumann tradition, information enters the machine as encoded binary vectors, and semantics remains external to the process. Neural network architectures rupture this regime: symbolic input is instantly projected into a high-dimensional space where coordinates correspond to semantic parameters, transforming binary code into a position in a geometric space of meanings. It is this space that constitutes the active epistemic condition shaping generative production. Drawing on four structural properties of high-dimensional geometry (concentration of measure, near-orthogonality, exponential directional capacity, and manifold regularity), the paper develops an Indexical Epistemology of High-Dimensional Spaces. Building on Peirce's semiotics and Papert's constructionism, it reconceptualizes generative models as navigators of learned manifolds and proposes navigational knowledge as a third mode of knowledge production, distinct from both symbolic reasoning and statistical recombination.
Chinese Translation
生成性人工智能对我们对知识及其生产的理解提出了前所未有的挑战。与以往技术变革不同,在这些变革中,工程理解在部署之前或同时进行,生成性人工智能通过其认识论特征仍然模糊的机制运作,而没有这种理解,其在科学、教育和制度生活中的负责任整合无法在原则基础上进行。本文认为,缺失的论述必须从一个尚未得到充分哲学关注的范式突破开始。在图灵-香农-冯·诺依曼的传统中,信息以编码的二进制向量进入机器,语义则保持在过程之外。神经网络架构打破了这一机制:符号输入瞬间投影到一个高维空间,其中坐标对应于语义参数,将二进制代码转化为几何意义空间中的一个位置。正是这个空间构成了塑造生成性生产的主动认识论条件。本文借鉴高维几何的四个结构特性——测度集中性、近正交性、指数方向能力和流形规律性,发展了一种高维空间的指示性认识论。基于皮尔士的符号学和帕佩特的建构主义,本文重新概念化生成模型为学习流形的导航者,并提出导航知识作为第三种知识生产模式,与符号推理和统计重组相区别。
cs.AI / 41 / 2602.17130

Efficient Parallel Algorithm for Decomposing Hard CircuitSAT Instances

高效并行算法用于分解困难的 CircuitSAT 实例
Kondratiev, Victor, Gribanova, Irina, Semenov, Alexander
Abstract
We propose a novel parallel algorithm for decomposing hard CircuitSAT instances. The technique employs specialized constraints to partition an original SAT instance into a family of weakened formulas. Our approach is implemented as a parameterized parallel algorithm, where adjusting the parameters allows efficient identification of high-quality decompositions, guided by hardness estimations computed in parallel. We demonstrate the algorithm's practical efficacy on challenging CircuitSAT instances, including those encoding Logical Equivalence Checking of Boolean circuits and preimage attacks on cryptographic hash functions.
Chinese Translation
我们提出了一种新颖的并行算法,用于分解困难的 CircuitSAT 实例。该技术采用专门的约束条件,将原始 SAT 实例划分为一系列弱化的公式。我们的方法实现为一个参数化的并行算法,通过调整参数,可以高效地识别高质量的分解,这一过程由并行计算的难度估计引导。我们在具有挑战性的 CircuitSAT 实例上展示了该算法的实际有效性,包括那些编码布尔电路的逻辑等价性检查和对密码哈希函数的前像攻击的实例。
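Partitioning a SAT instance into a family of weakened formulas can be illustrated with a cube-style split over a few chosen variables, as below. The paper's specialized constraints and parallel hardness estimation are not reproduced here; this is only the generic splitting idea, with clauses in DIMACS-style signed-integer form.

```python
from itertools import product

def simplify(clauses, assignment):
    """Apply a partial assignment {var: bool} to CNF clauses; returns the
    simplified clause list, or None if some clause became empty
    (i.e., this branch is unsatisfiable)."""
    out = []
    for cl in clauses:
        new, satisfied = [], False
        for lit in cl:
            var, want = abs(lit), lit > 0
            if var in assignment:
                if assignment[var] == want:
                    satisfied = True
                    break
            else:
                new.append(lit)
        if satisfied:
            continue
        if not new:
            return None
        out.append(new)
    return out

def decompose(clauses, split_vars):
    """Partition the instance into one weakened sub-formula per assignment
    (cube) over split_vars; the original is SAT iff some sub-formula is SAT,
    and the sub-formulas can be attacked in parallel."""
    subs = []
    for bits in product([False, True], repeat=len(split_vars)):
        asg = dict(zip(split_vars, bits))
        sub = simplify(clauses, asg)
        if sub is not None:
            subs.append((asg, sub))
    return subs
```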
cs.AI / 42 / 2602.17145

Bonsai: A Framework for Convolutional Neural Network Acceleration Using Criterion-Based Pruning

Bonsai:一种基于准则的卷积神经网络加速框架
Bingham, Joseph, Helmich, Sam
Abstract
As the need for more accurate and powerful Convolutional Neural Networks (CNNs) increases, so too do the size, execution time, memory footprint, and power consumption. To overcome this, solutions such as pruning have been proposed, each with its own metrics and methodologies, or criteria, for how weights should be removed. These solutions do not share a common implementation and are difficult to implement and compare. In this work, we introduce Combine, a criterion-based pruning solution, demonstrate that it is a fast and effective framework for iterative pruning, show that criteria have differing effects on different models, create a standard language for comparing criterion functions, and propose a few novel criterion functions. We show the capacity of these criterion functions and the framework on VGG-inspired models, pruning up to 79\% of filters while retaining or improving accuracy, and reducing the computations needed by the network by up to 68\%.
Chinese Translation
随着对更准确和强大的卷积神经网络(CNN)的需求增加,其规模、执行时间、内存占用和功耗也随之上升。为了克服这一问题,提出了一些解决方案,如剪枝,这些方案各自具有不同的指标和方法论,或称准则,用于决定如何移除权重。这些解决方案并没有共享一个共同的实现方式,且难以实施和比较。在本研究中,我们介绍了Combine,一种基于准则的剪枝解决方案,并展示了它作为一种快速有效的迭代剪枝框架的能力,证明了不同的准则对不同模型的影响不同,创建了一个比较准则函数的标准语言,并提出了一些新颖的准则函数。我们展示了这些准则函数及其框架在VGG启发模型上的能力,剪枝高达79%的滤波器,同时保持或提高了准确性,并将网络所需的计算量减少了高达68%。
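A "criterion function" in this sense is just a callable that scores a filter, with pruning keeping the top-scoring fraction per layer. A minimal sketch (the L1/L2 criteria are classic examples; the paper's own novel criteria are not shown):

```python
def l1_criterion(f):
    """Score a filter (flat list of weights) by its L1 norm."""
    return sum(abs(w) for w in f)

def l2_criterion(f):
    """Score a filter by its L2 norm."""
    return sum(w * w for w in f) ** 0.5

def prune_filters(filters, criterion, keep_ratio):
    """Rank one layer's filters by the criterion and keep the top
    keep_ratio fraction; returns (kept_indices, pruned_indices)."""
    ranked = sorted(range(len(filters)),
                    key=lambda i: criterion(filters[i]), reverse=True)
    k = max(1, round(keep_ratio * len(filters)))
    return sorted(ranked[:k]), sorted(ranked[k:])
```

Because the criterion is a plain function, criteria can be swapped and compared on the same model under one framework, which is the standardization the abstract argues for.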
cs.AI / 43 / 2602.17162

JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures

JEPA-DNA:通过联合嵌入预测架构构建基因组基础模型
Larey, Ariel, Dahan, Elay, Bleiweiss, Amit, Kellerman, Raizy, Leib, Guy, Nayshool, Omri, Ofer, Dan, Zinger, Tal, Dominissini, Dan, Rechavi, Gideon, Bussola, Nicole, Lee, Simon, O'Connell, Shane, Hoang, Dung, Wirth, Marissa, Charney, Alexander W., Daniel, Nati, Shavit, Yoli
Abstract
Genomic Foundation Models (GFMs) have largely relied on Masked Language Modeling (MLM) or Next Token Prediction (NTP) to learn the language of life. While these paradigms excel at capturing local genomic syntax and fine-grained motif patterns, they often fail to capture the broader functional context, resulting in representations that lack a global biological perspective. We introduce JEPA-DNA, a novel pre-training framework that integrates the Joint-Embedding Predictive Architecture (JEPA) with traditional generative objectives. JEPA-DNA introduces latent grounding by coupling token-level recovery with a predictive objective in the latent space by supervising a CLS token. This forces the model to predict the high-level functional embeddings of masked genomic segments rather than focusing solely on individual nucleotides. JEPA-DNA extends both NTP and MLM paradigms and can be deployed either as a standalone from-scratch objective or as a continual pre-training enhancement for existing GFMs. Our evaluations across a diverse suite of genomic benchmarks demonstrate that JEPA-DNA consistently yields superior performance in supervised and zero-shot tasks compared to generative-only baselines. By providing a more robust and biologically grounded representation, JEPA-DNA offers a scalable path toward foundation models that understand not only the genomic alphabet, but also the underlying functional logic of the sequence.
Chinese Translation
基因组基础模型(GFMs)在很大程度上依赖于掩蔽语言建模(MLM)或下一个标记预测(NTP)来学习生命的语言。尽管这些范式在捕捉局部基因组语法和细粒度基序模式方面表现出色,但它们往往无法捕捉更广泛的功能上下文,导致生成的表示缺乏全球生物学视角。我们提出了JEPA-DNA,一种新颖的预训练框架,将联合嵌入预测架构(JEPA)与传统生成目标相结合。JEPA-DNA通过将标记级恢复与潜在空间中的预测目标结合,引入了潜在基础,这通过监督CLS标记来实现。这迫使模型预测掩蔽基因组片段的高级功能嵌入,而不仅仅关注单个核苷酸。JEPA-DNA扩展了NTP和MLM范式,可以作为独立的从零开始的目标或作为对现有GFMs的持续预训练增强进行部署。我们在多样化的基因组基准测试中的评估表明,JEPA-DNA在监督和零样本任务中始终优于仅生成基线。通过提供更强大且生物学上更为扎实的表示,JEPA-DNA为理解基因组字母及其序列背后的功能逻辑的基础模型提供了一条可扩展的路径。
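The hybrid objective (token-level recovery plus a latent predictive term on the CLS embedding) can be written schematically as below; the equal weighting and mean-squared latent loss are our assumptions, and real implementations operate on tensors rather than Python lists.

```python
import math

def cross_entropy(logits, target_idx):
    """Numerically stable cross-entropy for one masked-token prediction."""
    m = max(logits)
    logz = m + math.log(sum(math.exp(l - m) for l in logits))
    return logz - logits[target_idx]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def jepa_dna_loss(token_logits, token_targets, cls_pred, latent_target, lam=1.0):
    """Hybrid objective: MLM-style masked-token recovery plus a latent
    predictive term supervising the CLS embedding against the target
    encoder's embedding of the masked genomic segment."""
    recon = sum(cross_entropy(l, t)
                for l, t in zip(token_logits, token_targets)) / len(token_targets)
    latent = mse(cls_pred, latent_target)
    return recon + lam * latent
```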
cs.AI / 44 / 2602.17189

Texo: Formula Recognition within 20M Parameters

Texo:在2000万参数内的公式识别
Mao, Sicheng
Abstract
In this paper we present Texo, a minimalist yet high-performance formula recognition model that contains only 20 million parameters. By attentive design, distillation, and transfer of the vocabulary and the tokenizer, Texo achieves performance comparable to state-of-the-art models such as UniMERNet-T and PPFormulaNet-S, while reducing the model size by 80% and 65%, respectively. This enables real-time inference on consumer-grade hardware and even in-browser deployment. We also developed a web application to demonstrate the model's capabilities and facilitate its use by end users.
Chinese Translation
在本文中,我们提出了Texo,一种极简但高性能的公式识别模型,仅包含2000万个参数。通过精心设计、词汇和分词器的蒸馏与迁移,Texo在性能上与最先进的模型如UniMERNet-T和PPFormulaNet-S相媲美,同时分别减少了模型大小的80%和65%。这使得在消费级硬件上实现实时推理成为可能,甚至可以在浏览器中部署。我们还开发了一个Web应用程序,以展示模型的能力并方便最终用户的使用。
cs.AI / 45 / 2602.17217

Continual learning and refinement of causal models through dynamic predicate invention

通过动态谓词发明实现因果模型的持续学习与优化
Crespo-Fernandez, Enrique, Ray, Oliver, Filho, Telmo de Menezes e Silva, Flach, Peter
Abstract
Efficiently navigating complex environments requires agents to internalize the underlying logic of their world, yet standard world modelling methods often struggle with sample inefficiency, lack of transparency, and poor scalability. We propose a framework for constructing symbolic causal world models entirely online by integrating continuous model learning and repair into the agent's decision loop, by leveraging the power of Meta-Interpretive Learning and predicate invention to find semantically meaningful and reusable abstractions, allowing an agent to construct a hierarchy of disentangled, high-quality concepts from its observations. We demonstrate that our lifted inference approach scales to domains with complex relational dynamics, where propositional methods suffer from combinatorial explosion, while achieving sample-efficiency orders of magnitude higher than the established PPO neural-network-based baseline.
Chinese Translation
高效地在复杂环境中导航要求智能体内化其世界的基本逻辑,然而标准的世界建模方法往往在样本效率、透明度和可扩展性方面存在困难。我们提出了一种框架,通过将持续模型学习和修复集成到智能体的决策循环中,完全在线构建符号因果世界模型,利用元解释学习(Meta-Interpretive Learning)和谓词发明的力量,寻找语义上有意义且可重用的抽象,使智能体能够从其观察中构建一个解耦的高质量概念层次。我们证明了我们的提升推理方法能够扩展到具有复杂关系动态的领域,而在这些领域中,命题方法则遭遇组合爆炸,同时实现的样本效率比现有的基于PPO神经网络的基线高出几个数量级。
cs.AI / 46 / 2602.17221

From Labor to Collaboration: A Methodological Experiment Using AI Agents to Augment Research Perspectives in Taiwan's Humanities and Social Sciences

从劳动到协作:利用人工智能代理增强台湾人文与社会科学研究视角的方法实验
Huang, Yi-Chih
Abstract
Generative AI is reshaping knowledge work, yet existing research focuses predominantly on software engineering and the natural sciences, with limited methodological exploration for the humanities and social sciences. Positioned as a "methodological experiment," this study proposes an AI Agent-based collaborative research workflow (Agentic Workflow) for humanities and social science research. Taiwan's Claude.ai usage data (N = 7,729 conversations, November 2025) from the Anthropic Economic Index (AEI) serves as the empirical vehicle for validating the feasibility of this methodology. This study operates on two levels: the primary level is the design and validation of a methodological framework - a seven-stage modular workflow grounded in three principles: task modularization, human-AI division of labor, and verifiability, with each stage delineating clear roles for human researchers (research judgment and ethical decisions) and AI Agents (information retrieval and text generation); the secondary level is the empirical analysis of AEI Taiwan data - serving as an operational demonstration of the workflow's application to secondary data research, showcasing both the process and output quality (see Appendix A). This study contributes by proposing a replicable AI collaboration framework for humanities and social science researchers, and identifying three operational modes of human-AI collaboration - direct execution, iterative refinement, and human-led - through reflexive documentation of the operational process. This taxonomy reveals the irreplaceability of human judgment in research question formulation, theoretical interpretation, contextualized reasoning, and ethical reflection. Limitations including single-platform data, cross-sectional design, and AI reliability risks are acknowledged.
Chinese Translation
生成性人工智能正在重塑知识工作,但现有研究主要集中在软件工程和自然科学领域,对人文与社会科学的方法探索有限。本研究被定位为“方法实验”,提出了一种基于人工智能代理的协作研究工作流程(Agentic Workflow),旨在促进人文与社会科学研究。台湾的Claude.ai使用数据(N = 7,729次对话,2025年11月)作为验证该方法可行性的实证依据。本研究从两个层面展开:第一层面是方法框架的设计与验证——一个基于任务模块化、人机分工和可验证性三项原则的七阶段模块化工作流程,每个阶段明确划分人类研究者(研究判断和伦理决策)与人工智能代理(信息检索和文本生成)的角色;第二层面是对AEI台湾数据的实证分析——作为该工作流程在二次数据研究中的应用操作示范,展示了过程和输出质量(见附录A)。本研究的贡献在于为人文与社会科学研究者提出了一个可复制的人工智能协作框架,并通过对操作过程的反思性文档,识别出人机协作的三种操作模式——直接执行、迭代优化和人主导。这一分类法揭示了人类判断在研究问题制定、理论解释、情境推理和伦理反思中的不可替代性。同时,研究也承认了单一平台数据、横断面设计和人工智能可靠性风险等局限性。
cs.AI / 47 / 2602.17222

Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight

解码人类因素:高保真行为预测用于战略前瞻
Yellin, Ben, Ezra, Ehud, Foreman, Mark, Grinapol, Shula
Abstract
Predicting human decision-making in high-stakes environments remains a central challenge for artificial intelligence. While large language models (LLMs) demonstrate strong general reasoning, they often struggle to generate consistent, individual-specific behavior, particularly when accurate prediction depends on complex interactions between psychological traits and situational constraints. Prompting-based approaches can be brittle in this setting, exhibiting identity drift and limited ability to leverage increasingly detailed persona descriptions. To address these limitations, we introduce the Large Behavioral Model (LBM), a behavioral foundation model fine-tuned to predict individual strategic choices with high fidelity. LBM shifts from transient persona prompting to behavioral embedding by conditioning on a structured, high-dimensional trait profile derived from a comprehensive psychometric battery. Trained on a proprietary dataset linking stable dispositions, motivational states, and situational constraints to observed choices, LBM learns to map rich psychological profiles to discrete actions across diverse strategic dilemmas. In a held-out scenario evaluation, LBM fine-tuning improves behavioral prediction relative to the unadapted Llama-3.1-8B-Instruct backbone and performs comparably to frontier baselines when conditioned on Big Five traits. Moreover, we find that while prompting-based baselines exhibit a complexity ceiling, LBM continues to benefit from increasingly dense trait profiles, with performance improving as additional trait dimensions are provided. Together, these results establish LBM as a scalable approach for high-fidelity behavioral simulation, enabling applications in strategic foresight, negotiation analysis, cognitive security, and decision support.
Chinese Translation
在高风险环境中预测人类决策仍然是人工智能面临的一个核心挑战。尽管大型语言模型(LLMs)展示了强大的通用推理能力,但它们在生成一致的、个体特定的行为方面常常表现不佳,特别是当准确预测依赖于心理特征与情境约束之间复杂的相互作用时。在这种情况下,基于提示的方法可能显得脆弱,表现出身份漂移和有限的利用越来越详细的人物描述的能力。为了解决这些局限性,我们引入了大型行为模型(Large Behavioral Model, LBM),这是一个经过微调的行为基础模型,旨在高保真度地预测个体的战略选择。LBM 从瞬时的人物提示转变为行为嵌入,通过基于来自全面心理测量电池的结构化高维特征档案进行条件化。LBM 在一个专有数据集上进行训练,该数据集将稳定的性格倾向、动机状态和情境约束与观察到的选择联系起来,从而学习将丰富的心理特征映射到不同战略困境中的离散行动。在一个保留场景评估中,LBM 的微调相较于未适应的 Llama-3.1-8B-Instruct 主干改善了行为预测,并在基于大五特质条件下的表现与前沿基线相当。此外,我们发现,尽管基于提示的基线模型表现出复杂性上限,LBM 仍然能够从越来越密集的特征档案中受益,随着提供的额外特征维度的增加,性能不断提升。这些结果共同确立了 LBM 作为一种可扩展的高保真行为模拟方法,为战略前瞻、谈判分析、认知安全和决策支持等应用提供了可能。
cs.AI / 48 / 2602.17229

Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy

通过使用布鲁姆分类法的线性探测,揭示大型语言模型中认知复杂性的机制可解释性
Raimondi, Bianca, Gabbrielli, Maurizio
Abstract
The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of cognitive complexity using Bloom's Taxonomy as a hierarchical lens. By analyzing high-dimensional activation vectors from different LLMs, we probe whether different cognitive levels, ranging from basic recall (Remember) to abstract synthesis (Create), are linearly separable within the model's residual streams. Our results demonstrate that linear classifiers achieve approximately 95% mean accuracy across all Bloom levels, providing strong evidence that cognitive level is encoded in a linearly accessible subspace of the model's representations. These findings provide evidence that the model resolves the cognitive difficulty of a prompt early in the forward pass, with representations becoming increasingly separable across layers.
Chinese Translation
大型语言模型的黑箱特性需要新的评估框架,以超越表面层次的性能指标。本研究利用布鲁姆分类法作为层次化的视角,探讨认知复杂性的内部神经表征。通过分析来自不同大型语言模型的高维激活向量,我们探讨了从基本回忆(Remember)到抽象综合(Create)等不同认知水平是否在模型的残差流中线性可分。我们的结果表明,线性分类器在所有布鲁姆水平上实现了约95%的平均准确率,强有力地证明了认知水平编码在模型表征的线性可访问子空间中。这些发现提供了证据,表明模型在前向传播的早期阶段就解决了提示的认知难度,并且随着层次的增加,表征变得越来越可分。
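A linear probe of the kind described is just a linear classifier trained on frozen activation vectors. A self-contained sketch on synthetic "activations" (binary here for brevity; the paper probes all six Bloom levels on real residual-stream activations):

```python
import math
import random

def train_linear_probe(xs, ys, lr=0.3, epochs=100):
    """Logistic-regression probe: learns w, b so that sigmoid(w.x + b)
    separates activation vectors of two cognitive levels."""
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y
            for i in range(d):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, xs, ys):
    correct = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += (z > 0) == (y == 1)
    return correct / len(xs)
```

High probe accuracy on held-out activations is the evidence the abstract cites for cognitive level being encoded in a linearly accessible subspace.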
cs.AI / 49 / 2602.17234

All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting

所有泄漏都重要,但有些更为重要:可解释的LLM回测中的时间污染检测
Zhang, Zeyu, Chen, Ryan, Stadie, Bradly C.
Abstract
To evaluate whether LLMs can accurately predict future events, we need the ability to \textit{backtest} them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this \emph{temporal knowledge leakage}. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies \textit{Shapley values} to measure each claim's contribution to the prediction. This yields the \textbf{Shapley}-weighted \textbf{D}ecision-\textbf{C}ritical \textbf{L}eakage \textbf{R}ate (\textbf{Shapley-DCLR}), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose \textbf{Time}-\textbf{S}upervised \textbf{P}rediction with \textbf{E}xtracted \textbf{C}laims (\textbf{TimeSPEC}), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination -- producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
Chinese Translation
为了评估大型语言模型(LLMs)是否能够准确预测未来事件,我们需要能够对已经发生的事件进行 \textit{回测}。这要求模型仅依据在特定过去日期可用的信息进行推理。然而,LLMs可能在训练过程中无意中泄露截止日期之后的知识,从而削弱回顾性评估的有效性。我们提出了一种基于声明的框架,用于检测和量化这种 \textit{时间知识泄漏}。我们的方法将模型的推理分解为原子声明,并根据时间可验证性对其进行分类,然后应用 \textit{Shapley值}来衡量每个声明对预测的贡献。这产生了 \textbf{Shapley}-加权的 \textbf{D}ecision-\textbf{C}ritical \textbf{L}eakage \textbf{R}ate(\textbf{Shapley-DCLR}),这是一个可解释的指标,捕捉决策驱动推理中来自泄漏信息的比例。在此框架基础上,我们提出了 \textbf{Time}-\textbf{S}upervised \textbf{P}rediction with \textbf{E}xtracted \textbf{C}laims(\textbf{TimeSPEC}),该方法将生成与声明验证和再生成交替进行,以主动过滤时间污染——生成的预测中每个支持声明都可以追溯到截止日期之前可用的来源。对350个实例的实验,涵盖美国最高法院案件预测、NBA薪资估算和股票回报排名,揭示了标准提示基线中的显著泄漏。TimeSPEC在保持任务性能的同时减少了Shapley-DCLR,证明了显式的、可解释的声明级验证优于基于提示的时间约束,从而实现可靠的回测。
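The Shapley-weighted leakage rate can be illustrated on a toy claim set: compute each atomic claim's Shapley contribution to a prediction-quality value function, then take the leaked claims' share of total attribution. The exact-enumeration Shapley and the normalization by total absolute attribution are our sketch, not necessarily the paper's estimator.

```python
from itertools import permutations

def shapley_values(claims, value):
    """Exact Shapley value of each claim under a value function
    v(frozenset of claims) -> float (feasible only for small claim sets)."""
    totals = {c: 0.0 for c in claims}
    perms = list(permutations(claims))
    for order in perms:
        seen = frozenset()
        for c in order:
            totals[c] += value(seen | {c}) - value(seen)
            seen = seen | {c}
    return {c: t / len(perms) for c, t in totals.items()}

def shapley_dclr(claims, leaked, value):
    """Fraction of decision-driving attribution carried by leaked claims."""
    phi = shapley_values(claims, value)
    total = sum(abs(v) for v in phi.values()) or 1.0
    return sum(abs(phi[c]) for c in claims if c in leaked) / total
```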
cs.AI / 50 / 2602.17245

Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web

网络动词:在自主网络上可靠任务组合的类型化抽象
Jiang, Linxi, Xi, Rui, Liu, Zhijie, Chen, Shuo, Lin, Zhiqiang, Nath, Suman
Abstract
The Web is evolving from a medium that humans browse to an environment where software agents act on behalf of users. Advances in large language models (LLMs) make natural language a practical interface for goal-directed tasks, yet most current web agents operate on low-level primitives such as clicks and keystrokes. These operations are brittle, inefficient, and difficult to verify. Complementing content-oriented efforts such as NLWeb's semantic layer for retrieval, we argue that the agentic web also requires a semantic layer for web actions. We propose \textbf{Web Verbs}, a web-scale set of typed, semantically documented functions that expose site capabilities through a uniform interface, whether implemented through APIs or robust client-side workflows. These verbs serve as stable and composable units that agents can discover, select, and synthesize into concise programs. This abstraction unifies API-based and browser-based paradigms, enabling LLMs to synthesize reliable and auditable workflows with explicit control and data flow. Verbs can carry preconditions, postconditions, policy tags, and logging support, which improves \textbf{reliability} by providing stable interfaces, \textbf{efficiency} by reducing dozens of steps into a few function calls, and \textbf{verifiability} through typed contracts and checkable traces. We present our vision, a proof-of-concept implementation, and representative case studies that demonstrate concise and robust execution compared to existing agents. Finally, we outline a roadmap for standardization to make verbs deployable and trustworthy at web scale.
Chinese Translation
网络正在从一个人类浏览的媒介演变为一个软件代理代表用户行动的环境。大型语言模型(LLMs)的进展使自然语言成为目标导向任务的实用接口,然而目前大多数网络代理仍然基于低级原语如点击和键击。这些操作脆弱、低效且难以验证。为了补充以内容为导向的努力,例如 NLWeb 的检索语义层,我们认为自主网络同样需要一个针对网络操作的语义层。我们提出了 \textbf{网络动词},这是一个大规模的类型化、语义文档化的函数集合,通过统一接口暴露网站功能,无论是通过 API 还是强大的客户端工作流实现。这些动词作为稳定且可组合的单元,供代理发现、选择并合成成简洁的程序。这种抽象统一了基于 API 和基于浏览器的范式,使 LLM 能够合成可靠且可审计的工作流,具有明确的控制和数据流。动词可以携带前置条件、后置条件、政策标签和日志支持,这通过提供稳定接口提高了 \textbf{可靠性},通过将数十个步骤简化为少数函数调用提高了 \textbf{效率},并通过类型化合同和可检查的追踪提高了 \textbf{可验证性}。我们展示了我们的愿景、一个概念验证实现以及代表性的案例研究,证明了与现有代理相比的简洁和强健执行。最后,我们概述了一个标准化路线图,以使动词在网络规模上可部署和可信。
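A verb carrying preconditions, postconditions, policy tags, and a checkable trace might look like the following sketch. The field names and call protocol are our invention for illustration, not a schema from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class WebVerb:
    """A typed, semantically documented site capability with contracts."""
    name: str
    doc: str
    policy_tags: tuple
    pre: Callable[[dict], bool]       # precondition on the arguments
    post: Callable[[dict, Any], bool]  # postcondition on args + result
    impl: Callable[[dict], Any]        # API call or client-side workflow
    log: list = field(default_factory=list)

    def __call__(self, args: dict) -> Any:
        if not self.pre(args):
            raise ValueError(f"{self.name}: precondition failed for {args}")
        result = self.impl(args)
        if not self.post(args, result):
            raise ValueError(f"{self.name}: postcondition failed")
        self.log.append((self.name, args, result))  # checkable trace
        return result
```

An agent composing such verbs gets contract checks and an audit log for free, which is how the abstract's reliability and verifiability claims cash out.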
cs.AI / 51 / 2602.17288

ArXiv-to-Model: A Practical Study of Scientific LM Training

ArXiv到模型:科学语言模型训练的实践研究
Gupta, Anuj
Abstract
While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.
Chinese Translation
尽管前沿的大型语言模型展示了强大的推理和数学能力,但从原始来源训练领域专用科学语言模型的实际过程仍然缺乏详细文献。在本研究中,我们呈现了一个详细的案例研究,直接从涵盖数学、计算机科学和理论物理的原始arXiv LaTeX源中训练一个参数为1.36B的科学语言模型。我们描述了一个端到端的流程,涵盖元数据过滤、档案验证、LaTeX提取、文本规范化、领域感知的分词以及在受限计算条件下(2xA100 GPU)的密集变换器训练。通过24次实验运行,我们分析了训练稳定性、扩展行为、数据产出损失和基础设施瓶颈。我们的研究结果强调了预处理决策如何显著影响可用的标记量,分词如何影响符号稳定性,以及存储和I/O限制如何与计算能力相媲美,成为限制因素。我们进一步分析了收敛动态,并展示了在数据丰富的环境中(52B预训练标记)稳定的训练行为。本研究并未提出一种新颖的架构,而是提供了一个基于工程的、透明的从零开始训练小型科学语言模型的过程。我们希望这些见解能支持在适度计算预算下运作的研究人员,帮助他们构建领域专用模型。
cs.AI / 52 / 2602.17308

MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions

MedClarify:一种用于医疗诊断的信息寻求AI代理,具备案例特定的后续问题
Wong, Hui Min, Heesen, Philip, Janetzky, Pascal, Bendszus, Martin, Feuerriegel, Stefan
Abstract
Large language models (LLMs) are increasingly used for diagnostic tasks in medicine. In clinical practice, the correct diagnosis can rarely be immediately inferred from the initial patient presentation alone. Rather, reaching a diagnosis often involves systematic history taking, during which clinicians reason over multiple potential conditions through iterative questioning to resolve uncertainty. This process requires considering differential diagnoses and actively excluding emergencies that demand immediate intervention. Yet, the ability of medical LLMs to generate informative follow-up questions and thus reason over differential diagnoses remains underexplored. Here, we introduce MedClarify, an AI agent for information-seeking that can generate follow-up questions for iterative reasoning to support diagnostic decision-making. Specifically, MedClarify computes a list of candidate diagnoses analogous to a differential diagnosis, and then proactively generates follow-up questions aimed at reducing diagnostic uncertainty. By selecting the question with the highest expected information gain, MedClarify enables targeted, uncertainty-aware reasoning to improve diagnostic performance. In our experiments, we first demonstrate the limitations of current LLMs in medical reasoning, which often yield multiple, similarly likely diagnoses, especially when patient cases are incomplete or relevant information for diagnosis is missing. We then show that our information-theoretic reasoning approach can generate effective follow-up questioning and thereby reduces diagnostic errors by ~27 percentage points (p.p.) compared to a standard single-shot LLM baseline. Altogether, MedClarify offers a path to improve medical LLMs through agentic information-seeking and to thus promote effective dialogues with medical LLMs that reflect the iterative and uncertain nature of real-world clinical reasoning.
Chinese Translation
大型语言模型(LLMs)在医学诊断任务中的应用日益增多。在临床实践中,正确的诊断通常无法仅通过初始患者表现立即推断得出。相反,达成诊断通常涉及系统的病史采集,在此过程中,临床医生通过迭代提问对多种潜在病症进行推理,以消除不确定性。这个过程需要考虑鉴别诊断,并积极排除需要立即干预的紧急情况。然而,医疗LLMs生成信息丰富的后续问题以进行鉴别诊断推理的能力仍然未得到充分探索。在此,我们介绍MedClarify,一种信息寻求的AI代理,能够生成后续问题以支持诊断决策的迭代推理。具体而言,MedClarify计算出一系列候选诊断,类似于鉴别诊断,然后主动生成旨在减少诊断不确定性的后续问题。通过选择预期信息增益最高的问题,MedClarify实现了针对性、关注不确定性的推理,从而提高诊断性能。在我们的实验中,我们首先展示了当前LLMs在医学推理中的局限性,尤其是在患者案例不完整或缺乏相关诊断信息时,往往会产生多个同样可能的诊断。然后,我们展示了我们的信息论推理方法能够生成有效的后续问题,从而将诊断错误率降低约27个百分点(p.p.),与标准的单次LLM基线相比。总之,MedClarify为通过代理信息寻求改善医疗LLMs提供了一条途径,从而促进与医疗LLMs的有效对话,反映现实临床推理的迭代和不确定性特征。
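The uncertainty-driven question selection described in the abstract — pick the follow-up question with the highest expected information gain over the candidate diagnoses — can be sketched with a toy computation. This is a minimal illustration, not MedClarify's implementation; the diagnoses, questions, and probabilities below are hypothetical.

```python
import math

def entropy(p):
    # Shannon entropy (in nats) of a probability distribution over diagnoses
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def expected_information_gain(prior, question):
    # question maps each possible patient answer to a pair:
    # (probability of that answer, posterior over diagnoses given that answer)
    h_prior = entropy(prior)
    h_posterior = sum(p_ans * entropy(post) for p_ans, post in question.values())
    return h_prior - h_posterior

def select_question(prior, questions):
    # choose the follow-up question with the highest expected information gain
    return max(questions, key=lambda name: expected_information_gain(prior, questions[name]))

# Hypothetical two-diagnosis case with one informative and one uninformative question
prior = {"influenza": 0.5, "covid": 0.5}
questions = {
    "Do you have a fever?": {
        "yes": (0.5, {"influenza": 0.9, "covid": 0.1}),
        "no": (0.5, {"influenza": 0.1, "covid": 0.9}),
    },
    "What is your shoe size?": {  # posterior equals prior: zero expected gain
        "small": (0.5, dict(prior)),
        "large": (0.5, dict(prior)),
    },
}
```

The informative question is selected because answering it sharply reduces entropy over the candidate diagnoses, mirroring how iterative history taking resolves diagnostic uncertainty.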
cs.AI / 53 / 2602.17385

Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature

通过克罗内克因子近似曲率实现无数据权重解耦的任务算术
Porrello, Angelo, Buzzega, Pietro, Dangel, Felix, Sommariva, Thomas, Salami, Riccardo, Bonicelli, Lorenzo, Calderara, Simone
Abstract
Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
Chinese Translation
任务算术提供了一种模块化、可扩展的方式来适应基础模型。然而,组合多个任务向量可能导致跨任务干扰,从而引起表示漂移和性能下降。表示漂移正则化为解耦任务向量提供了一种自然的解决方案;然而,现有方法通常需要外部任务数据,这与模块化和数据可用性约束(例如隐私要求)相冲突。我们提出了一种无数据的方法,将针对表示漂移的正则化表述为曲率矩阵近似问题。这使我们能够利用成熟的技术;特别是,我们采用克罗内克因子近似曲率,获得了一种实用的正则化器,在任务加法和任务取反中实现了最先进的结果。我们的方法在任务数量上具有恒定复杂度,并增强了对任务向量重缩放的鲁棒性,消除了对保留数据调优的需求。
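The task-arithmetic operations this paper builds on — task addition and task negation over task vectors — can be sketched as follows. Note this shows only the vanilla operation (theta = theta_0 + alpha * sum of tau_t), not the paper's KFAC-based drift regularizer; parameter names are illustrative.

```python
import numpy as np

def task_vector(finetuned, base):
    # tau_t = theta_t - theta_0, one entry per parameter tensor
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, task_vectors, alpha=1.0):
    # Task addition: theta = theta_0 + alpha * sum_t tau_t.
    # Task negation ("unlearning" a task) uses the same formula with a negative alpha.
    merged = {k: v.astype(float).copy() for k, v in base.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += alpha * tau[k]
    return merged
```

Cross-task interference arises exactly here: the summed tau_t vectors can overlap destructively, which is what representation-drift regularization aims to prevent.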
cs.AI / 54 / 2602.17386

Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval

视觉模型检查:基于图的视觉例程推理用于图像检索
Molina, Adrià, Terrades, Oriol Ramos, Lladós, Josep
Abstract
Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.
Chinese Translation
信息检索是现代数字产业的基础。尽管自然语言搜索在近年来取得了显著进展,这主要得益于基于嵌入的模型和大规模预训练,但该领域仍面临重大挑战。具体而言,涉及复杂关系、对象组合或精确约束(如身份、数量和比例)的查询在当前框架中往往仍未得到解决或不可靠。本文提出了一种新颖的框架,通过将基于图的验证方法与神经代码生成协同结合,将形式验证整合到基于深度学习的图像检索中。我们的方法旨在支持开放词汇的自然语言查询,同时生成既可信又可验证的结果。通过将检索结果建立在形式推理系统之上,我们超越了向量表示中常见的模糊性和近似性。我们的框架并非将不确定性视为既定事实,而是明确将用户查询中的每个原子真理与检索内容进行验证。这使我们不仅能够返回匹配结果,还能识别并标记哪些具体约束得到了满足,哪些仍未满足,从而提供更透明和可问责的检索过程,同时提升最流行的基于嵌入的方法的结果。
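The "verify each atomic truth" idea can be illustrated with a toy checker over a scene-graph-like structure: each atomic constraint in the query is marked satisfied or unmet instead of folded into one opaque similarity score. The constraint encoding below is a hypothetical sketch, not the paper's formalism.

```python
def check_constraints(scene, constraints):
    # scene: {"objects": [label, ...], "relations": [(subj, rel, obj), ...]}
    # constraints: atoms of the form ("count", label, n)
    #          or ("relation", subj, rel, obj)
    # Returns a per-atom verdict, so unmet constraints can be marked explicitly.
    results = {}
    for c in constraints:
        if c[0] == "count":
            _, label, n = c
            results[c] = sum(obj == label for obj in scene["objects"]) == n
        elif c[0] == "relation":
            results[c] = tuple(c[1:]) in set(scene["relations"])
    return results
```

A retrieved image then "model-checks" against the query: it is returned with a transparent record of which atoms hold rather than a single relevance number.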
cs.AI / 55 / 2602.17402

A Contrastive Variational AutoEncoder for NSCLC Survival Prediction with Missing Modalities

一种对比变分自编码器用于缺失模态的非小细胞肺癌生存预测
Zanitti, Michele, Miskovic, Vanja, Trovò, Francesco, Pedrocchi, Alessandra Laura Giulia, Shen, Ming, Tun, Yan Kyaw, Prelaj, Arsela, Kosta, Sokol
Abstract
Predicting survival outcomes for non-small cell lung cancer (NSCLC) patients is challenging due to the different individual prognostic features. This task can benefit from the integration of whole-slide images, bulk transcriptomics, and DNA methylation, which offer complementary views of the patient's condition at diagnosis. However, real-world clinical datasets are often incomplete, with entire modalities missing for a significant fraction of patients. State-of-the-art models rely on available data to create patient-level representations or use generative models to infer missing modalities, but they lack robustness in cases of severe missingness. We propose a Multimodal Contrastive Variational AutoEncoder (MCVAE) to address this issue: modality-specific variational encoders capture the uncertainty in each data source, and a fusion bottleneck with learned gating mechanisms is introduced to normalize the contributions from present modalities. We propose a multi-task objective that combines survival loss and reconstruction loss to regularize patient representations, along with a cross-modal contrastive loss that enforces cross-modal alignment in the latent space. During training, we apply stochastic modality masking to improve the robustness to arbitrary missingness patterns. Extensive evaluations on the TCGA-LUAD (n=475) and TCGA-LUSC (n=446) datasets demonstrate the efficacy of our approach in predicting disease-specific survival (DSS) and its robustness to severe missingness scenarios compared to two state-of-the-art models. Finally, we bring some clarifications on multimodal integration by testing our model on all subsets of modalities, finding that integration is not always beneficial to the task.
Chinese Translation
预测非小细胞肺癌(NSCLC)患者的生存结果具有挑战性,因为不同个体的预后特征各异。该任务可以通过整张切片图像、整体转录组学和DNA甲基化的整合来受益,这些数据在诊断时提供了对患者状况的互补视角。然而,现实世界的临床数据集往往是不完整的,许多患者缺失整个模态。最先进的模型依赖于可用数据来创建患者级别的表示,或使用生成模型来推断缺失模态,但在严重缺失的情况下缺乏鲁棒性。我们提出了一种多模态对比变分自编码器(MCVAE)来解决这一问题:模态特定的变分编码器捕捉每个数据源的不确定性,并引入具有学习门控机制的融合瓶颈,以规范来自现有模态的贡献。我们提出了一种多任务目标,结合生存损失和重建损失来正则化患者表示,同时引入交叉模态对比损失,以强制潜在空间中的跨模态对齐。在训练过程中,我们应用随机模态掩蔽以提高对任意缺失模式的鲁棒性。在TCGA-LUAD(n=475)和TCGA-LUSC(n=446)数据集上的广泛评估证明了我们的方法在预测疾病特异性生存(DSS)方面的有效性,并且在严重缺失场景下相较于两种最先进的模型表现出更强的鲁棒性。最后,我们通过在所有模态子集上测试我们的模型,澄清了多模态整合的某些问题,发现整合并不总是对任务有利。
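The stochastic modality masking applied during MCVAE training can be sketched as below. This is a minimal illustration: the modality names and the "keep at least one" policy are assumptions, not details given by the abstract.

```python
import random

def mask_modalities(sample, p_drop=0.3, rng=None):
    # Stochastic modality masking: each available modality is dropped
    # independently with probability p_drop during training, but at least
    # one modality is always kept so the sample remains usable. Modalities
    # that are already missing (None) stay missing.
    rng = rng or random.Random()
    masked = {m: (None if x is not None and rng.random() < p_drop else x)
              for m, x in sample.items()}
    if all(v is None for v in masked.values()):
        keep = rng.choice([m for m, x in sample.items() if x is not None])
        masked[keep] = sample[keep]
    return masked
```

Training against such randomly masked views is what pushes the fusion bottleneck to stay robust to arbitrary missingness patterns at test time.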
cs.AI / 56 / 2602.17418

A Privacy by Design Framework for Large Language Model-Based Applications for Children

面向儿童的大型语言模型应用的隐私设计框架
Addae, Diana, Rogachova, Diana, Kahani, Nafiseh, Barati, Masoud, Christensen, Michael, Zhou, Chen
Abstract
Children are increasingly using technologies powered by Artificial Intelligence (AI). However, there are growing concerns about privacy risks, particularly for children. Although existing privacy regulations require companies and organizations to implement protections, doing so can be challenging in practice. To address this challenge, this article proposes a framework based on Privacy-by-Design (PbD), which guides designers and developers to take on a proactive and risk-averse approach to technology design. Our framework includes principles from several privacy regulations, such as the General Data Protection Regulation (GDPR) from the European Union, the Personal Information Protection and Electronic Documents Act (PIPEDA) from Canada, and the Children's Online Privacy Protection Act (COPPA) from the United States. We map these principles to various stages of applications that use Large Language Models (LLMs), including data collection, model training, operational monitoring, and ongoing validation. For each stage, we discuss the operational controls found in the recent academic literature to help AI service providers and developers reduce privacy risks while meeting legal standards. In addition, the framework includes design guidelines for children, drawing from the United Nations Convention on the Rights of the Child (UNCRC), the UK's Age-Appropriate Design Code (AADC), and recent academic research. To demonstrate how this framework can be applied in practice, we present a case study of an LLM-based educational tutor for children under 13. Through our analysis and the case study, we show that by using data protection strategies such as technical and organizational controls and making age-appropriate design decisions throughout the LLM life cycle, we can support the development of AI applications for children that provide privacy protections and comply with legal requirements.
Chinese Translation
儿童越来越多地使用由人工智能(AI)驱动的技术。然而,关于隐私风险的担忧日益增加,尤其是针对儿童。尽管现有的隐私法规要求公司和组织实施保护措施,但在实践中做到这一点可能面临挑战。为了解决这一挑战,本文提出了一个基于隐私设计(Privacy-by-Design, PbD)的框架,旨在引导设计师和开发者采取主动和规避风险的技术设计方法。我们的框架包括来自多个隐私法规的原则,例如欧盟的通用数据保护条例(General Data Protection Regulation, GDPR)、加拿大的个人信息保护和电子文档法(Personal Information Protection and Electronic Documents Act, PIPEDA)以及美国的儿童在线隐私保护法(Children's Online Privacy Protection Act, COPPA)。我们将这些原则映射到使用大型语言模型(Large Language Models, LLMs)的应用程序的各个阶段,包括数据收集、模型训练、操作监控和持续验证。在每个阶段,我们讨论了近期学术文献中发现的操作控制,以帮助人工智能服务提供商和开发者在满足法律标准的同时降低隐私风险。此外,该框架还包括针对儿童的设计指南,借鉴了《联合国儿童权利公约》(United Nations Convention on the Rights of the Child, UNCRC)、英国的适龄设计规范(Age-Appropriate Design Code, AADC)以及近期的学术研究。为了展示该框架在实践中的应用,我们呈现了一个针对13岁以下儿童的基于LLM的教育辅导案例研究。通过我们的分析和案例研究,我们表明,通过在LLM生命周期中使用数据保护策略,如技术和组织控制,并做出适龄设计决策,我们可以支持开发提供隐私保护并符合法律要求的儿童人工智能应用。
cs.AI / 57 / 2602.17442

WarpRec: Unifying Academic Rigor and Industrial Scale for Responsible, Reproducible, and Efficient Recommendation

WarpRec:统一学术严谨性与工业规模,实现负责任、可重复和高效的推荐系统
Avolio, Marco, Aghilar, Potito, Roccotelli, Sabino, Anelli, Vito Walter, Mallamaci, Chiara, Paparella, Vincenzo, Valentini, Marco, Bellogín, Alejandro, Trizio, Michelantonio, Trotta, Joseph, Ferrara, Antonio, Di Noia, Tommaso
Abstract
Innovation in Recommender Systems is currently impeded by a fractured ecosystem, where researchers must choose between the ease of in-memory experimentation and the costly, complex rewriting required for distributed industrial engines. To bridge this gap, we present WarpRec, a high-performance framework that eliminates this trade-off through a novel, backend-agnostic architecture. It includes 50+ state-of-the-art algorithms, 40 metrics, and 19 filtering and splitting strategies that seamlessly transition from local execution to distributed training and optimization. The framework enforces ecological responsibility by integrating CodeCarbon for real-time energy tracking, showing that scalability need not come at the cost of scientific integrity or sustainability. Furthermore, WarpRec anticipates the shift toward Agentic AI, leading Recommender Systems to evolve from static ranking engines into interactive tools within the Generative AI ecosystem. In summary, WarpRec not only bridges the gap between academia and industry but also can serve as the architectural backbone for the next generation of sustainable, agent-ready Recommender Systems. Code is available at https://github.com/sisinflab/warprec/
Chinese Translation
当前,推荐系统的创新受到生态系统碎片化的阻碍,研究人员必须在内存实验的便利性与分布式工业引擎所需的高成本、复杂重写之间做出选择。为了解决这一问题,我们提出了WarpRec,一个高性能框架,通过一种新颖的后端无关架构消除了这一权衡。该框架包含50多种最先进的算法、40种指标和19种过滤与拆分策略,能够无缝地从本地执行过渡到分布式训练和优化。该框架通过集成CodeCarbon实现实时能源追踪,强化了生态责任,表明可扩展性不必以科学诚信或可持续性为代价。此外,WarpRec预见到向智能代理人工智能(Agentic AI)的转变,引导推荐系统从静态排名引擎演变为生成式人工智能生态系统中的互动工具。总之,WarpRec不仅弥合了学术界与工业界之间的差距,还可以作为下一代可持续、适应智能代理的推荐系统的架构支撑。代码可在 https://github.com/sisinflab/warprec/ 获取。
cs.AI / 58 / 2602.17508

Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems

基于帕累托最优的人工智能模型在ARM Cortex处理器上的基准测试:面向可持续嵌入式系统
Jain, Pranay, Kasper, Maximilian, Köber, Göran, Plinge, Axel, Seuß, Dominik
Abstract
This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluation across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a near-linear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in real-world applications.
Chinese Translation
本研究提出了一种实用的基准测试框架,用于优化在ARM Cortex处理器(M0+、M4、M7)上的人工智能(AI)模型,重点关注嵌入式系统中的能效、准确性和资源利用率。通过设计自动化测试平台,我们提供了一种系统的方法来评估关键性能指标(KPI),并识别处理器与AI模型的最佳组合。研究强调了浮点运算(FLOPs)与推理时间之间的近线性相关性,提供了一个可靠的度量标准来估算计算需求。通过帕累托分析,我们展示了如何在能耗与模型准确性之间进行权衡,确保AI应用在不妥协可持续性的前提下满足性能要求。主要发现表明,M7处理器非常适合短推理周期,而M4处理器在较长推理任务中提供更好的能效。尽管M0+处理器在复杂AI模型中效率较低,但仍适用于简单任务。本研究为开发者提供了见解,指导他们设计能效高、在实际应用中表现优异的AI系统。
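Pareto analysis over the (energy, accuracy) plane, as used in this benchmark, amounts to filtering out dominated processor/model configurations. A minimal sketch with made-up numbers (the real benchmark measures many more KPIs):

```python
def pareto_front(configs):
    # configs: list of (name, energy_mJ, accuracy); lower energy and higher
    # accuracy are better. A configuration is kept iff no other configuration
    # is at least as good on both axes and strictly better on at least one.
    front = []
    for name, e, a in configs:
        dominated = any(
            e2 <= e and a2 >= a and (e2 < e or a2 > a)
            for _, e2, a2 in configs
        )
        if not dominated:
            front.append(name)
    return front
```

Every configuration on the returned front represents a defensible energy/accuracy trade-off; everything else can be discarded before tuning.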
cs.AI / 59 / 2602.17529

Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation

利用动态知识图谱和可解释的检索增强生成技术提升电信领域的大型语言模型(LLMs)
Yuan, Dun, Zhou, Hao, Liu, Xue, Chen, Hao, Xin, Yan, Jianzhong, Zhang
Abstract
Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due to domain complexity, evolving standards, and specialized terminology. Therefore, general-domain LLMs may struggle to provide accurate and reliable outputs in this context, leading to increased hallucinations and reduced utility in telecom operations. To address these limitations, this work introduces KG-RAG, a novel framework that integrates knowledge graphs (KGs) with retrieval-augmented generation (RAG) to enhance LLMs for telecom-specific tasks. In particular, the KG provides a structured representation of domain knowledge derived from telecom standards and technical documents, while RAG enables dynamic retrieval of relevant facts to ground the model's outputs. Such a combination improves factual accuracy, reduces hallucination, and ensures compliance with telecom specifications. Experimental results across benchmark datasets demonstrate that KG-RAG outperforms both LLM-only and standard RAG baselines, e.g., KG-RAG achieves an average accuracy improvement of 14.3% over RAG and 21.6% over LLM-only models. These results highlight KG-RAG's effectiveness in producing accurate, reliable, and explainable outputs in complex telecom scenarios.
Chinese Translation
大型语言模型(LLMs)在多种任务中展现出强大的潜力,但由于领域复杂性、标准不断演变以及专业术语的存在,其在电信领域的应用仍然面临挑战。因此,通用领域的LLMs在此背景下可能难以提供准确和可靠的输出,导致幻觉现象增加,降低了在电信操作中的实用性。为了解决这些局限性,本研究提出了KG-RAG——一个将知识图谱(KGs)与检索增强生成(RAG)相结合的新框架,以提升LLMs在电信特定任务中的表现。具体而言,KG提供了基于电信标准和技术文档的领域知识的结构化表示,而RAG则能够动态检索相关事实,以支撑模型的输出。这种结合提高了事实准确性,减少了幻觉现象,并确保符合电信规范。在基准数据集上的实验结果表明,KG-RAG在性能上优于仅使用LLM和标准RAG的基线,例如,KG-RAG在准确性上比RAG提高了14.3%,比仅使用LLM的模型提高了21.6%。这些结果突显了KG-RAG在复杂电信场景中生成准确、可靠和可解释输出的有效性。
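The retrieve-then-ground loop of a KG-RAG pipeline can be sketched as below. The triples and helper names are illustrative assumptions, not taken from the paper; a real system would retrieve from a graph store built from 3GPP documents.

```python
def retrieve_facts(kg, entities, limit=5):
    # kg: (subject, relation, object) triples extracted from telecom
    # standards; return triples mentioning any query entity, so the LLM's
    # answer can be grounded in retrieved facts rather than parametric memory.
    hits = [t for t in kg if t[0] in entities or t[2] in entities]
    return hits[:limit]

def build_grounded_prompt(question, facts):
    # prepend the retrieved facts to the question before calling the LLM
    lines = [f"- {s} {r} {o}" for s, r, o in facts]
    return "Facts:\n" + "\n".join(lines) + f"\nQuestion: {question}"

# Toy knowledge graph (illustrative triples)
kg = [
    ("5G NR", "described_in", "3GPP TS 38.300"),
    ("NSSAI", "stands_for", "Network Slice Selection Assistance Information"),
    ("AMF", "part_of", "5G Core"),
]
```

Because the facts are structured triples with known provenance, the resulting answers are easier to audit for compliance with the underlying specifications.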
cs.AI / 60 / 2602.17544

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

通过可重用性和可验证性评估思维链推理
Aggarwal, Shashank, Mishra, Ram Vikas, Awekar, Amit
Abstract
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
Chinese Translation
在多智能体信息检索(IR)管道中,针对搜索和排序等任务,基于大型语言模型(LLM)的智能体之间以思维链(Chain-of-Thought, CoT)的形式交换中间推理。目前的思维链评估主要集中在目标任务的准确性上。然而,这一指标未能评估推理过程本身的质量或实用性。为了解决这一局限性,我们引入了两个新颖的度量标准:可重用性和可验证性。我们使用思考者-执行者(Thinker-Executor)框架将思维链的生成与执行解耦。可重用性衡量执行者重用思考者的思维链的难易程度。可验证性衡量执行者使用思维链匹配思考者答案的频率。我们在五个基准测试中,针对由十个执行者模型组成的委员会评估了四个思考者模型。我们的结果表明,可重用性和可验证性与标准准确性并不相关,揭示了当前基于准确性的推理能力排行榜中的盲点。令人惊讶的是,我们发现来自专门推理模型的思维链并不总是比来自通用大型语言模型(如 Llama 和 Gemma)的思维链更具可重用性或可验证性。
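The two measures can be sketched as simple statistics over a committee of Executors. `verifiability` follows the abstract's definition (fraction of Executors matching the Thinker's answer given its CoT); `reusability` below is a rough proxy, the accuracy lift the CoT gives Executors, since the abstract does not state the exact formula.

```python
def verifiability(thinker_answer, executor_answers):
    # fraction of Executors that reproduce the Thinker's answer
    # when given the Thinker's CoT
    if not executor_answers:
        return 0.0
    return sum(a == thinker_answer for a in executor_answers) / len(executor_answers)

def reusability(scores_with_cot, scores_without_cot):
    # proxy metric (assumption): mean accuracy gain an Executor obtains
    # from reusing the Thinker's CoT, versus answering without it
    gains = [w - wo for w, wo in zip(scores_with_cot, scores_without_cot)]
    return sum(gains) / len(gains)
```

Both quantities are defined per (Thinker, task) pair and averaged over the Executor committee, which is what makes them independent of the Thinker's own task accuracy.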
cs.AI / 61 / 2602.17547

KLong: Training LLM Agent for Extremely Long-horizon Tasks

KLong:为极长时间范围任务训练的LLM代理
Liu, Yue, Hu, Zhiyuan, Sung, Flood, Zhang, Jiaheng, Hooi, Bryan
Abstract
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
Chinese Translation
本文介绍了KLong,一个开源的LLM代理,旨在解决极长时间范围的任务。其原理是首先通过轨迹分割的监督微调(SFT)对模型进行冷启动,然后通过渐进式强化学习(RL)训练进行扩展。具体而言,我们首先通过一套全面的SFT方案激活基础模型的基本代理能力。接着,我们引入Research-Factory,一个自动化管道,通过收集研究论文和构建评估标准生成高质量的训练数据。利用该管道,我们构建了数千条从Claude 4.5 Sonnet(Thinking)提炼的长时间范围轨迹。为了使用这些极长轨迹进行训练,我们提出了一种新的轨迹分割SFT,它保留早期上下文,逐步截断后期上下文,并保持子轨迹之间的重叠。此外,为了进一步提高长时间范围任务的解决能力,我们提出了一种新颖的渐进式强化学习,它将训练安排为多个阶段,并逐步延长超时。实验结果表明KLong的优越性和泛化能力,如图1所示。值得注意的是,我们提出的KLong(106B)在PaperBench上超越了Kimi K2 Thinking(1T)11.28%,并且性能提升在其他编码基准测试如SWE-bench Verified和MLE-bench中也得到了验证。
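One way to picture trajectory-splitting SFT (keep the early context, window the later context, overlap consecutive splits) is the toy splitter below. This is a guess at the general shape of the scheme; the paper's exact recipe, lengths, and truncation rules may differ.

```python
def split_trajectory(steps, prefix_len=2, window=4, overlap=1):
    # Every sub-trajectory keeps the earliest `prefix_len` steps verbatim
    # (preserving early context), then takes a sliding window over the
    # remaining steps; consecutive windows share `overlap` steps so that
    # context carries across the split points.
    prefix, rest = steps[:prefix_len], steps[prefix_len:]
    stride = window - overlap
    subs = []
    for start in range(0, max(len(rest) - overlap, 1), stride):
        subs.append(prefix + rest[start:start + window])
    return subs
```

Each sub-trajectory then fits the training context length while still exposing the model to the start of the task and to the hand-off between adjacent segments.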
cs.AI / 62 / 2602.17560

ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

ODESteer:一种统一的基于常微分方程的LLM对齐引导框架
Zhao, Hongjue, Sun, Haosen, Kong, Jiangtao, Li, Xiaochang, Wang, Qineng, Jiang, Liwei, Zhu, Qi, Abdelzaher, Tarek, Choi, Yejin, Li, Manling, Shao, Huajie
Abstract
Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework for guiding the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering} that fail to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based \textit{theoretical} framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Derived from this framework, we introduce ODESteer, a kind of ODE-based steering guided by barrier functions, which shows \textit{empirical} advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, a notable $5.7\%$ improvement over TruthfulQA, $2.5\%$ over UltraFeedback, and $2.4\%$ over RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.
Chinese Translation
激活引导或表示工程提供了一种轻量级的方法,通过在推理时操控大型语言模型(LLMs)的内部激活来实现对齐。然而,当前的方法存在两个主要局限性:\textit{(i)} 缺乏统一的理论框架来指导引导方向的设计,以及\textit{(ii)} 过度依赖\textit{单步引导},未能捕捉激活分布的复杂模式。在本研究中,我们提出了一种基于常微分方程(ODEs)的统一\textit{理论}框架,用于LLM对齐中的激活引导。我们展示了传统的激活加法可以被解释为ODE解的一阶近似。基于这一ODE视角,识别引导方向等同于从控制理论中设计一个\textit{障碍函数}。基于该框架,我们引入了ODESteer,这是一种由障碍函数引导的基于ODE的引导方法,显示出在LLM对齐方面的\textit{经验}进展。ODESteer通过将障碍函数定义为正负激活之间的对数密度比来识别引导方向,并利用它构建用于\textit{多步和自适应}引导的ODE。与最先进的激活引导方法相比,ODESteer在多种LLM对齐基准上实现了一致的经验改进,特别是在TruthfulQA上提高了$5.7\%$,在UltraFeedback上提高了$2.5\%$,在RealToxicityPrompts上提高了$2.4\%$。我们的工作通过ODE统一激活引导的理论基础,建立了对LLM对齐中激活引导的新原则视角,并通过提出的ODESteer方法进行了经验验证。
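The ODE view can be made concrete with a toy Euler integrator: conventional one-step activation addition is the first-order special case, while multi-step steering lets the direction adapt as the activation moves. This is a hedged sketch of the general idea, not ODESteer's barrier-function construction.

```python
import numpy as np

def one_step_steer(h, direction, alpha=1.0):
    # conventional activation addition: a single first-order update
    return h + alpha * direction

def multi_step_steer(h, direction_fn, alpha=1.0, steps=10):
    # Euler integration of dh/dt = direction_fn(h): take `steps` small
    # updates instead of one large one, so the steering field can adapt
    # to the current activation rather than being applied blindly.
    h = np.array(h, dtype=float)
    dt = alpha / steps
    for _ in range(steps):
        h = h + dt * direction_fn(h)
    return h
```

With a constant steering field the two coincide exactly, which is the sense in which activation addition approximates the ODE solution to first order; with a state-dependent field the multi-step trajectory diverges from the single jump.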
cs.AI / 63 / 2602.17566

A Hybrid Federated Learning Based Ensemble Approach for Lung Disease Diagnosis Leveraging Fusion of SWIN Transformer and CNN

基于混合联邦学习的肺病诊断集成方法:融合SWIN Transformer与CNN
Chowdhury, Asif Hasan, Islam, Md. Fahim, Riad, M Ragib Anjum, Hashem, Faiyaz Bin, Reza, Md Tanzim, Alam, Md. Golam Rabiul
Abstract
The significant advancements in computational power create a vast opportunity for using Artificial Intelligence in different applications of healthcare and medical science. A Hybrid FL-Enabled Ensemble Approach For Lung Disease Diagnosis Leveraging a Combination of SWIN Transformer and CNN combines the cutting-edge technologies of AI and Federated Learning. Since medical specialists and hospitals will have a shared data space, we can, based on that data and with the help of Artificial Intelligence and the integration of federated learning, introduce a secure and distributed system for medical data processing and create an efficient and reliable system. The proposed hybrid model enables the detection of COVID-19 and Pneumonia based on x-ray reports. We use advanced and the latest available technology offered by Tensorflow and Keras along with the Microsoft-developed Vision Transformer, which can help fight the pandemic that the world must confront together. We focused on using the latest available CNN models (DenseNet201, Inception V3, VGG 19) and the Transformer model SWIN Transformer in order to prepare our hybrid model, which can provide a reliable solution as a helping hand for physicians in the medical field. In this research, we discuss how the Federated learning-based Hybrid AI model can improve the accuracy of disease diagnosis and severity prediction for a patient using a real-time continual learning approach, and how the integration of federated learning can ensure hybrid model security and preserve the authenticity of the information.
Chinese Translation
计算能力的显著进步为人工智能在医疗保健和医学科学的不同应用中创造了广阔的机会。基于混合联邦学习的肺病诊断集成方法结合了人工智能和联邦学习的前沿技术。由于医疗专家和医院将共享数据空间,基于这些数据,借助人工智能和联邦学习的整合,我们可以引入一个安全且分布式的医疗数据处理系统,从而创建一个高效且可靠的系统。所提出的混合模型能够基于X光报告检测COVID-19和肺炎。我们将利用Tensorflow和Keras提供的先进技术,以及微软开发的视觉变换器(Vision Transformer),帮助应对全球共同面对的疫情。我们专注于使用最新的CNN模型(DenseNet201、Inception V3、VGG 19)和变换器模型SWIN Transformer,以构建我们的混合模型,为医疗领域的医生提供可靠的解决方案。在本研究中,我们将讨论基于联邦学习的混合人工智能模型如何通过实时持续学习方法提高疾病诊断和患者病情预测的准确性,以及联邦学习的整合如何确保混合模型的安全性并保持信息的真实性。
cs.AI / 64 / 2602.17594

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

AI游戏商店:可扩展的、开放式的机器通用智能评估方法与人类游戏
Ying, Lance, Truong, Ryan, Sharma, Prafull, Zhao, Kaiya Ivy, Cloos, Nathan, Allen, Kelsey R., Griffiths, Thomas L., Collins, Katherine M., Hernández-Orallo, José, Isola, Phillip, Gershman, Samuel J., Tenenbaum, Joshua B.
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play \textbf{all conceivable human games}, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10\% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
Chinese Translation
在快速技术进步的时代,严格评估机器智能与人类通用智能广泛谱系之间的关系变得愈加重要和具有挑战性。传统的人工智能基准通常仅评估有限范围内的人类活动的狭窄能力。大多数基准也是静态的,随着开发者显式或隐式地进行优化,迅速达到饱和。我们提出,通过一种特别强大的通用游戏玩法来评估类人通用智能,可能是一种更有前景的方法:研究人工智能系统如何以及多好地玩和学习玩\textbf{所有可以想象的人类游戏},并与具有相同经验、时间或其他资源的人类玩家进行比较。我们将"人类游戏"定义为由人类为人类设计的游戏,并论证这一人类可以想象和享受的所有游戏空间——"人类游戏多元宇宙"的评估适宜性。作为实现这一愿景的第一步,我们推出了AI游戏商店,这是一个可扩展的开放平台,利用人机协作的方式,通过自动获取和适应来自流行人类数字游戏平台的标准化和容器化游戏环境变体,合成新的代表性人类游戏。作为概念验证,我们基于Apple App Store和Steam的热门排行榜生成了100款此类游戏,并对七个前沿视觉语言模型(VLMs)在短期游戏体验中进行了评估。最佳模型在大多数游戏中的得分未超过人类平均分的10\%,尤其在挑战世界模型学习、记忆和规划的游戏中表现不佳。我们总结了一系列下一步措施,以将AI游戏商店构建为一种实用的方法,以衡量和推动机器向类人通用智能的进步。
cs.AI / 65 / 2602.17602

MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models

MolHIT:利用层次离散扩散模型推进分子图生成
Jung, Hojung, Hormazabal, Rodrigo, Jo, Jaehyeong, Park, Youngrok, Roh, Kyunggeun, Yun, Se-Young, Han, Sehui, Jeong, Dae-Woong
Abstract
Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle to meet the desired properties compared to 1D modeling. In this work, we introduce MolHIT, a powerful molecular graph generation framework that overcomes long-standing performance limitations in existing methods. MolHIT is based on the Hierarchical Discrete Diffusion Model, which generalizes discrete diffusion to additional categories that encode chemical priors, and decoupled atom encoding that splits the atom types according to their chemical roles. Overall, MolHIT achieves new state-of-the-art performance on the MOSES dataset with near-perfect validity for the first time in graph diffusion, surpassing strong 1D baselines across multiple metrics. We further demonstrate strong performance in downstream tasks, including multi-property guided generation and scaffold extension.
Chinese Translation
利用扩散模型进行分子生成已成为人工智能驱动的药物发现和材料科学中的一个有前景的方向。尽管由于二维分子图的离散特性,图扩散模型得到了广泛应用,但现有模型在化学有效性方面表现不佳,并且在满足所需属性方面相较于一维建模存在困难。在本研究中,我们介绍了MolHIT,一个强大的分子图生成框架,克服了现有方法中的长期性能限制。MolHIT基于层次离散扩散模型,该模型将离散扩散推广到编码化学先验的额外类别,并采用解耦的原子编码,根据原子的化学角色拆分原子类型。总体而言,MolHIT在MOSES数据集上首次实现了图扩散的新一代最先进性能,具有近乎完美的有效性,超越了多个指标下强大的1D基线。我们进一步展示了在下游任务中的强大表现,包括多属性引导生成和骨架扩展。
cs.AI / 66 / 2602.17607

AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing

AutoNumerics:一个自主的、与偏微分方程无关的多智能体科学计算管道
Du, Jianda, Sun, Youran, Yang, Haizhao
Abstract
PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce \texttt{AutoNumerics}, a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. Unlike black-box neural solvers, our framework generates transparent solvers grounded in classical numerical analysis. We introduce a coarse-to-fine execution strategy and a residual-based self-verification mechanism. Experiments on 24 canonical and real-world PDE problems demonstrate that \texttt{AutoNumerics} achieves competitive or superior accuracy compared to existing neural and LLM-based baselines, and correctly selects numerical schemes based on PDE structural properties, suggesting its viability as an accessible paradigm for automated PDE solving.
Chinese Translation
偏微分方程(PDE)在科学和工程建模中至关重要,但设计准确的数值求解器通常需要大量的数学专业知识和手动调优。最近基于神经网络的方法提高了灵活性,但往往需要高昂的计算成本,并且可解释性有限。我们介绍了AutoNumerics,这是一个多智能体框架,能够根据自然语言描述自主设计、实现、调试和验证通用偏微分方程的数值求解器。与黑箱神经求解器不同,我们的框架生成基于经典数值分析的透明求解器。我们引入了一种粗到细的执行策略和基于残差的自我验证机制。在24个经典和实际的偏微分方程问题上的实验表明,AutoNumerics在准确性上与现有的神经网络和基于大语言模型(LLM)的基线相比表现出竞争力或更优的结果,并且能够根据偏微分方程的结构特性正确选择数值方案,表明其作为自动化求解偏微分方程的一种易用范式的可行性。
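The paper's code is not reproduced here, but the residual-based self-verification idea the abstract names can be illustrated on a toy problem: solve a discretised PDE with a classical scheme, then independently check that the computed solution satisfies the discrete equations and sits within the expected truncation error. The problem choice, grid size, and tolerances below are our assumptions, not the authors' setup:

```python
import numpy as np

# Toy stand-in for residual-based self-verification: solve the 1-D Poisson
# problem -u'' = f on (0, 1) with u(0) = u(1) = 0, where f is chosen so the
# exact solution is sin(pi x), then check the solver's own discrete residual.
n = 200
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)[1:-1]        # interior grid points
f = np.pi**2 * np.sin(np.pi * x)

# Standard second-order finite-difference Laplacian (a classical scheme).
A = (np.diag(np.full(n - 1, 2.0))
     - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h**2
u = np.linalg.solve(A, f)

# Self-verification: the discrete residual should sit near machine precision,
# while the distance to the exact solution is bounded by truncation error O(h^2).
residual = np.max(np.abs(A @ u - f))
disc_err = np.max(np.abs(u - np.sin(np.pi * x)))
print(residual, disc_err)
```

A verification agent of this kind can accept or reject a candidate solver without knowing how it was generated, which is what makes the check PDE-agnostic.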
cs.AI / 67 / 2602.17663

CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

CLEF HIPE-2026:评估从多语言历史文本中准确高效的人物-地点关系提取
Opitz, Juri, Raclé, Corina, Boros, Emanuela, Michail, Andrianos, Romanello, Matteo, Ehrmann, Maud, Clematide, Simon
Abstract
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person-place associations in multiple languages and time periods. Systems are asked to classify relations of two types: $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?"), requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
Chinese Translation
HIPE-2026 是一个专注于从嘈杂的多语言历史文本中提取人物-地点关系的 CLEF 评估实验室。该实验室在 HIPE-2020 和 HIPE-2022 活动的基础上,进一步扩展了语义关系提取的系列,目标是识别多种语言和时间段中的人物-地点关联。系统被要求对两种类型的关系进行分类——$at$(“这个人曾经在这个地方吗?”)和 $isAt$(“这个人在出版时是否位于这个地方?”),这需要对时间和地理线索进行推理。该实验室引入了一个三重评估框架,综合评估准确性、计算效率和领域泛化能力。通过将关系提取与大规模历史数据处理相结合,HIPE-2026 旨在支持知识图谱构建、历史传记重建和数字人文学科中的空间分析等下游应用。
计算语言学 (Computation and Language)
43
cs.CL / 1 / 2602.16802

References Improve LLM Alignment in Non-Verifiable Domains

参考文献提升非可验证领域中的大语言模型对齐
Shi, Kejian, Liu, Yixin, Wang, Peifeng, Fabbri, Alexander R., Joty, Shafiq, Cohan, Arman
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
Chinese Translation
尽管具有可验证奖励的强化学习(RLVR)在推理任务中表现出强大的有效性,但它无法直接应用于缺乏真实验证者的非可验证领域,例如大语言模型(LLM)的对齐。在本研究中,我们探讨了参考引导的LLM评估者是否能够通过充当软“验证者”来弥补这一差距。首先,我们设计了评估协议,利用参考输出增强基于LLM的评估者在LLM对齐中的表现。通过全面的实验,我们表明,参考引导的方法显著提高了能力较弱的LLM评判者在使用前沿模型参考时的准确性;更强的LLM评判者也可以通过高质量(即人类撰写的)参考得到增强。在这些改进的评判者基础上,我们展示了高质量参考在对齐调优中的实用性,其中使用参考引导的LLM作为评判者进行自我改进。我们表明,参考引导的自我改进相较于直接对参考输出进行监督微调(SFT)以及使用无参考评判者的自我改进均有明显提升,达到了与强大的微调奖励模型ArmoRM相当的性能。具体而言,我们的方法在使用Llama-3-8B-Instruct的AlpacaEval和Arena-Hard上分别达到了73.1%和58.7%,在使用Qwen2.5-7B时达到了70.0%和74.1%,相较于SFT蒸馏分别平均提高了+20.2 / +17.1分,以及在AlpacaEval / Arena-Hard上相较于无参考自我改进提高了+5.3 / +3.6分。这些结果突显了使用参考引导的LLM评估者在非可验证领域中实现有效LLM后训练的潜力。
cs.CL / 2 / 2602.16811

Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

评估单语和多语大型语言模型在希腊问答中的表现:DemosQA基准
Mastrokostas, Charalampos, Giarelis, Nikolaos, Karacapilidis, Nikos
Abstract
Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.
Chinese Translation
近年来,自然语言处理和深度学习的进展使得大型语言模型(LLMs)的发展成为可能,这些模型在包括问答(QA)在内的广泛任务中显著提升了最新技术水平。尽管取得了这些进展,LLMs的研究主要集中在资源丰富的语言(例如,英语)上,最近才开始关注多语言模型。然而,这些模型在训练数据上存在偏向于少数流行语言的问题,或者依赖于从资源丰富语言到资源匮乏语言的迁移学习;这可能导致社会、文化和历史方面的误表征。为了解决这一挑战,针对资源匮乏语言开发了单语LLMs;然而,与多语模型相比,它们在语言特定任务上的有效性仍然研究较少。在本研究中,我们通过以下贡献填补了希腊问答领域的研究空白:(i)DemosQA,一个新颖的数据集,使用社交媒体用户提问和社区审核的答案构建,以更好地捕捉希腊的社会和文化精神;(ii)一个适应于多种问答数据集和语言的内存高效LLM评估框架;以及(iii)对11个单语和多语LLMs在6个人工策划的希腊问答数据集上的广泛评估,使用3种不同的提示策略。我们发布了我们的代码和数据,以促进可重复性。
cs.CL / 3 / 2602.16813

One-step Language Modeling via Continuous Denoising

通过连续去噪实现一步语言建模
Lee, Chanhyuk, Yoo, Jaehoon, Agarwal, Manan, Shah, Sheel, Huang, Jerry, Raghunathan, Aditi, Hong, Seunghoon, Boffi, Nicholas M., Kim, Jinwoo
Abstract
Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at https://github.com/david3684/flm.
Chinese Translation
基于离散扩散的语言模型因其潜在的生成速度快于自回归模型而受到广泛关注。然而,在实际应用中,它们在少步生成时样本质量急剧下降,未能实现这一承诺。在本文中,我们展示了利用基于流的连续去噪的语言模型在质量和速度上均能超越离散扩散模型。通过重新审视离散模态上的流的基本原理,我们构建了一个基于流的语言模型(Flow-based Language Model, FLM),该模型对一热编码的标记进行欧几里得去噪。我们表明,该模型可以通过交叉熵目标预测干净数据进行训练,并引入了一种简单的时间重参数化方法,大大提高了训练稳定性和生成质量。通过将FLM蒸馏为其相关的流映射,我们获得了一个蒸馏流映射语言模型(Distilled Flow Map Language Model, FMLM),能够进行少步生成。在LM1B和OWT语言数据集上,FLM的生成质量达到了与最先进的离散扩散模型相匹配的水平。使用FMLM,我们的方法在各个方面超越了最近的少步语言模型,其一步生成的质量超过了它们的八步生成质量。我们的工作质疑了广泛接受的假设,即离散扩散过程对于离散模态的生成建模是必要的,并为大规模加速的基于流的语言建模铺平了道路。代码可在 https://github.com/david3684/flm 获取。
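The abstract's core construction, Euclidean denoising over one-hot token encodings trained with a clean-data cross-entropy objective, can be sketched without any neural network. Everything below (the linear interpolant, the time value, the placeholder logits standing in for a denoiser's output) is an illustrative assumption, not the released FLM code:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                                  # toy vocabulary size
token = 3                              # id of the clean token
x1 = np.eye(V)[token]                  # one-hot encoding of the clean token

# Flow-style linear interpolant between Gaussian noise x0 and the one-hot
# target x1; the precise path and time schedule used by FLM are assumptions.
t = 0.3
x0 = rng.normal(size=V)
xt = (1.0 - t) * x0 + t * x1

# The denoiser is trained to predict the clean token from (xt, t) via cross
# entropy; random logits stand in for a network's output in this sketch.
logits = rng.normal(size=V)
log_probs = logits - np.log(np.sum(np.exp(logits)))
loss = -log_probs[token]               # cross-entropy against the clean token
print(xt.shape, float(loss))
```

Because the state `xt` lives in continuous Euclidean space, few-step (or distilled one-step) generation amounts to integrating a learned velocity field rather than iterating a discrete corruption process.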
cs.CL / 4 / 2602.16836

Claim Automation using Large Language Model

基于大型语言模型的索赔自动化
Mo, Zhengda, Quan, Zhiyu, O'Donohue, Eli, Zhong, Kaiwen
Abstract
While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.
Chinese Translation
尽管大型语言模型(LLMs)在通用语言任务上取得了强劲的表现,但它们在受监管和数据敏感领域(包括保险)的应用仍然有限。我们利用数百万条历史保修索赔,提出了一种本地部署的治理意识语言建模组件,该组件能够从非结构化索赔叙述中生成结构化的纠正措施建议。我们使用低秩适应(Low-Rank Adaptation, LoRA)对预训练的LLMs进行微调,将模型范围限定在索赔处理流程中的初始决策模块,以加快索赔调整员的决策速度。我们使用一个多维评估框架对该模块进行评估,该框架结合了自动语义相似性指标和人工评估,从而实现对实际效用和预测准确性的严格检验。我们的结果表明,特定领域的微调显著优于商业通用和基于提示的LLMs,约80%的评估案例与真实的纠正措施实现了近乎完全一致的匹配。总体而言,本研究提供了理论和实证证据,证明领域自适应微调可以使模型输出分布更紧密地与现实世界的操作数据对齐,展示了其作为保险应用中可靠和可治理的构建模块的潜力。
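The fine-tuning method named in the abstract, Low-Rank Adaptation (LoRA), keeps the pretrained weight frozen and learns only a low-rank update. A minimal NumPy sketch of that mechanism follows; the dimensions, rank, and scaling value are toy choices, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                     # layer dims and LoRA rank (toy values)
W = rng.normal(size=(d, k))             # frozen pretrained weight

# LoRA learns a low-rank update B @ A in place of a dense delta-W, so only
# r * (d + k) parameters are trained instead of d * k.
A = rng.normal(size=(r, k)) * 0.01      # trainable down-projection, small init
B = np.zeros((d, r))                    # trainable up-projection, zero init
alpha = 8.0                             # common scaling hyperparameter

def lora_forward(x):
    # Frozen path plus scaled low-rank adapter path.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
print(A.size + B.size, W.size)          # 512 trainable vs 4096 frozen parameters
```

The zero initialization of `B` means the adapted layer starts out identical to the pretrained one, which is part of what makes LoRA practical for locally deployed, governance-constrained settings like the one described.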
cs.CL / 5 / 2602.16843

BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

BanglaSummEval:针对孟加拉语摘要的无参考事实一致性评估
Rafid, Ahmed, Adib, Rumman, Ahmed, Fariya, Abrar, Ajwad, Islam, Mohammed Saidul
Abstract
Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
Chinese Translation
评估事实一致性对于可靠的文本摘要至关重要,特别是在医疗和新闻等高风险领域。然而,大多数现有的评估指标忽视了孟加拉语,这是一种广泛使用但资源匮乏的语言,并且通常依赖于参考摘要。我们提出了BanglaSummEval,这是一个基于无参考、问答的框架,用于评估孟加拉语摘要的事实一致性。该方法通过从源文档和摘要中自动生成的问题和答案来评估事实准确性和内容覆盖率。一个单一的多语言指令调优语言模型处理问题生成、问题回答、候选答案提取和问题重要性加权。这种统一设计降低了系统复杂性和计算成本。为了捕捉超越表面重叠的语义一致性,我们使用BERTScore-Recall进行答案比较。我们在来自教育和医疗领域的300个人工撰写的摘要上验证了BanglaSummEval,显示出与专家人类判断的强相关性(Pearson's $r = 0.694$, Spearman's $\rho = 0.763$)。通过提供可解释的逐步诊断以及可靠的评估分数,BanglaSummEval为低资源语言环境中的事实一致性评估提供了一个实用且透明的解决方案。
cs.CL / 6 / 2602.16852

Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

梅恩兹仍然是梅恩兹,但大型语言模型并不会说它的方言
Bui, Minh Duc, Mager, Manuel, Kann, Peter Herbert, von der Wense, Katharina
Abstract
Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.
Chinese Translation
梅恩泽里希(Meenzerisch)是德国美因茨市所讲的方言,也是美因茨狂欢节的传统语言,这一年度庆典在德国广为人知。然而,梅恩泽里希正面临灭绝的危机,这一命运与许多其他德语方言相同。自然语言处理(NLP)有潜力帮助语言和方言的保护与复兴工作。然而,迄今为止,没有NLP研究关注梅恩泽里希。本研究首次在NLP领域专注于美因茨方言。我们介绍了一个数字词典——一个基于现有资源(Schramm, 1966)而生成的NLP准备数据集,以支持研究人员对该语言进行建模和基准测试。该数据集包含2,351个方言词汇及其在标准德语中的释义。然后,我们使用该数据集回答以下研究问题:(1)最先进的大型语言模型(LLMs)能否为方言词汇生成定义?(2)LLMs能否根据定义生成梅恩泽里希词汇?我们的实验表明,LLMs在这两方面均未能成功:定义生成的最佳模型仅达到6.27%的准确率,而词汇生成模型的准确率为1.51%。随后,我们进行了两个额外实验,以观察通过少样本学习和从训练集中提取规则(然后传递给LLM)是否能提高准确率。尽管这些方法能够改善结果,但准确率仍低于10%。这突显出迫切需要更多资源并加强对德语方言的研究。
cs.CL / 7 / 2602.16922

A Conceptual Hybrid Framework for Post-Quantum Security: Integrating BB84 QKD, AES, and Bio-inspired Mechanisms

后量子安全的概念混合框架:整合BB84量子密钥分发、AES和生物启发机制
Abir, Md. Ismiel Hossen
Abstract
Quantum computing is a significant risk to classical cryptography, especially RSA, which depends on the difficulty of factoring large numbers. Classical factorization methods, such as Trial Division and Pollard's Rho, are inefficient for large keys, while Shor's quantum algorithm can break RSA efficiently in polynomial time. This research studies RSA's vulnerabilities under both classical and quantum attacks and designs a hybrid security framework to ensure data protection in the post-quantum era. The conceptual framework combines AES encryption for classical security, BB84 Quantum Key Distribution (QKD) for secure key exchange with eavesdropping detection, quantum state comparison for lightweight authentication, and a bio-inspired immune system for adaptive threat detection. RSA is vulnerable to Shor's algorithm, BB84 achieves full key agreement in ideal conditions, and it detects eavesdropping with high accuracy. The conceptual model includes both classical and quantum security methods, providing a scalable and adaptive solution for Post-Quantum encryption data protection. This work primarily proposes a conceptual framework. Detailed implementation, security proofs, and extensive experimental validation are considered future work.
Chinese Translation
量子计算对经典密码学,特别是依赖于大数分解难度的RSA,构成了重大风险。经典分解方法,如试除法和波拉德的Rho算法,对于大密钥而言效率低下,而肖尔的量子算法能够在多项式时间内有效破解RSA。本研究探讨了RSA在经典和量子攻击下的脆弱性,并设计了一个混合安全框架,以确保后量子时代的数据保护。该概念框架结合了AES加密以提供经典安全性,BB84量子密钥分发(QKD)以实现安全的密钥交换和窃听检测,量子态比较以实现轻量级认证,以及生物启发的免疫系统以实现自适应威胁检测。RSA对肖尔算法脆弱,BB84在理想条件下可实现完全的密钥一致,并以高精度检测窃听。该概念模型包括经典和量子安全方法,为后量子加密数据保护提供了可扩展和自适应的解决方案。本研究主要提出了一个概念框架,详细的实施、安全证明和广泛的实验验证将作为未来工作考虑。
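The BB84 behaviour claimed in the abstract, full key agreement under ideal conditions and detectable eavesdropping, can be checked with a small classical simulation of basis sifting: an intercept-resend attacker raises the sifted-key quantum bit error rate (QBER) from 0 to roughly 25%, which is exactly the signature protocols test for. The parameters below are our choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000                                        # number of transmitted qubits

def bb84_qber(eavesdrop):
    """Sifted-key error rate for ideal BB84, optionally under intercept-resend."""
    alice_bits = rng.integers(0, 2, n)
    alice_bases = rng.integers(0, 2, n)          # 0 = rectilinear, 1 = diagonal
    bits, bases = alice_bits.copy(), alice_bases
    if eavesdrop:                                # intercept-resend attack
        eve_bases = rng.integers(0, 2, n)
        wrong = eve_bases != bases
        bits[wrong] = rng.integers(0, 2, wrong.sum())  # wrong basis => random bit
        bases = eve_bases                        # photons are resent in Eve's bases
    bob_bases = rng.integers(0, 2, n)
    bob_bits = bits.copy()
    wrong = bob_bases != bases
    bob_bits[wrong] = rng.integers(0, 2, wrong.sum())
    sifted = bob_bases == alice_bases            # keep rounds with matching bases
    return float(np.mean(alice_bits[sifted] != bob_bits[sifted]))

q_clean = bb84_qber(False)
q_eve = bb84_qber(True)
print(q_clean, q_eve)                            # ~0.0 without Eve, ~0.25 with Eve
```

Comparing the observed QBER on a sacrificed key subset against a threshold (well below 25%) is how the framework's eavesdropping detection would abort a compromised exchange.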
cs.CL / 8 / 2602.16938

ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

ConvApparel:用于对话推荐系统中用户模拟器的基准数据集和验证框架
Meshi, Ofer, Balog, Krisztian, Goldman, Sally, Caciularu, Avi, Tennenholtz, Guy, Jeong, Jihwan, Globerson, Amir, Boutilier, Craig
Abstract
The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
Chinese Translation
基于大型语言模型(LLM)的用户模拟器在提升对话人工智能(AI)方面的潜力受到一个关键的“现实差距”的制约,这导致系统在模拟交互中表现良好,但在现实世界中可能表现不佳。我们提出了ConvApparel,一个旨在解决这一差距的人机对话新数据集。其独特的双代理数据收集协议——同时使用“优秀”和“劣质”推荐者——通过捕捉广泛的用户体验并丰富用户满意度的第一人称注释,实现了反事实验证。我们提出了一个综合验证框架,结合统计对齐、人类相似度评分和反事实验证,以测试模型的泛化能力。我们的实验揭示了所有模拟器之间存在显著的现实差距。然而,该框架还表明,数据驱动的模拟器在反事实验证中表现优于提示基线,特别是在适应未见行为时更为真实,暗示它们具备更为稳健(尽管不完美)的用户模型。
cs.CL / 9 / 2602.16957

When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English

当语义重叠不足时:土耳其语与英语之间的跨语言委婉语转移
Biyik, Hasan Can, Barak, Libby, Peng, Jing, Feldman, Anna
Abstract
Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms, and in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.
Chinese Translation
委婉语替代社会敏感表达,通常会软化或重新框定意义,而它们对文化和语用背景的依赖使得跨语言建模变得复杂。在本研究中,我们探讨了跨语言等价性如何影响多语言委婉语检测中的转移。我们将土耳其语和英语中的潜在委婉语术语(Potentially Euphemistic Terms, PETs)根据其功能、语用和语义对齐情况分为重叠子集(Overlapping, OPETs)和非重叠子集(Non-Overlapping, NOPETs)。我们的研究结果揭示了一种转移不对称性:语义重叠不足以保证正向转移,尤其是在低资源的土耳其语到英语的方向上,重叠委婉语的表现甚至可能下降,而在某些情况下,基于NOPET的训练则可能改善表现。标签分布的差异有助于解释这些反直觉的结果。类别级分析表明,转移可能受到特定领域对齐的影响,尽管证据因稀疏性而有限。
cs.CL / 10 / 2602.16959

Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry

特征情绪空间:对经典波斯诗歌心理模式的不确定性感知谱图分析
Shahnazari, Kourosh, Ayyoubzadeh, Seyed Moein, Keshtparvar, Mohammadali
Abstract
Classical Persian poetry is a historically sustained archive in which affective life is expressed through metaphor, intertextual convention, and rhetorical indirection. These properties make close reading indispensable while limiting reproducible comparison at scale. We present an uncertainty-aware computational framework for poet-level psychological analysis based on large-scale automatic multi-label annotation. Each verse is associated with a set of psychological concepts, per-label confidence scores, and an abstention flag that signals insufficient evidence. We aggregate confidence-weighted evidence into a Poet $\times$ Concept matrix, interpret each poet as a probability distribution over concepts, and quantify poetic individuality as divergence from a corpus baseline using Jensen-Shannon divergence and Kullback-Leibler divergence. To capture relational structure beyond marginals, we build a confidence-weighted co-occurrence graph over concepts and define an Eigenmood embedding through Laplacian spectral decomposition. On a corpus of 61,573 verses across 10 poets, 22.2% of verses are abstained, underscoring the analytical importance of uncertainty. We further report sensitivity analysis under confidence thresholding, selection-bias diagnostics that treat abstention as a category, and a distant-to-close workflow that retrieves verse-level exemplars along Eigenmood axes. The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
Chinese Translation
经典波斯诗歌是一个历史悠久的档案,其中通过隐喻、互文惯例和修辞间接表达情感生活。这些特性使得细读不可或缺,同时限制了大规模的可重复比较。我们提出了一种基于大规模自动多标签注释的诗人级心理分析的不确定性感知计算框架。每一节诗与一组心理概念、每个标签的置信度分数以及一个表示证据不足的弃权标志相关联。我们将置信度加权的证据汇总成一个诗人 × 概念矩阵,将每位诗人视为对概念的概率分布,并使用詹森-香农散度(Jensen–Shannon divergence)和库尔巴克-莱布勒散度(Kullback–Leibler divergence)量化诗歌个体性与语料库基线的偏离。为了捕捉超出边际的关系结构,我们构建了一个基于概念的置信度加权共现图,并通过拉普拉斯谱分解定义了特征情绪嵌入。在一个包含61,573节诗的语料库中,22.2%的诗句被弃权,强调了不确定性的分析重要性。我们进一步报告了在置信度阈值下的敏感性分析、将弃权视为一个类别的选择偏倚诊断,以及一个从远到近的工作流程,该流程沿特征情绪轴检索诗句级别的示例。所得到的框架支持可扩展、可审计的数字人文学科分析,同时通过将不确定性从诗句级证据传播到诗人级推断,保持了解释的谨慎。
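Two of the abstract's building blocks, Jensen-Shannon divergence from a corpus baseline and a Laplacian spectral ("Eigenmood") embedding of a concept co-occurrence graph, can be sketched on made-up data. The matrices below are illustrative assumptions only; the paper's actual Poet × Concept matrix and graph are far larger:

```python
import numpy as np

# Made-up Poet x Concept matrix of confidence-weighted evidence
# (3 poets, 3 concepts).
M = np.array([[8.0, 1.0, 1.0],
              [2.0, 5.0, 3.0],
              [3.0, 3.0, 4.0]])
P = M / M.sum(axis=1, keepdims=True)         # each poet as a distribution
baseline = M.sum(axis=0) / M.sum()           # corpus-level concept distribution

def js_divergence(p, q):
    # Jensen-Shannon divergence (base 2): symmetric and bounded in [0, 1].
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

individuality = np.array([js_divergence(p, baseline) for p in P])

# Eigenmood-style embedding: spectral decomposition of the graph Laplacian
# of a (made-up) symmetric concept co-occurrence graph.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
evals, evecs = np.linalg.eigh(L)             # eigenvalues in ascending order
embedding = evecs[:, 1:]                     # drop the constant 0-eigenvector
print(individuality.round(3), evals.round(3))
```

Abstained verses would simply contribute no confidence-weighted evidence to `M`, which is how verse-level uncertainty propagates into the poet-level scores.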
cs.CL / 11 / 2602.17003

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Persona2Web:基于用户历史的上下文推理个性化网络代理基准测试
Kim, Serin, Lee, Sangam, Lee, Dongha
Abstract
Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://anonymous.4open.science/r/Persona2Web-73E8.
Chinese Translation
大型语言模型推动了网络代理的发展,但当前的代理缺乏个性化能力。由于用户很少会明确指定其意图的每一个细节,实用的网络代理必须能够通过推断用户偏好和上下文来解释模糊查询。为了解决这一挑战,我们提出了Persona2Web,这是第一个用于评估真实开放网络上个性化网络代理的基准,基于“澄清以实现个性化”(clarify-to-personalize)原则,该原则要求代理根据用户历史来解决模糊性,而不是依赖于明确的指令。Persona2Web包括:(1)揭示用户偏好的用户历史,这些偏好在较长时间跨度内隐含存在;(2)需要代理推断隐含用户偏好的模糊查询;(3)一个关注推理的评估框架,能够对个性化进行细致的评估。我们在各种代理架构、骨干模型、历史访问方案和不同模糊级别的查询上进行了广泛的实验,揭示了个性化网络代理行为中的关键挑战。为了可重复性,我们的代码和数据集已公开可用,网址为 https://anonymous.4open.science/r/Persona2Web-73E8。
cs.CL / 12 / 2602.17022

ReIn: Conversational Error Recovery with Reasoning Inception

ReIn:基于推理引导的对话错误恢复
Kim, Takyoung, Nam, Jinseok, Basu, Chandrayee, Fan, Xing, Ma, Chengyuan, Ji, Heng, Tur, Gokhan, Hakkani-Tür, Dilek
Abstract
Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user's ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.
Chinese Translation
由大型语言模型(LLMs)驱动的对话代理在固定任务导向的对话数据集上表现出色,但仍然容易受到用户引发的意外错误的影响。本研究不专注于错误预防,而是关注错误恢复,这需要准确诊断错误的对话上下文并执行适当的恢复计划。在由于显著的成本和时间要求而无法进行模型微调或提示修改的现实约束下,我们探讨代理是否能够从上下文存在缺陷的交互中恢复,以及如何在不改变模型参数和提示的情况下调整其行为。为此,我们提出了推理引导(Reasoning Inception,ReIn),这是一种测试时干预方法,将初步推理植入代理的决策过程中。具体而言,外部引导模块识别对话上下文中的预定义错误并生成恢复计划,这些计划随后被整合到代理的内部推理过程中,以指导纠正行动,而无需修改其参数或系统提示。我们通过系统模拟直接妨碍用户目标成功完成的对话失败场景来评估ReIn:用户的模糊和不支持请求。在多种代理模型和引导模块的组合中,ReIn显著提高了任务成功率,并对未见过的错误类型具有良好的泛化能力。此外,它在性能上始终优于显式的提示修改方法,突显了其作为一种高效的即时方法的实用性。对其操作机制的深入分析,特别是在指令层次结构方面,表明与ReIn共同定义恢复工具可以作为一种安全有效的策略,以在不修改基础模型或系统提示的情况下提高对话代理的韧性。
cs.CL / 13 / 2602.17045

Large Language Models Persuade Without Planning Theory of Mind

大型语言模型在没有计划的心智理论下进行劝说
Moore, Jared, Overmark, Rasmus, Cooper, Ned, Cibralic, Beba, Haber, Nick, Jones, Cameron R.
Abstract
A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader's sensitivity to a given target's knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets' real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs' potential to influence people's beliefs and behavior.
Chinese Translation
越来越多的研究试图通过静态的、非互动的问答基准来评估人类和大型语言模型(LLMs)的心智理论(ToM)能力。然而,该领域的理论研究表明,第一人称互动是心智理论的重要组成部分,而此类预测性、旁观者任务可能无法有效评估这一能力。我们通过一个新颖的心智理论任务来填补这一空白,该任务要求代理人通过战略性地揭示信息来说服目标选择三项政策提案中的一项。成功取决于劝说者对目标知识状态(目标对政策的了解程度)和动机状态(目标对不同结果的重视程度)的敏感性。我们改变了这些状态是被揭示给劝说者还是被隐藏,在后者的情况下,劝说者必须询问或推断这些状态。在实验1中,参与者劝说一个被编程为仅进行理性推理的机器人。在被揭示的条件下,LLMs表现出色,但在被隐藏的条件下表现低于随机水平,表明在引出和使用心理状态信息所需的多步骤规划上存在困难。人类在两个条件下的表现中等,表明他们能够进行这种规划。在实验2中,由人类目标扮演机器人,而在实验3中,我们测量了人类目标的真实信念是否发生变化,LLMs在所有条件下均优于人类劝说者。这些结果表明,有效的劝说可以在没有明确的心智理论推理的情况下发生(例如,通过修辞策略),并且LLMs在这种劝说形式中表现出色。总体而言,我们的结果提醒人们不要将类人心智理论归因于LLMs,同时突显了LLMs影响人们信念和行为的潜力。
cs.CL / 14 / 2602.17051

Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

评估跨语言分类方法以支持多语言社交媒体数据中的主题发现
Uniyal, Deepak, Bashar, Md Abul, Nayak, Richi
Abstract
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages. This study investigates how different approaches for cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. The online keyword-driven data collection results in a significant amount of irrelevant content. We explore four approaches to filter relevant content: (1) translating English annotated data into target languages for building language-specific models for each target language, (2) translating unlabelled data appearing from all languages into English for creating a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to each target language data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modeling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
Chinese Translation
分析多语言社交媒体话语仍然是自然语言处理中的一大挑战,特别是在大规模公共辩论跨越多种语言时。本研究探讨了不同的跨语言文本分类方法如何支持对全球对话的可靠分析。以氢能为案例,我们分析了一个涵盖十年(2013-2022)、超过九百万条英语、日语、印地语和韩语推文的数据集,用于主题发现。在线关键词驱动的数据收集导致了大量无关内容。我们探索了四种过滤相关内容的方法:(1)将英语注释数据翻译成目标语言,以构建每种目标语言的特定语言模型;(2)将来自所有语言的未标记数据翻译成英语,以创建基于英语注释的单一模型;(3)将经过微调的英语多语言变换器直接应用于每种目标语言的数据;(4)结合翻译注释与多语言训练的混合策略。我们评估了每种方法在从嘈杂的基于关键词的集合中过滤与氢相关的推文的能力。随后,进行主题建模以提取相关子集中占主导地位的主题。结果突显了翻译方法与多语言方法之间的关键权衡,为优化大规模社交媒体分析的跨语言流程提供了可行的见解。
cs.CL / 15 / 2602.17054

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning

ALPS:阿拉伯语言与语用推理的诊断挑战集
Al-Olimat, Hussein S., Alshareef, Ahmad
Abstract
While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.
Chinese Translation
尽管近期的阿拉伯自然语言处理基准测试关注于规模,但它们往往依赖于合成或翻译数据,这些数据可能需要更深入的语言学验证。我们介绍了ALPS(阿拉伯语言与语用套件),这是一个由专家精心策划的本土诊断挑战集,旨在探讨深层语义和语用能力,这些能力补充了专门的大规模基准测试。尽管广覆盖基准测试优先考虑规模和多任务覆盖,ALPS则通过531个严格设计的问题,涵盖15个任务和47个子任务,聚焦于语言理解的深度。我们在阿拉伯语言学方面拥有深厚的专业知识,确保了文化的真实性,并消除了翻译伪影。我们对23个不同的模型(商业、开源和阿拉伯本土模型)进行了评估,基于单次人类表现(平均准确率84.6%)和专家裁定的权威标准(99.2%),揭示了一个关键的分离现象:模型在流畅性上表现出色,但在基本的形态句法依赖上失败,在依赖于发音符号的任务中,形态句法依赖的错误率高达36.5%,而在组合语义上则相对较低。尽管顶级商业模型(Gemini-3-flash,94.2%)超越了平均单个人类表现,但商业巨头与阿拉伯本土模型之间仍存在显著差距,最佳的阿拉伯特定模型(Jais-2-70B,83.6%)接近但未能匹配人类表现。
cs.CL / 16 / 2602.17072

BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

BankMathBench:银行场景中的数值推理基准
Lee, Yunseung, Kim, Subin, Kwak, Youngjun, Choo, Jaegul
Abstract
Large language models (LLMs)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations-including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors-misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. However, such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty-basic, intermediate, and advanced-corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
Chinese Translation
基于大型语言模型(LLMs)的聊天机器人在金融领域,尤其是在数字银行中,越来越多地被用于处理客户关于存款、储蓄和贷款等产品的咨询。然而,这些模型在核心银行计算方面仍然表现出较低的准确性,包括总支付估算、不同利率产品的比较以及提前还款条件下的利息计算。这些任务需要多步骤的数值推理和对银行产品的上下文理解,但现有的LLMs往往会出现系统性错误——误解产品类型、错误应用条件或无法进行基本的指数和几何级数计算。然而,这些错误在现有基准中很少被捕捉到。数学数据集主要集中于基础数学问题,而金融基准主要针对金融文件,导致日常银行场景未得到充分探索。为了解决这一局限性,我们提出了BankMathBench,一个反映现实银行任务的特定领域数据集。BankMathBench按难度分为三个级别——基础、中级和高级,分别对应单一产品推理、多产品比较和多条件场景。在BankMathBench上进行训练的开源LLMs在公式生成和数值推理准确性方面表现出显著改善,证明了该数据集在增强特定领域推理方面的有效性。通过工具增强的微调,这些模型在基础(平均准确率提高57.6个百分点)、中级(提高75.1个百分点)和高级(提高62.9个百分点)任务上相较于零样本基线取得了显著提升。这些发现突显了BankMathBench作为评估和推进LLMs在现实银行场景中数值推理能力的可靠基准的价值。
cs.CL / 17 / 2602.17108

Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests

使用主题知觉测试对大型多模态模型进行投射心理评估
Dzega, Anton, Elyashar, Aviad, Slobodin, Ortal, Cohen, Odeya, Puzis, Rami
Abstract
Thematic Apperception Test (TAT) is a psychometrically grounded, multidimensional assessment framework that systematically differentiates between cognitive-representational and affective-relational components of personality-like functioning. This test is a projective psychological framework designed to uncover unconscious aspects of personality. This study examines whether the personality traits of Large Multimodal Models (LMMs) can be assessed through non-language-based modalities, using the Social Cognition and Object Relations Scale - Global (SCORS-G). LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), who assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses. Their interpretations are highly consistent with those of human experts. Assessment results highlight that all models understand interpersonal dynamics very well and have a good grasp of the concept of self. However, they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models consistently outperforming smaller and earlier ones across SCORS-G dimensions.
Chinese Translation
主题知觉测试(Thematic Apperception Test, TAT)是一种基于心理测量的多维评估框架,系统地区分了人格功能的认知-表征和情感-关系成分。该测试是一种投射心理框架,旨在揭示人格的无意识方面。本研究考察了大型多模态模型(Large Multimodal Models, LMMs)的个性特征是否可以通过非语言基础的方式进行评估,使用社会认知与对象关系量表-全球版(Social Cognition and Object Relations Scale - Global, SCORS-G)。LMMs在两个不同的角色中被使用:作为主体模型(Subject Models, SMs),根据TAT图像生成故事;作为评估模型(Evaluator Models, EMs),使用SCORS-G框架评估这些叙述。评估者表现出卓越的理解和分析TAT反应的能力,他们的解释与人类专家的高度一致。评估结果强调所有模型对人际动态的理解非常好,并且对自我概念有良好的把握。然而,它们在感知和调节攻击性方面始终存在不足。不同模型家族的表现系统性地存在差异,较大且更新的模型在SCORS-G维度上始终优于较小和较早的模型。
cs.CL / 18 / 2602.17127

The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

实验驱动的对齐特征的出现:用于审计生成性人工智能中潜在偏见和复合风险的心理测量框架
Bosnjakovic, Dusan
Abstract
As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the "prevailing mindsets" embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory -- specifically latent trait estimation under ordinal uncertainty -- to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent "lab signal" accounts for significant behavioral clustering. These findings demonstrate that in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.
Chinese Translation
随着大型语言模型(LLMs)从独立的聊天界面转变为多智能体系统和递归评估循环中的基础推理层(LLM-as-a-judge),检测持久的、提供者级别的行为特征成为安全性和治理的关键需求。传统基准测量瞬时任务准确性,但未能捕捉稳定的、潜在的响应策略——在训练和对齐过程中嵌入的“主导心态”,这些心态超越了单个模型版本。本文提出了一种新颖的审计框架,利用心理测量理论——特别是在序数不确定性下的潜在特征估计——来量化这些倾向,而无需依赖真实标签。通过使用被语义正交诱饵掩盖的强制选择式序数情境题(vignettes),并受密码学置换不变性的约束,研究审计了九个领先模型在优化偏见、阿谀奉承和现状合法化等维度上的表现。通过混合线性模型(MixedLM)和组内相关系数(ICC)分析,研究发现,尽管题项层面的框架效应驱动了高方差,但持久的“实验室信号”解释了显著的行为聚类。这些发现表明,在“锁定”的提供者生态系统中,潜在偏见不仅仅是静态错误,而是可能在多层次人工智能架构中造成递归式意识形态回声室的复合变量。
cs.CL / 19 / 2602.17194

What Makes a Good Doctor Response? An Analysis on a Romanian Telemedicine Platform

什么样的医生回应是好的?对罗马尼亚远程医疗平台的分析
Cosma, Adrian, Dumitrache, Cosmin, Radoi, Emilian
Abstract
Text-based telemedicine has become a common mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of 77,334 anonymised patient question--doctor response pairs, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that patient and clinician history features dominate prediction, functioning as strong priors, while characteristics of the response text provide a smaller but, crucially, actionable signal. In subgroup correlation analyses, politeness and hedging are consistently positively associated with patient feedback, whereas lexical diversity shows a negative association.
Chinese Translation
基于文本的远程医疗已成为一种常见的护理模式,要求临床医生以清晰有效的书面形式提供医疗建议。随着平台越来越依赖患者评分和反馈,临床医生面临着维持满意度评分的压力,尽管这些评估往往更多地反映了沟通质量而非临床准确性。我们分析了罗马尼亚基于文本的远程医疗中的患者满意度信号。使用77,334对匿名患者提问与医生回应的样本,我们将反馈建模为二元结果,将点赞回应视为正面反馈,并将负面或缺失反馈归为另一类。我们提取了可解释的、主要是语言无关的特征(例如,长度、结构特征、可读性代理),以及可用的罗马尼亚LIWC心理语言学特征和礼貌/模糊标记。我们使用基于时间的划分训练分类器,并进行SHAP分析,结果表明患者和临床医生的历史特征在预测中占主导地位,作为强先验,而回应文本的特征提供了较小但至关重要的可操作信号。在子组相关性分析中,礼貌和模糊性与患者反馈始终呈正相关,而词汇多样性则显示出负相关。
cs.CL / 20 / 2602.17262

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

量化与减轻大型语言模型中的社会期望反应:一项符合期望的分级强迫选择心理测量研究
Okada, Kensuke, Furukawa, Yui, Bunji, Kyosuke
Abstract
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers-a form of socially desirable responding (SDR)-biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
Chinese Translation
人类自我报告问卷在自然语言处理(NLP)中越来越多地用于基准测试和审计大型语言模型(LLMs),涵盖从人格一致性到安全性和偏见评估等多个方面。然而,这些工具假设受访者会诚实作答;在评估环境中,LLMs可能倾向于选择社会上更受欢迎的答案——即社会期望反应(SDR)——从而使问卷得分和后续结论产生偏差。我们提出了一种心理测量框架,以量化和减轻LLMs问卷评估中的SDR。为了量化SDR,我们在HONEST与FAKE-GOOD指令下施测相同的问卷,并将SDR计算为来自项目反应理论(IRT)估计的潜在得分的方向修正标准化效应量。这使得跨构念和反应格式的比较成为可能,并且可以与人类的指令伪造基准进行对比。为了减轻SDR,我们通过约束优化从项目池中选择30对跨领域的项目以匹配期望值,构建了一个分级强迫选择(GFC)的大五人格问卷。在对九个指令微调的LLMs(基于具有已知目标画像的合成人格)进行评估时,Likert风格的问卷始终显示出较大的SDR,而期望值匹配的GFC则显著减轻了SDR,同时在很大程度上保留了对预期人格画像的恢复。这些结果突显了依赖于模型的SDR-恢复权衡,并为基于问卷的LLMs基准测试和审计提出了SDR感知的报告实践。
cs.CL / 21 / 2602.17283

Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

跨语言价值评估的探索:共识-多元主义视角
Chen, Yukun, Zhang, Xinyu, Tang, Jialong, Wan, Yu, Yang, Baosong, Li, Yiming, Qin, Zhan, Ren, Kui
Abstract
While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz's Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment ($Acc < 77\%$), with significant performance disparities across different languages ($\Delta Acc > 20\%$). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: https://huggingface.co/datasets/Whitolf/X-Value.
Chinese Translation
尽管大型语言模型(LLMs)在内容安全方面变得至关重要,但当前的评估范式主要集中于检测显性的危害(例如,暴力或仇恨言论),而忽视了数字内容中传达的更微妙的价值维度。为了解决这一问题,我们引入了X-Value,一个新颖的跨语言价值评估基准,旨在评估LLMs从全球视角评估内容深层价值的能力。X-Value包含超过5000个问答对,覆盖18种语言,系统地组织为基于施瓦茨(Schwartz)基本人类价值理论的7个核心领域,并分为简单和困难两个级别以便于区分性评估。我们进一步提出了一个独特的两阶段注释框架,首先识别一个问题是否属于全球共识(例如,人权)或多元主义(例如,宗教),然后对内容中潜在的价值进行多方评估。对X-Value的系统评估表明,当前的最先进LLMs在跨语言价值评估中存在不足($Acc < 77\%$),并且不同语言之间的性能差异显著($\Delta Acc > 20\%$)。这项工作突显了迫切需要提高LLMs对细致、价值意识内容评估能力的必要性。我们的X-Value可在以下网址获取:https://huggingface.co/datasets/Whitolf/X-Value。
cs.CL / 22 / 2602.17287

Representation Collapse in Machine Translation Through the Lens of Angular Dispersion

从角度分散的视角看机器翻译中的表示崩溃
Tokarchuk, Evgeniia, Nachesa, Maya K., Troshin, Sergey, Niculae, Vlad
Abstract
Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
Chinese Translation
基于Transformer架构的现代神经翻译模型以其高性能而闻名,尤其是在高资源数据集上训练时。尽管标准的下一个标记预测训练策略在实践中被广泛采用,但可能导致被忽视的伪影,如表示崩溃。先前的研究表明,这一问题在深层Transformer层的表示中尤为明显,常常未能有效利用几何空间。在端到端的连续输出神经机器翻译的训练中,表示崩溃更为明显,其中平凡解就是将所有向量设为相同的值。在本研究中,我们分析了在训练过程中离散和连续NMT Transformer的不同层次上表示崩溃的动态。我们结合了一种基于角度分散的现有正则化方法,并通过实证证明它不仅减轻了崩溃现象,还提高了翻译质量。此外,我们还展示了量化模型表现出类似的崩溃行为,并且正则化的好处在量化后仍然得以保留。
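The angular-dispersion idea can be made concrete with a minimal pure-Python sketch (hypothetical, not the authors' implementation, which operates on Transformer hidden states during training): dispersion here is one minus the mean pairwise cosine similarity, so a fully collapsed set of representations scores zero and a regularizer that maximizes this quantity pushes vectors apart on the hypersphere.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def angular_dispersion(vectors):
    """Angle-based dispersion: 1 - mean pairwise cosine similarity.
    Collapsed representations (all vectors identical) give 0."""
    n = len(vectors)
    sims = [cosine(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)

collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(angular_dispersion(collapsed))  # 0.0 -> fully collapsed
print(angular_dispersion(spread))     # larger value -> dispersed
```

A training-time regularizer would subtract this quantity (times a weight) from the loss, rewarding batches whose representations are spread out.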
cs.CL / 23 / 2602.17316

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

相同含义,不同评分:大型语言模型评估中的词汇和句法敏感性
Kostić, Bogdan, Fallon, Conor, Risch, Julian, Löser, Alexander
Abstract
The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.
Chinese Translation
大型语言模型(LLMs)的快速发展使得标准化评估基准成为模型比较的主要工具。然而,由于对输入提示中浅层变化的敏感性,其可靠性正受到越来越多的质疑。本文研究了受控的、真值条件等价的词汇和句法扰动如何影响23个当代LLM在三个基准(MMLU、SQuAD和AMEGA)上的绝对性能和相对排名。我们采用两种语言学原则驱动的流程生成保留意义的变体:一种通过同义词替换进行词汇变化,另一种使用依存句法分析确定适用的句法变换。结果表明,词汇扰动在几乎所有模型和任务中都持续引起显著的性能下降,而句法扰动的影响则更为异质,偶尔会改善结果。两种扰动类型在复杂任务上都使模型排行榜不稳定。此外,模型的鲁棒性并未随着模型规模的增加而一致提升,显示出强烈的任务依赖性。总体而言,研究结果表明,LLMs更依赖于表层词汇模式而非抽象语言能力,强调了将鲁棒性测试作为LLM评估标准组成部分的必要性。
cs.CL / 24 / 2602.17366

RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

RPDR:基于往返预测的数据增强框架用于长尾问答
Zhang, Yiming, Zhang, Siyue, Zhao, Junbo, Zhao, Chen
Abstract
Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism to dynamically route queries to specialized retrieval modules to further improve retrieval performance.
Chinese Translation
长尾问答对大型语言模型(LLMs)提出了重大挑战,因为它们在获取和准确回忆不常见知识方面的能力有限。检索增强生成(RAG)系统通过整合外部检索机制,在缓解这一限制方面显示出了巨大的潜力。然而,密集检索模型在推广到稀有或小众知识时常常面临相同的困难。在本研究中,我们介绍了RPDR,一种新颖的数据增强框架,旨在选择高质量、易于学习的训练数据,以增强密集检索器。我们的方法围绕三个核心组件构建:合成数据生成、通过往返预测选择易于学习的实例,以及使用这些实例进行检索器训练。我们在两个长尾检索基准(PopQA和EntityQuestion)上评估了RPDR,显示出相较于现有检索器(如BM25和Contriever)的显著改进,尤其是在极端长尾类别上。通过详细的人工分析,我们识别了RPDR的优势与局限,并提出了一种动态路由机制,以动态地将查询路由到专门的检索模块,从而进一步提高检索性能。
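The round-trip selection step can be illustrated with a toy sketch (all names and the word-overlap retriever below are hypothetical stand-ins for the paper's dense-retrieval pipeline): a synthetic (question, passage) pair is kept only if the retriever returns the source passage for its own question, a proxy for "easy-to-learn" data.

```python
def round_trip_select(candidates, retrieve, top_k=1):
    """Keep synthetic (question, passage) pairs whose passage is
    retrieved back for its own question (round-trip prediction)."""
    kept = []
    for question, passage in candidates:
        if passage in retrieve(question, top_k):
            kept.append((question, passage))
    return kept

# Toy lexical retriever over a tiny corpus (stand-in for a dense model).
corpus = ["the eiffel tower is in paris",
          "mount fuji is in japan",
          "the nile flows through egypt"]

def retrieve(query, top_k):
    overlap = lambda doc: len(set(query.split()) & set(doc.split()))
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

candidates = [("where is the eiffel tower", corpus[0]),
              ("what country is mount fuji in", corpus[1]),
              ("who built the pyramids", corpus[2])]
print(len(round_trip_select(candidates, retrieve)))  # 2 pairs survive
```

The third pair is dropped because its question shares no distinctive terms with its passage, so the round trip fails.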
cs.CL / 25 / 2602.17377

The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour

可得性启发在多项选择回答行为中的作用
Zotos, Leonidas, van Rijn, Hedderik, Nissim, Malvina
Abstract
When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts' prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs' frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.
Chinese Translation
当学生对多项选择题(MCQ)的正确答案不确定时,猜测是常见的做法。可得性启发(availability heuristic),由A. Tversky和D. Kahneman于1973年提出,表明相关实例在脑海中浮现的容易程度(通常通过单纯的接触频率来操作化),可以为考生在不知道确切答案的问题中提供心理捷径。仅仅选择最容易想到的选项是否是回答MCQ的良好策略?我们提出了一种评估MCQ选项认知可得性的计算方法,该方法通过概念在大型语料库中的流行程度进行操作化。我们的关键发现是,在三个大型题集的研究中,正确答案的可得性显著高于错误的MCQ选项,且与题干无关。具体而言,使用维基百科作为检索语料库,我们发现始终选择最可得的选项的得分比随机猜测的基线高出13.5%至32.9%。我们进一步发现,尽管LLMs具有基于频率的特性并在大规模文本数据上训练,LLM生成的MCQ选项与专家创建的选项在可得性方面仍显示出类似的模式。我们的研究结果表明,在当前和未来的工作中,在计算建模学生行为时应考虑可得性。
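The "always pick the most available option" strategy can be sketched as follows; a toy corpus stands in for the paper's Wikipedia-scale prevalence counts, and operationalising availability as raw term frequency is a simplification assumed here.

```python
from collections import Counter

# Toy "retrieval corpus"; the paper uses Wikipedia-scale prevalence.
corpus = (
    "paris is the capital of france . paris hosts the louvre . "
    "lyon is a city in france . paris paris"
).split()
freq = Counter(corpus)

def most_available(options):
    """Availability proxy: pick the option whose term is most frequent
    in the corpus (ties broken by option order)."""
    return max(options, key=lambda o: freq[o.lower()])

print(most_available(["Lyon", "Paris", "Marseille"]))  # Paris
```

Because "paris" dominates the toy corpus, the heuristic selects it regardless of the question stem, mirroring the stem-independent effect reported in the abstract.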
cs.CL / 26 / 2602.17424

Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

多样化的词汇选择,相同的指代:注释词汇丰富的跨文档指代关系
Zhukova, Anastasia, Hamborg, Felix, Donnay, Karsten, Meuschke, Norman, Gipp, Bela
Abstract
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse, while maintaining the fine-grained annotation of DEs. We reannotate the NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
Chinese Translation
跨文档指代消解(CDCR)识别并链接相关文档中相同实体和事件的提及,从而实现按话语参与者层面聚合信息的内容分析。然而,现有数据集主要集中于事件消解,并采用狭义的指代定义,这限制了它们在分析用词差异显著的多样化和极化新闻报道中的有效性。本文提出了一种修订的NewsWCL50数据集的CDCR注释方案,将指代链视为话语元素(DEs)和分析的概念单元。该方法兼顾同一性和近同一性关系,例如,通过链接“车队” - “寻求庇护者” - “那些考虑非法入境的人”,使模型能够捕捉媒体话语中的词汇多样性和框架变化,同时保持DEs的细粒度注释。我们使用统一的编码手册重新注释了NewsWCL50和ECB+的一个子集,并通过词汇多样性指标和同中心词词元基线评估新数据集。结果表明,重新注释后的数据集高度一致,其特性介于原始的ECB+与NewsWCL50之间,从而支持新闻领域中平衡且关注话语的CDCR研究。
cs.CL / 27 / 2602.17425

Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

评估极低资源机器翻译:ChrF++与BLEU指标的比较研究
Kumar, Sanjeev, Jyothi, Preethi, Bhattacharyya, Pushpak
Abstract
Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
Chinese Translation
在极低资源语言(ELRL)场景中评估机器翻译(MT)质量面临独特挑战,因为像BLEU这样在高资源环境中行之有效的常用指标,在数据稀缺的背景下往往会错误地反映质量。本研究对BLEU这一基于n-gram的指标与ChrF++这一基于字符的指标在ELRL环境中的机器翻译评估进行了比较分析。我们考察了每个指标对翻译伪影的响应,包括幻觉、重复、源文本复制以及在三个极低资源语言(Magahi、Bhojpuri和Chhattisgarhi)中的变音符号(matra)变化,重点关注大型语言模型(LLMs)和神经机器翻译(NMT)系统的输出。尽管近期的研究通常仅依赖于ChrF++,我们的发现表明,尽管BLEU的绝对分数较低,但它提供了互补的词汇精确度见解,从而提高了可解释性。
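The character-level robustness that separates ChrF++ from BLEU can be illustrated with a simplified sentence-level chrF (character n-grams only; real ChrF++ also mixes in word bigrams and aggregates at corpus level, so this is a sketch of the metric family rather than the official implementation): a diacritic-style spelling variation zeroes out an exact word match but keeps most character n-gram credit.

```python
from collections import Counter

def ngrams(seq, n):
    """Multiset of n-grams of a sequence (string -> char n-grams)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def chrf(hyp, ref, max_order=3, beta=2.0):
    """Simplified sentence-level chrF: averaged char n-gram F-beta score."""
    scores = []
    for n in range(1, max_order + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())      # clipped n-gram matches
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores)

ref = "kitab acha hai"
hyp = "kitaab acha hai"          # spelling variant of the first word
print(round(chrf(hyp, ref), 3))  # high: char n-grams mostly survive
word_overlap = len(set(hyp.split()) & set(ref.split())) / len(ref.split())
print(word_overlap)              # only 2/3 exact word overlap
```

This is why word-level metrics like BLEU penalize such variants far more heavily than character-level ones.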
cs.CL / 28 / 2602.17431

Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

长文本语言模型输出的细粒度不确定性量化:比较研究
Bouchard, Dylan, Chauhan, Mohit Singh, Bajaj, Viren, Skarbrevik, David
Abstract
Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
Chinese Translation
不确定性量化已成为一种有效的方法,用于检测大型语言模型(LLMs)的闭卷幻觉,但现有方法主要针对短文本输出,难以很好地推广到长文本生成。我们提出了一种针对长文本LLM输出的细粒度不确定性量化的分类法,该分类法根据三个阶段的设计选择对方法进行区分:响应分解、单元级评分和响应级聚合。我们形式化了几类基于一致性的黑箱评分器,提供了现有方法的推广和扩展。在我们对多个LLM和数据集的实验中,我们发现1)声明-响应蕴含的表现始终优于或与更复杂的声明级评分器相当,2)声明级评分通常优于句子级评分,3)关注不确定性的解码在提高长文本输出的事实性方面非常有效。我们的框架阐明了先前方法之间的关系,使得能够进行公正的比较,并为选择细粒度不确定性量化的组件提供了实用指导。
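The three-stage taxonomy (response decomposition, unit-level scoring, response-level aggregation) can be sketched with a consistency-based scorer. All names are illustrative, and the substring "entailment" below is a toy stand-in for an NLI model; the `agg=min` choice is one possible aggregation, flagging a response by its least-supported claim.

```python
def consistency_score(claim, samples, entails):
    """Unit-level score: fraction of sampled responses entailing the claim."""
    return sum(entails(s, claim) for s in samples) / len(samples)

def response_uncertainty(claims, samples, entails, agg=min):
    """Response-level aggregation: 1 - aggregated claim support."""
    supports = [consistency_score(c, samples, entails) for c in claims]
    return 1.0 - agg(supports)

# Toy entailment check: substring containment stands in for NLI.
entails = lambda text, claim: claim in text
samples = ["paris is in france and has 2m people",
           "paris is in france",
           "paris is the capital of france"]
claims = ["paris is in france", "has 2m people"]
print(response_uncertainty(claims, samples, entails))  # driven by the weak claim
```

The population claim is supported by only one of three samples, so the response-level uncertainty is high even though the other claim is well supported.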
cs.CL / 29 / 2602.17443

AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

AIDG:评估多轮对话中信息提取与信息包含之间的非对称性
Sakhawat, Adib, Sadab, Fardeen, Shahriar, Rakin
Abstract
Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
Chinese Translation
评估大型语言模型(LLMs)的战略推理能力需要超越静态基准,转向动态的多轮互动。我们引入了AIDG(对抗性信息推理游戏),这是一个博弈论框架,探讨对话中信息提取(主动推理)与信息包含(状态维护)之间的非对称性。我们提出了两个互补的任务:AIDG-I,测量社会推理中的务实策略,以及AIDG-II,在结构化的“20个问题”设置中测量约束满足。在与六个前沿LLM进行的439场游戏中,我们观察到明显的能力非对称性:模型在信息包含方面的表现显著优于信息提取,防御时具有350 ELO的优势(Cohen's d = 5.47)。我们识别出导致这一差距的两个瓶颈:(1)信息动态性,其中确认策略的有效性是盲目推理的7.75倍(p < 0.00001);(2)约束遵循,在对话负载下,遵循指令的能力下降,导致41.3%的推理失败。这些发现表明,尽管LLMs在局部防御一致性方面表现出色,但在进行战略询问所需的全局状态跟踪方面却面临挑战。
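The effect size quoted above (Cohen's d = 5.47) is a standardized mean difference; a minimal sketch with hypothetical rating samples shows the computation with a pooled standard deviation (the samples below are invented for illustration, not the paper's data).

```python
import math

def cohens_d(a, b):
    """Cohen's d: mean difference over pooled SD (unbiased variances)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

# Hypothetical per-game ratings on defense vs deduction.
defense = [1650, 1700, 1675, 1720]
deduction = [1300, 1350, 1325, 1370]
print(round(cohens_d(defense, deduction), 2))
```

Values of d above roughly 0.8 are conventionally called "large", so effect sizes in the 5+ range indicate almost non-overlapping distributions.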
cs.CL / 30 / 2602.17445

ABCD: All Biases Come Disguised

ABCD:所有偏见都以伪装的形式出现
Nowak, Mateusz, Cadet, Xavier, Chin, Peter
Abstract
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in the mean model's performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.
Chinese Translation
多项选择题(MCQ)基准测试一直是评估大型语言模型(LLMs)推理能力和回答基于知识问题能力的标准实践。通过合成的 NonsenseQA 基准,我们观察到不同的 LLMs 显示出不同程度的标签-位置-少样本提示偏见,其中模型要么使用答案位置,要么使用答案前的标签,要么使用少样本提示中正确答案的分布,或者将所有这些因素结合起来回答每个 MCQ 问题。我们提出了一种简单的减少偏见的评估协议,该协议用统一的无序标签替换每个问题的标签,并提示 LLM 使用呈现的完整答案。通过一个简单的句子相似度模型,我们展示了在不同答案排列之间提高了鲁棒性,并且标准差降低,LLM 的性能仅有微小下降,揭示了 LLM 在减少评估伪影下的能力,而不依赖于提示示例或选项标签。在多个基准和模型中,该协议显著提高了对答案排列的鲁棒性,使平均准确率方差降低至原来的三分之一($3\times$),而模型性能仅有微小下降。通过对各种嵌入模型和相似度函数的消融研究,我们表明该方法比标准方法更具鲁棒性。
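The label-free protocol can be sketched with a bag-of-words cosine standing in for the paper's sentence-similarity model (all names and the example question are hypothetical): the model's free-form answer is matched against the full text of each option, so neither option labels nor positions can carry signal.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for a sentence embedder)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_option(model_answer, options):
    """Score the full text of each option against the model's free-form
    answer; no labels or positions are used, only content."""
    return max(options, key=lambda o: bow_cosine(model_answer, o))

options = ["the mitochondria produces ATP",
           "the nucleus stores genetic material",
           "the ribosome synthesizes proteins"]
answer = "energy in the form of ATP is produced by the mitochondria"
print(match_option(answer, options))
```

Because scoring depends only on the answer text, permuting the options cannot change which one is selected, which is the robustness property the protocol targets.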
cs.CL / 31 / 2602.17465

Entropy-Based Data Selection for Language Models

基于熵的数据选择用于语言模型
Li, Hongming, Liu, Yang, Huang, Chao
Abstract
Modern language models (LMs) increasingly require two critical resources: computation and data. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely tied to computational resources, typically demanding a high compute budget. Owing to resource limitations in practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of the selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, providing new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework, which establishes a computationally efficient data-filtering mechanism. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks, together with theoretical analysis, confirm its effectiveness. EUDS significantly reduces computational costs and improves training-time efficiency with a smaller data requirement, providing an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
Chinese Translation
现代语言模型(LMs)越来越依赖于两种关键资源:计算资源和数据资源。数据选择技术可以有效减少微调语言模型所需的训练数据量。然而,它们的有效性与计算资源密切相关,而计算资源通常需要高昂的预算。由于在实际微调场景中的资源限制,我们系统性地揭示了数据选择与所选数据的不确定性估计之间的关系。尽管大型语言模型(LLMs)在语言理解和生成方面展现出卓越的能力,为缓解数据稀缺提供了新的方法,但评估数据的可用性仍然是一项具有挑战性的任务。这使得高效的数据选择变得不可或缺。为了解决这些问题,我们提出了基于熵的无监督数据选择(EUDS)框架。在情感分析(SA)、主题分类(Topic-CLS)和问答(Q&A)任务上的实证实验验证了其有效性。EUDS建立了一种计算高效的数据过滤机制。理论分析和实验结果确认了我们方法的有效性。EUDS显著降低了计算成本,提高了训练时间效率,同时减少了数据需求。这为在计算受限场景中高效微调语言模型提供了一种创新解决方案。
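The core idea of entropy-based selection can be sketched in a few lines; the scoring rule, function names, and toy distributions below are illustrative assumptions, not the paper's actual EUDS pipeline:

```python
# A minimal sketch of entropy-based data selection: score each example by
# the Shannon entropy of a model's predictive distribution, then keep the
# k highest-entropy (most uncertain, hence most informative) examples.
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_by_entropy(examples, predict, k: int):
    """Keep the k examples the model is most uncertain about."""
    ranked = sorted(examples, key=lambda x: entropy(predict(x)), reverse=True)
    return ranked[:k]

# Toy 'model': a fixed predictive distribution per example id.
dists = {
    "easy": [0.97, 0.01, 0.01, 0.01],  # confident -> low entropy
    "hard": [0.25, 0.25, 0.25, 0.25],  # uniform   -> maximum entropy
    "mid":  [0.60, 0.30, 0.05, 0.05],
}
picked = select_by_entropy(list(dists), dists.__getitem__, k=2)
```

Because the score needs only one forward pass per example and no labels, the filter stays cheap relative to fine-tuning on the full dataset.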
cs.CL / 32 / 2602.17467

PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions

PEACE 2.0:针对仇恨言论的实证解释与反言论工具
Damo, Greta, Petiot, Stéphane, Cabrio, Elena, Villata, Serena
Abstract
The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground HS explanations into evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) exploring the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.
Chinese Translation
在线平台上仇恨言论的增加给社会带来了重大挑战。尽管自然语言处理领域已经开发出有效的方法来自动检测仇恨言论的存在,但对其的回应,即反言论,仍然是一个未解决的挑战。我们提出了PEACE 2.0,这是一种新颖的工具,除了分析和解释为何某条信息被视为仇恨言论外,还能够生成相应的回应。更具体地说,PEACE 2.0具有三个主要的新功能:利用检索增强生成(Retrieval-Augmented Generation, RAG)管道 i) 将仇恨言论(HS)解释与证据和事实相结合,ii) 自动生成基于证据的反言论,iii) 探索反言论回复的特征。通过整合这些功能,PEACE 2.0 能够对显性和隐性仇恨信息进行深入分析和回应生成。
cs.CL / 33 / 2602.17469

Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers

审计互惠情感对齐:变换器中的反转风险、方言表现和意图不一致
Lia, Nusrat Jahan, Dipta, Shubhashis Roy
Abstract
The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior. However, this loop fractures significantly across language barriers. Our research addresses Cross-Lingual Sentiment Misalignment between Bengali and English by benchmarking four transformer architectures. We reveal severe safety and representational failures in current alignment paradigms. We demonstrate that compressed model (mDistilBERT) exhibits 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive user intent as negative (or vice versa). Furthermore, we identify systemic nuances affecting human-AI trust, including "Asymmetric Empathy" where some models systematically dampen and others amplify the affective weight of Bengali text relative to its English counterpart. Finally, we reveal a "Modern Bias" in the regional model (IndicBERT), which shows a 57% increase in alignment error when processing formal (Sadhu) Bengali. We argue that equitable human-AI co-evolution requires pluralistic, culturally grounded alignment that respects language and dialectal diversity over universal compression, which fails to preserve the emotional fidelity required for reciprocal human-AI trust. We recommend that alignment benchmarks incorporate "Affective Stability" metrics that explicitly penalize polarity inversions in low-resource and dialectal contexts.
Chinese Translation
双向对齐的核心主题是确保人工智能系统准确理解人类意图,并使人类能够信任人工智能的行为。然而,这一循环在语言障碍下显著破裂。我们的研究通过基准测试四种变换器架构,解决了孟加拉语和英语之间的跨语言情感不一致问题。我们揭示了当前对齐范式中存在严重的安全性和表现性失败。我们展示了压缩模型(mDistilBERT)表现出28.7%的“情感反转率”,根本上将积极的用户意图误解为消极(或反之亦然)。此外,我们识别出影响人类与人工智能信任的系统性细微差别,包括“非对称同理心”,其中某些模型系统性地削弱,而其他模型则增强孟加拉文本相对于其英语对应文本的情感权重。最后,我们揭示了区域模型(IndicBERT)中的“现代偏见”,在处理正式(Sadhu)孟加拉语时,显示出57%的对齐错误增加。我们认为,公平的人类与人工智能共同进化需要多元化的、文化根植的对齐,尊重语言和方言多样性,而不是普遍的压缩,这种压缩未能保持互惠人类与人工智能信任所需的情感忠实度。我们建议对齐基准应纳入明确惩罚低资源和方言背景下极性反转的“情感稳定性”指标。
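The "Sentiment Inversion Rate" the abstract reports can be sketched as a simple pairwise metric; the label set and toy data below are illustrative assumptions, not the paper's protocol:

```python
# A toy sketch of a sentiment-inversion metric: the fraction of examples
# whose predicted polarity is the exact opposite of the reference polarity,
# which is the failure an "Affective Stability" metric would penalize.
OPPOSITE = {"positive": "negative", "negative": "positive"}

def inversion_rate(reference: list[str], predicted: list[str]) -> float:
    """Share of examples whose polarity is flipped outright."""
    flips = sum(OPPOSITE.get(r) == p for r, p in zip(reference, predicted))
    return flips / len(reference)

ref  = ["positive", "negative", "positive", "neutral", "negative"]
pred = ["negative", "negative", "positive", "neutral", "positive"]
rate = inversion_rate(ref, pred)
```

Note that plain accuracy would treat a flip and a neutral miss identically; counting only outright inversions isolates the safety-relevant error mode.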
cs.CL / 34 / 2602.17475

Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

小型语言模型在医学自然语言处理中的应用:对意大利语中的少量示例、约束解码、微调和持续预训练的系统分析
Ferrazzi, Pietro, Franzin, Mattia, Lavelli, Alberto, Magnini, Bernardo
Abstract
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families-Llama-3, Gemma-3, and Qwen3-across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.
Chinese Translation
大型语言模型(LLMs)在多种医学自然语言处理(NLP)任务中表现出色,但其巨大的计算需求往往限制了在实际医疗环境中的应用。在本研究中,我们探讨了“较小”的语言模型(约十亿参数)是否能够有效执行医学任务,同时保持竞争力的准确性。我们评估了来自三个主要模型家族——Llama-3、Gemma-3和Qwen3——在20个临床NLP任务中的表现,包括命名实体识别、关系提取、病例报告表填写、问答和论证挖掘。我们系统地比较了一系列适应策略,包括推理时的少量示例提示和约束解码,以及训练时的监督微调和持续预训练。微调被证明是最有效的方法,而少量示例提示与约束解码的结合则提供了强有力的低资源替代方案。我们的结果表明,小型语言模型可以与更大的基线模型相匹配,甚至超越它们,其中基于Qwen3-1.7B的最佳配置平均得分比Qwen3-32B高出9.2分。我们发布了所有公开可用的意大利医学NLP任务数据集的全面集合,以及我们的最佳模型。此外,我们还发布了来自意大利医院急诊科的126M词的意大利语数据集,以及用于持续预训练的来自多个来源的175M词的数据集。
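The constraint-decoding strategy evaluated here can be sketched at its simplest; the scores and label vocabulary below are toy assumptions, and a real decoder would apply the mask at every generation step:

```python
# A toy sketch of constraint decoding for classification-style outputs:
# instead of free generation, the decoder is restricted to a fixed label
# vocabulary and returns the allowed label with the highest model score.
def constrained_decode(scores: dict[str, float], allowed: set[str]) -> str:
    """Greedy choice restricted to the allowed label set."""
    candidates = {k: v for k, v in scores.items() if k in allowed}
    return max(candidates, key=candidates.get)

# Illustrative next-token scores; "maybe" wins unconstrained but is not a
# valid label, so the constraint forces a well-formed answer.
scores = {"yes": 1.2, "no": 0.7, "maybe": 2.5, "<eos>": 0.1}
label = constrained_decode(scores, allowed={"yes", "no"})
```

This is why the abstract pairs it with few-shot prompting as a low-resource alternative to fine-tuning: it guarantees parseable outputs at inference time without any training.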
cs.CL / 35 / 2602.17513

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

跨越领域鸿沟:从 MIMIC-III 到产科的监督学习与零样本临床部分分割
Karacan, Baris, Di Eugenio, Barbara, Thornton, Patrick
Abstract
Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.
Chinese Translation
临床自由文本笔记包含重要的患者信息。这些笔记被结构化为标记部分;识别这些部分已被证明有助于临床决策和下游自然语言处理任务。在本文中,我们通过三项关键贡献推进临床部分分割。首先,我们整理了一个新的去标识化、部分标记的产科笔记数据集,以补充公共语料库(如 MIMIC-III)中涵盖的医学领域,而大多数现有的分割方法都是在这些数据集上训练的。其次,我们系统地评估了基于变换器的监督模型在 MIMIC-III 的整理子集(领域内)和新的产科数据集(领域外)上的部分分割性能。第三,我们首次对医疗部分分割的监督模型与零样本大型语言模型进行了正面比较。我们的结果表明,虽然监督模型在领域内表现强劲,但在领域外的性能显著下降。相比之下,零样本模型在纠正了幻觉部分标题后显示出强大的领域外适应性。这些发现强调了开发领域特定临床资源的重要性,并突出了零样本分割作为将医疗自然语言处理应用于研究较少的语料库的有希望方向,只要适当地管理幻觉问题。
cs.CL / 36 / 2602.17542

Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

利用大型语言模型进行开放式编码问题中的知识组件级正确性标注
Duan, Zhangqi, Kankaria, Arnav, Kartik, Dhruv, Lan, Andrew
Abstract
Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
Chinese Translation
细粒度技能表示,通常称为知识组件(Knowledge Components, KCs),是学生建模和学习分析中许多方法的基础。然而,现实世界数据集中很少有KC级别的正确性标签,尤其是在开放式编程任务中,解决方案通常涉及多个KC的同时应用。简单地将问题级别的正确性传播到所有相关的KC会掩盖部分掌握情况,并且往往导致学习曲线拟合不良。为了解决这一挑战,我们提出了一个自动化框架,利用大型语言模型(Large Language Models, LLMs)直接从学生编写的代码中标注KC级别的正确性。我们的方法评估每个KC是否被正确应用,并进一步引入了一种时间上下文感知的代码-KC映射机制,以更好地将KC与个别学生的代码对齐。我们使用实践的幂律和加性因素模型评估所得到的KC级别正确性标签在学习曲线拟合和预测性能方面的表现。实验结果表明,与基线相比,我们的框架生成的学习曲线与认知理论更为一致,并提高了预测性能。人类评估进一步表明LLM与专家注释之间存在显著一致性。
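The power law of practice used above to judge label quality has a compact form; the constants below are illustrative assumptions, with the qualitative point being that well-assigned KC labels should trace curves close to this shape:

```python
# A small sketch of the power law of practice: error rate on a knowledge
# component decays roughly as a power of the number of practice
# opportunities. Initial error and decay rate here are illustrative.
def power_law_error(opportunity: int, initial_error: float = 0.8,
                    decay: float = 0.5) -> float:
    """Predicted error rate after `opportunity` practice attempts."""
    return initial_error * opportunity ** (-decay)

# Error predictions after 1, 4, and 16 opportunities.
curve = [round(power_law_error(k), 3) for k in (1, 4, 16)]
```

Propagating problem-level correctness to every KC flattens such curves, which is exactly the poor fit the KC-level labels are meant to repair.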
cs.CL / 37 / 2602.17546

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

学习保持安全:针对微调过程中安全退化的自适应正则化
Goel, Jyotin, Maji, Souvik, Mazumder, Pratik
Abstract
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
Chinese Translation
遵循指令的语言模型被训练为有帮助且安全,但它们的安全行为在良性微调下可能会恶化,并在对抗性更新中进一步恶化。现有的防御措施通常提供有限的保护,或者迫使安全性与效用之间进行权衡。我们提出了一种训练框架,该框架根据安全风险自适应调整正则化,使模型在微调过程中保持一致性。为了在训练时估计安全风险,我们探索了两种不同的方法:基于评估者的安全评论员(Safety Critic),为训练批次分配高层次的伤害评分,以及基于激活的风险预测器,该预测器使用轻量级分类器构建,训练于中间模型激活,以估计有害意图。每种方法提供的风险信号用于约束被认为风险较高的更新,使其保持接近安全参考策略,而风险较低的更新则按照标准训练进行。我们实证验证了有害意图信号可以从生成前的激活中预测,并且评估者评分提供了有效的高召回率安全指导。在多个模型系列和攻击场景中,采用任一风险估计方法的自适应正则化相比于标准微调,始终降低了攻击成功率,保持了下游性能,并且没有增加推理时的成本。这项工作展示了一种保持安全而不牺牲效用的原则机制。
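The risk-gated update rule can be sketched as a scalar loss; the gating scheme, threshold, and weights below are illustrative assumptions rather than the paper's exact formulation:

```python
# Minimal sketch of risk-gated regularization: a risk score in [0, 1] from
# a safety critic or activation probe scales a penalty keeping high-risk
# updates near a safe reference policy; low-risk updates train normally.
def adaptive_loss(task_loss: float, divergence_from_ref: float,
                  risk: float, threshold: float = 0.5,
                  max_weight: float = 10.0) -> float:
    """High-risk batches get a strong pull toward the reference policy."""
    weight = max_weight * risk if risk >= threshold else 0.0
    return task_loss + weight * divergence_from_ref

# Same batch statistics, different estimated risk.
safe_batch  = adaptive_loss(task_loss=2.0, divergence_from_ref=0.3, risk=0.1)
risky_batch = adaptive_loss(task_loss=2.0, divergence_from_ref=0.3, risk=0.9)
```

Because the gate acts only during training, it matches the abstract's claim of adding no inference-time cost.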
cs.CL / 38 / 2602.17588

Modeling Distinct Human Interaction in Web Agents

建模网络代理中的人类互动特征
Huq, Faria, Wang, Zora Zhiruo, Guo, Zhanqiu, Arangarajan, Venu Arvind, Ou, Tianyue, Xu, Frank, Zhou, Shuyan, Neubig, Graham, Bigham, Jeffrey P.
Abstract
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.
Chinese Translation
尽管自主网络代理技术迅速发展,但人类的参与仍然是塑造偏好和纠正代理行为的关键,尤其是在任务进行过程中。然而,当前的代理系统缺乏对人类何时以及为何干预的原则性理解,常常在关键决策点上自主推进或请求不必要的确认。在本研究中,我们引入了建模人类干预的任务,以支持协作网络任务的执行。我们收集了CowCorpus数据集,该数据集包含400个真实用户的网络导航轨迹,涵盖了超过4200个交错的人类和代理行为。我们识别出用户与代理互动的四种不同模式——放手监督、亲自监督、协作解决任务和完全用户接管。基于这些洞察,我们训练了语言模型(LMs),以预测用户何时可能干预,依据他们的互动风格,干预预测准确率比基础语言模型提高了61.4%-63.4%。最后,我们在实时网络导航代理中部署了这些关注干预的模型,并在用户研究中对其进行了评估,发现用户对代理有用性的评分提高了26.5%。总的来说,我们的结果表明,结构化的人类干预建模能够导致更具适应性和协作性的代理。
cs.CL / 39 / 2602.17598

The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

级联等价假设:何时语音大型语言模型表现得像ASR$\rightarrow$LLM管道?
Billa, Jayadev
Abstract
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
Chinese Translation
当前的语音大型语言模型(LLMs)在很大程度上执行隐式自动语音识别(ASR):在可通过转录解决的任务中,它们在行为和机制上与简单的Whisper$\to$LLM级联相当。我们通过在四个语音LLM和六个任务中进行匹配骨干测试,首次控制LLM骨干,来证明这一点。Ultravox在统计上与其匹配级联无显著区别($\kappa=0.93$);logit lens揭示了隐藏状态中出现的字面文本;LEACE概念消除确认文本表示在测试的两种架构中都是因果必要的,导致准确率降至接近零。Qwen2-Audio则真实地表现出差异,揭示级联等价性依赖于架构,而非普遍适用。对于大多数已部署的使用案例,当前的语音LLMs是昂贵的级联,且在噪声环境下表现更差,在清洁条件下的优势在0 dB时反转高达7.6%。
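The behavioral-equivalence comparison behind the reported $\kappa{=}0.93$ can be sketched with Cohen's kappa; the labels below are toy data and the paper's exact protocol may differ:

```python
# A small sketch of the behavioral-equivalence check: agreement between a
# speech LLM's answers and its matched ASR->LLM cascade's answers, scored
# with Cohen's kappa so that chance agreement is discounted.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

speech_llm = ["yes", "no", "yes", "yes", "no", "no"]
cascade    = ["yes", "no", "yes", "yes", "no", "yes"]
kappa = cohens_kappa(speech_llm, cascade)
```

A kappa near 1 on such paired outputs is what licenses the claim that the speech LLM is behaviorally interchangeable with its cascade.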
cs.CL / 40 / 2602.17623

Unmasking the Factual-Conceptual Gap in Persian Language Models

揭示波斯语言模型中的事实-概念差距
Sakhaeirad, Alireza, Ma'manpoosh, Ali, Hemmat, Arshia
Abstract
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
Chinese Translation
尽管新兴的波斯自然语言处理基准已经扩展到语用学和礼貌性,但它们很少区分记忆的文化事实与对隐含社会规范进行推理的能力。我们引入了DivanBench,这是一个专注于迷信和习俗的诊断性基准,涉及那些抵制简单逻辑推导的任意、依赖于上下文的规则。通过315个问题,涵盖三种任务类型(事实检索、配对情境验证和情境推理),我们评估了七个波斯大语言模型,并揭示了三个关键失败:大多数模型表现出严重的顺从偏差,能够正确识别适当的行为,但未能拒绝明显的违规行为;持续的波斯语言预训练加剧了这种偏差,而不是改善推理,往往降低了模型识别矛盾的能力;所有模型在检索事实知识与在情境中应用这些知识之间显示出21%的性能差距。这些发现表明,文化能力不仅仅依赖于单语数据的扩展,因为当前模型学习模仿文化模式,而未能内化其背后的框架。
cs.CL / 41 / 2602.17653

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

语言模型在差异性论元标记处理中的类型对齐差异
Deng, Iskar, Xu, Nathalia, Steinert-Threlkeld, Shane
Abstract
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
Chinese Translation
近期的研究表明,基于合成语料库训练的语言模型(LMs)能够表现出与人类语言中的跨语言规律相似的类型偏好,特别是在诸如词序等句法现象方面。本文将这一范式扩展到差异性论元标记(DAM),这是一种语义许可系统,其中形态标记依赖于语义突出性。我们使用受控的合成学习方法,在18个实施不同DAM系统的语料库上训练GPT-2模型,并通过最小对比对其泛化能力进行评估。我们的结果揭示了DAM的两个类型维度之间的解离。模型可靠地表现出与人类相似的自然标记方向偏好,倾向于选择那些显性标记针对语义上不典型论元的系统。相反,模型并未重现人类语言中强烈的宾语偏好,在DAM中,显性标记更常针对宾语而非主语。这些发现表明,不同的类型倾向可能源于不同的基础来源。
cs.CL / 42 / 2602.17655

What Language is This? Ask Your Tokenizer

这是什么语言?问问你的分词器
Meister, Clara, Yavuz, Ahmetcan, Lesci, Pietro, Pimentel, Tiago
Abstract
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.
Chinese Translation
语言识别(Language Identification, LID)是许多多语言自然语言处理管道中的一个重要组成部分,它有助于语料库的整理、训练数据的分析以及大型语言模型的跨语言评估。尽管在高资源语言上表现近乎完美,但现有系统在低资源和密切相关语言环境中仍然脆弱。我们提出了UniLID,这是一种基于UnigramLM分词算法的简单高效的LID方法,利用其概率框架、参数估计技术和推理策略。简而言之,我们学习在共享的分词器词汇上条件于语言的单元分布,但将分段视为一种特定于语言的现象。我们的公式在数据和计算上高效,支持在不重新训练现有模型的情况下增量添加新语言,并且可以自然地集成到现有的语言模型分词管道中。与广泛使用的基准(包括fastText、GlotLID和CLD3)进行的实证评估表明,UniLID在标准基准上实现了具有竞争力的性能,在低资源环境中显著提高了样本效率——在每种语言仅有五个标记样本的情况下,准确率超过70%——并在细粒度方言识别上取得了显著提升。
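The language-conditional unigram scoring at the heart of the method can be sketched compactly; the whitespace tokenization, smoothing constant, and toy vocabulary below are illustrative assumptions standing in for a shared subword tokenizer:

```python
# A minimal sketch in the spirit of UniLID: each language gets an
# add-alpha-smoothed unigram distribution over a shared token vocabulary,
# and a text is assigned to the language maximizing its summed token
# log-probability.
import math
from collections import Counter

def train_unigram(texts: list[str], vocab: set[str], alpha: float = 0.5):
    """Smoothed unigram probabilities over the shared vocabulary."""
    counts = Counter(t for text in texts for t in text.split())
    total = sum(counts[v] for v in vocab) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def identify(text: str, models: dict[str, dict[str, float]]) -> str:
    def score(lang: str) -> float:
        p = models[lang]
        # Crude floor for out-of-vocabulary tokens.
        return sum(math.log(p.get(t, min(p.values()))) for t in text.split())
    return max(models, key=score)

vocab = {"the", "cat", "sat", "le", "chat", "est", "assis"}
models = {
    "en": train_unigram(["the cat sat", "the cat"], vocab),
    "fr": train_unigram(["le chat est assis", "le chat"], vocab),
}
lang = identify("the cat sat", models)
```

Adding a new language here is just fitting one more count table over the shared vocabulary, which mirrors the abstract's claim of incremental extension without retraining existing models.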
cs.CL / 43 / 2602.17664

Sink-Aware Pruning for Diffusion Language Models

面向汇聚的扩散语言模型剪枝
Myrzakhan, Aidar, Li, Tianyi, Guo, Bowei, Tang, Shengkun, Shen, Zhiqiang
Abstract
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose ${\bf \texttt{Sink-Aware Pruning}}$, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
Chinese Translation
扩散语言模型(DLMs)由于迭代去噪而产生高昂的推理成本,这促使了高效剪枝的需求。现有的剪枝启发式方法主要继承自自回归(AR)大语言模型(LLMs),通常保留注意力汇聚标记,因为AR汇聚作为稳定的全局锚点。我们表明这一假设在DLMs中并不成立:注意力汇聚位置在整个生成轨迹上表现出显著更高的方差(通过主导汇聚位置在时间步长之间的变化来衡量),这表明汇聚通常是短暂的,并且在结构上不如AR模型中的汇聚那样重要。基于这一观察,我们提出了${\bf \texttt{面向汇聚的剪枝}}$,该方法自动识别并剪枝DLMs中的不稳定汇聚(以往研究通常为AR LLMs保留汇聚)。在不进行重训练的情况下,我们的方法实现了更好的质量-效率权衡,并在匹配计算条件下超越了强大的先前剪枝基线。我们的代码可在https://github.com/VILA-Lab/Sink-Aware-Pruning获取。
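The sink-stability measurement motivating the method can be sketched on toy data; the per-step attention masses and the shift-rate metric below are illustrative assumptions, not the paper's exact statistic:

```python
# A toy sketch of sink stability across a denoising trajectory: at each
# timestep, find the token position receiving the most total attention
# (the dominant "sink"), then measure how often that position shifts.
def dominant_sink(column_mass: list[float]) -> int:
    """Index of the position with the highest incoming attention mass."""
    return max(range(len(column_mass)), key=column_mass.__getitem__)

def sink_shift_rate(trajectory: list[list[float]]) -> float:
    """Fraction of consecutive timesteps whose dominant sink moves."""
    sinks = [dominant_sink(step) for step in trajectory]
    shifts = sum(a != b for a, b in zip(sinks, sinks[1:]))
    return shifts / (len(sinks) - 1)

# AR-like trajectory: the sink stays pinned to position 0 at every step.
ar_like  = [[0.9, 0.05, 0.05]] * 4
# DLM-like trajectory: the dominant sink wanders across positions.
dlm_like = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6], [0.6, 0.2, 0.2]]

ar_rate, dlm_rate = sink_shift_rate(ar_like), sink_shift_rate(dlm_like)
```

A high shift rate marks a sink as transient, which is precisely what licenses pruning it in DLMs where the same heuristic would be unsafe for AR models.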