Daily Research Digest

arXiv Papers

2026-04-27: 145 papers, 4 categories, 145 translated
机器人学 (Robotics): 21 papers
cs.RO / 1 / 2604.22040

Robust Localization for Autonomous Vehicles in Highway Scenes

高速公路场景下自主车辆的鲁棒定位
Cheng, Daqian, Ding, Xuchu, Wu, Yujia, Zhang, Xiang, Wang, Lei
Abstract
Localization for autonomous vehicles on highways remains under-explored compared to urban roads, and state-of-the-art methods for urban scenes degrade when directly applied to highways. We identify key challenges including environment changes under information homogeneity, heavy occlusion, degraded GNSS signals, and stringent downstream requirements on accuracy and latency. We propose a robust localization system to address highway challenges, which uses a dual-likelihood LiDAR front end that decouples 3D geometric structures and 2D road-texture cues to handle environment changes; a Control-EKF further leverages steering and acceleration commands to reduce lag and improve closed-loop behavior. An automated offline mapping and ground-truth pipeline keeps maps fresh at high cadence for optimal localization performance. To catalyze progress, we release a public dataset covering both urban roads and highways while focusing on representative challenging highway clips, totaling 163 km; benchmarking is standardized using product-oriented accuracy metrics and certified ground truth. Compared to Apollo and Autoware, our system performs similarly on urban roads but shows superior robustness on challenging highway scenarios. The system has been validated by more than one million kilometers of road testing.
Chinese Translation
与城市道路相比,高速公路上自主车辆的定位研究仍然未得到充分探索,当前针对城市场景的最先进方法在直接应用于高速公路时效果降低。我们识别出关键挑战,包括信息同质性下的环境变化、严重遮挡、GNSS信号衰退,以及对准确性和延迟的严格下游要求。为应对高速公路的挑战,我们提出了一种鲁棒定位系统,该系统利用双似然激光雷达(LiDAR)前端,将三维几何结构与二维道路纹理线索解耦,以应对环境变化;控制扩展卡尔曼滤波(Control-EKF)进一步利用转向和加速度指令以减少延迟并改善闭环行为。自动化的离线建图与真值生成流水线以高频率保持地图更新,为最佳定位性能提供支持。为了促进进展,我们发布了一个公开数据集,覆盖城市道路和高速公路,并集中在具有代表性的具有挑战性的高速公路片段,总长度达到163公里;基准测试采用产品导向的准确性指标和经认证的真实值。与Apollo和Autoware相比,我们的系统在城市道路上的表现相似,但在具有挑战性的高速公路场景中显示出更卓越的鲁棒性。该系统已通过超过一百万公里的道路测试进行验证。
cs.RO / 2 / 2604.22065

SNGR: Selective Non-Gaussian Refinement for Ambiguous SLAM Factor Graphs

SNGR:针对模糊SLAM因子图的选择性非高斯细化
Kulkarni, Anushka, Dubey, Sarthak
Abstract
We present Selective Non-Gaussian Refinement (SNGR), a SLAM framework that augments iSAM2 with targeted nested sampling on windows where Gaussian approximations are likely to fail. We detect such regions using the condition number of joint marginal covariances and selectively refine them using the full nonlinear factor graph likelihood, with a gating mechanism to avoid degradation in multimodal cases. Experiments on range-only SLAM with wrong data association show that SNGR achieves high-precision failure detection and consistent local likelihood improvements while reducing computational cost relative to exhaustive non-Gaussian inference. These results highlight both the promise and the limitations of selective refinement for approximate SLAM posteriors.
Chinese Translation
我们提出了选择性非高斯细化(SNGR),这是一种增强iSAM2的SLAM框架,通过在高斯近似可能失败的窗口上进行针对性的嵌套采样来实现。我们使用联合边际协方差的条件数检测此类区域,并利用全非线性因子图似然进行选择性细化,同时采用门控机制以避免在多模态情况下的性能下降。在仅使用范围数据的SLAM实验中,尽管存在错误的数据关联,SNGR在故障检测的高精度和局部似然一致性提升方面表现优异,同时相较于全面的非高斯推断减少了计算成本。这些结果突显了选择性细化在近似SLAM后验中的潜力与局限性。
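Illustrative sketch (not from the paper): the abstract says SNGR detects windows where the Gaussian approximation is likely to fail by looking at the condition number of joint marginal covariances. A minimal version of that detection step might look like the following. The function name and the threshold value of 1e6 are assumptions made for illustration, and the nested-sampling refinement and gating steps are omitted entirely.

```python
import numpy as np

def flag_ill_conditioned_windows(covariances, threshold=1e6):
    """Return indices of windows whose joint marginal covariance is ill-conditioned.

    `threshold` is a hypothetical value; the abstract does not state one.
    """
    flagged = []
    for i, cov in enumerate(covariances):
        # 2-norm condition number: largest singular value over smallest.
        if np.linalg.cond(cov) > threshold:
            flagged.append(i)
    return flagged

well = np.diag([1.0, 2.0, 0.5])    # condition number 4
poor = np.diag([1.0, 1e-9, 0.5])   # condition number 1e9
print(flag_ill_conditioned_windows([well, poor]))  # prints [1]
```

Only the flagged windows would then be handed to a more expensive non-Gaussian sampler, which is where the computational saving over exhaustive inference would come from.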
cs.RO / 3 / 2604.22102

Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation

扭动并出发!零样本动态绳索操作的系统识别
Jakobsson, Arthur, Mahajan, Abhinav, Pullalarevu, Karthik, Suresh, Krishna, Yao, Yunchao, Mao, Yuemin, Duisterhof, Bardienus, Syed, Shahram Najam, Ichnowski, Jeffrey
Abstract
Many robotic tasks are unforgiving; a single mistake in a dynamic throw can lead to unacceptable delays or unrecoverable failure. To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal-conditioned dynamic manipulation of ropes for efficient and accurate task execution. Related methods for dynamic rope manipulation either require large real-world datasets to estimate rope behavior or rely on iterative improvement over repeated attempts at the task. We introduce Wiggle and Go!, a two-stage system-identification framework that enables zero-shot rope manipulation. The framework consists of a system identification module that observes rope movement to predict descriptive physical parameters, which then inform an optimization method for goal-conditioned action prediction that the robot executes zero-shot in the real world. Our method achieves strong performance across multiple dynamic manipulation tasks enabled by the same task-agnostic system identification module, which offers seamless switching between different manipulation tasks, allowing a single model to support a diverse array of manipulation policies. We achieve a 3.55 cm average accuracy on 3D target striking in the real world using rope system parameters, compared to 15.34 cm accuracy when our task model is not system-parameter-informed. We achieve a Pearson correlation coefficient of 0.95 between Fourier frequencies of the predicted and real ropes on an unseen trajectory. Project website: https://wiggleandgo.github.io/
Chinese Translation
许多机器人任务对错误不宽容;在动态投掷中,一次失误可能导致不可接受的延迟或无法恢复的失败。为了解决这一问题,我们提出了一种新方法,利用学习到的模拟先验来指导绳索的目标条件动态操作,以实现高效和准确的任务执行。相关的动态绳索操作方法通常需要大量的真实世界数据集来估计绳索行为,或者需要对任务进行反复改进以完成目标。我们引入了扭动并出发!一种系统识别的两阶段框架,能够实现零样本任务绳索操作。该框架包括一个系统识别模块,它观察绳索运动以预测描述性物理参数,进而为优化方法提供信息,以进行目标条件的行动预测,从而使机器人在实际中实现零样本操作。我们的方法在多个动态操作任务中表现出色,得益于同一无任务特定的系统识别模块,能够在不同的操作任务之间无缝切换,使单一模型能够支持多样化的操作策略。在使用绳索系统参数的真实3D目标打击中,我们的平均准确度达到3.55厘米,而在没有系统参数信息时我们的任务模型的准确度仅为15.34厘米。在一个未见轨迹上,预测绳索和真实绳索的傅里叶频率之间的皮尔逊相关系数达到0.95。项目网站请见 https://wiggleandgo.github.io/
cs.RO / 4 / 2604.22104

Dynamic Coupling and Indirect Control of Jointed Robots Rolling Atop A Moving Platform

动态耦合及间接控制滚动在移动平台上的关节机器人
Moradi, Hamidreza, Kelly, Scott David
Abstract
An asymmetric two-link robot supported atop a flat platform by wheels that roll and pivot freely, but do not slip laterally, will develop forward momentum if the joint between the links is actuated internally. In particular, oscillations in the joint angle will generate undulatory locomotion suggesting fishlike swimming. If two such robots surmount a common platform that's free to translate with its own inertial dynamics, then the individual robots' dynamics will be coupled so that the locomotion of either robot is affected by that of the other. We develop a mathematical model for this system and present simulations demonstrating its behavior. We then consider a single robot with an unactuated joint rolling atop a platform that moves under control, and show that actuation of the platform is sufficient to dictate the robot's behavior. In particular, with the acceleration of the platform as an input, the robot's heading can be made to track a chosen function of time. This is sufficient to guarantee that the robot can be induced to orbit a fixed point on the platform or to locomote persistently in a desired direction.
Chinese Translation
一种不对称的双链机器人,通过自由滚动和枢转而不横向滑动的轮子支撑在一个平坦的平台上,如果链节之间的关节内部激活,将会产生向前的动量。特别是,关节角度的振荡将生成如鱼类游泳一般的波状运动。如果两个这样的机器人共同置于一个具有自身惯性动力学的自由移动平台上,它们的动态将相互耦合,从而使得任一机器人的运动受到另一机器人的影响。我们为该系统开发了数学模型,并展示了其行为的仿真结果。接着,我们考虑一个具有未激活关节的单个机器人,在一个受控移动的平台上滚动,并表明平台的激活足以决定机器人的行为。特别是,利用平台的加速度作为输入,机器人的航向可以追踪一个选定的时间函数。这足以保证机器人能够被诱导绕平台上的固定点轨道运动或持续朝向所需方向移动。
cs.RO / 5 / 2604.22152

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

dWorldEval:通过离散扩散世界模型进行可扩展的机器人策略评估
Li, Yaxuan, Zhou, Zhongyi, Chen, Yefei, Xue, Yaokai, Zhu, Yichen
Abstract
Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches. This motivates the need for a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities, including vision, language, and robotic actions, into a unified token space, modeling them via a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, so success can be determined automatically when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, i.e., WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
Chinese Translation
在现有的方法中,评估成千上万环境和任务下的机器人策略是不可行的。这促使我们需要一种新的方法来实现可扩展的机器人策略评估。本文提出了dWorldEval,它使用离散扩散世界模型作为机器人策略的可扩展评估代理。具体而言,dWorldEval将所有模态——包括视觉、语言和机器人动作——映射到一个统一的token空间,通过一个基于变换器的去噪网络对它们进行建模。基于这一架构,我们采用稀疏关键帧内存以保持时空一致性。我们还引入了一个进度token,指示任务完成的程度。在推理阶段,模型共同预测未来的观察结果和进度token,从而在进度达到1时自动判断成功。大量实验证明,dWorldEval在LIBERO、RoboTwin和多个真实机器人任务中显著优于先前的方法,即WorldEval、Ctrl-World和WorldGym。这为在大规模机器人评估中构建世界模拟器开辟了一条新的架构范式。
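Illustrative sketch (not from the paper): the abstract describes a rollout in which the world model predicts the next observation together with a progress token, and success is declared once progress reaches 1. The loop below shows only that control flow. `step_fn` stands in for the discrete diffusion world model, and `toy_step` and `toy_policy` are hypothetical placeholders rather than anything the authors describe.

```python
def evaluate_rollout(step_fn, policy, obs, max_steps=50, done_threshold=1.0):
    """Roll a policy out inside a world model and stop when progress reaches the threshold.

    step_fn(obs, action) -> (next_obs, progress) is an assumed interface.
    Returns (success, number_of_steps_taken).
    """
    for t in range(max_steps):
        action = policy(obs)
        obs, progress = step_fn(obs, action)
        if progress >= done_threshold:
            return True, t + 1
    return False, max_steps

# Toy stand-ins: the "observation" is a single number that the action increments.
def toy_step(obs, action):
    next_obs = obs + action
    return next_obs, min(next_obs, 1.0)   # progress is clipped to [0, 1]

def toy_policy(obs):
    return 0.25

print(evaluate_rollout(toy_step, toy_policy, 0.0))  # prints (True, 4)
```

Because the success signal comes from the model itself, no hand-written success detector is needed per task, which is what would make evaluation across many tasks cheap.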
cs.RO / 6 / 2604.22189

Energy-Efficient Multi-Robot Coverage Path Planning of Non-Convex Regions of Interests

非凸区域兴趣的节能多机器人覆盖路径规划
Raxit, Sourav, Fuentes, Jose, Padrao, Paulo, Newaz, Abdullah Al Redwan, Hoque, Md Tamjidul, Kulp, Mark, Bobadilla, Leonardo
Abstract
This letter presents an energy-efficient multi-robot coverage path planning (MRCPP) framework for large, nonconvex Regions of Interest (ROI) containing obstacles and no-fly zones (NFZ). Existing minimum-energy coverage planning algorithms utilize meta-heuristic boustrophedon workspace decomposition. Therefore, even with minimum energy objectives and energy consumption constraints, they cannot achieve optimal energy efficiency. Moreover, most existing frameworks support only a single type of robotic platform. MRCPP overcomes these limitations by: performing globally informed swath generation, creating parallel sweeping paths with minimal turns, calculating safety buffers to ensure safe turning clearance, using an efficient mTSP solver to balance workloads and minimize mission time, and connecting disjoint segments via a modified visibility graph that tracks heading angles while maintaining transitions within safe regions. The efficacy of the proposed MRCPP framework is demonstrated through real-world experiments involving autonomous aerial vehicles (AAVs) and autonomous surface vehicles (ASVs). Evaluations demonstrate that the proposed MRCPP consistently outperforms state-of-the-art planners, reducing average total energy consumption by 3% to 40% for a team of 3 robots and computation time by an order of magnitude, while maintaining balanced workload distribution and strong scalability across increasing fleet sizes. The MRCPP framework is released as an open-source package and videos of real-world and simulated experiments are available at https://mrc-pp.github.io.
Chinese Translation
本文提出了一种节能的多机器人覆盖路径规划(MRCPP)框架,适用于包含障碍物和禁飞区(NFZ)的大型非凸感兴趣区域(ROI)。现有的最小能量覆盖规划算法采用元启发式的牛耕式(boustrophedon)工作空间分解。因此,即使在最小能量目标和能耗约束下,它们也无法实现最佳能量效率。此外,大多数现有框架仅支持单一类型的机器人平台。MRCPP通过以下方式克服了这些限制:进行基于全局信息的扫掠带生成,创建最小转弯的并行扫掠路径,计算安全缓冲区以确保安全的转向间隙,使用高效的 mTSP 求解器来平衡工作负载并最小化任务时间,并通过修改的可见性图连接不相交的段,在保持过渡在安全区域内的同时跟踪航向角。通过涉及自主航空器(AAV)和自主水面船舶(ASV)的实际实验展示了所提出的 MRCPP 框架的有效性。评估表明,所提出的 MRCPP 一直优于最先进的规划器,对于3台机器人的团队将平均总能耗减少了3%至40%,并将计算时间缩短了一个数量级,同时保持工作负载的均衡分配和随着机队规模增加的强可扩展性。MRCPP框架以开源包的形式发布,实际和模拟实验的视频可以访问 https://mrc-pp.github.io。
cs.RO / 7 / 2604.22196

V-STC: A Time-Efficient Multi-Vehicle Coordinated Trajectory Planning Approach

V-STC:一种时间高效的多车辆协调轨迹规划方法
Liu, Pengfei, Zhou, Jialing, Lv, Yuezu, Wen, Guanghui, Huang, Tingwen
Abstract
Coordinating the motions of multiple autonomous vehicles (AVs) requires planning frameworks that ensure safety while making efficient use of space and time. This paper presents a new approach, termed variable-time-step spatio-temporal corridor (V-STC), that enhances the temporal efficiency of multi-vehicle coordination. An optimization model is formulated to construct a V-STC for each AV, in which both the spatial configuration of the corridor cubes and their time durations are treated as decision variables. By allowing the corridor's spatial position and time step to vary, the constructed V-STC reduces the overall temporal occupancy of each AV while maintaining collision-free separation in the spatio-temporal domain. Based on the generated V-STC, a dynamically feasible trajectory is then planned independently for each AV. Simulation studies demonstrate that the proposed method achieves safe multi-vehicle coordination and yields more time-efficient motion compared with existing STC approaches.
Chinese Translation
协调多辆自主车辆(AV)的运动需要确保安全同时高效利用空间和时间的规划框架。本文提出了一种新的方法,称为可变时间步长时空走廊(Variable-Time-Step Spatio-Temporal Corridor,V-STC),旨在增强多车辆协调的时间效率。我们构建了一个优化模型,为每辆AV构造V-STC,其中走廊立方体的空间配置和时间持续时间都被视为决策变量。通过允许走廊的空间位置和时间步长变化,构建的V-STC在保持时空域中无碰撞分离的同时,减少了每辆AV的总体时间占用。基于生成的V-STC,接着为每辆AV独立规划一个动态可行的轨迹。仿真研究表明,所提出的方法实现了安全的多车辆协调,并与现有的STC方法相比,产生了更为时间高效的运动。
cs.RO / 8 / 2604.22199

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

面向开放环境中未覆盖任务的基于大语言模型驱动的闭环自主学习框架
Su, Hong
Abstract
Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
Chinese Translation
在开放环境中自主运行的机器人需要具备持续处理未被预定义本地方法覆盖的任务的能力。然而,现有的方法常常依赖于对未覆盖任务的反复大语言模型(LLM)交互,即使成功的执行或观察到的成功外部行为也并不总是能够被自主转化为可重复使用的本地知识。本文提出了一种面向开放环境中未覆盖任务的基于LLM驱动的闭环自主学习框架。该框架首先检索本地方法库,以确定当前任务或观察事件是否已经存在可重复使用的解决方案。如果找不到合适的方法,它将触发一个自主学习过程,在该过程中,LLM作为任务分析、高层次推理组件,进行候选模型选择、数据收集规划以及执行或观察策略的组织。机器人随后从自我执行和主动观察中学习,进行准实时的训练和调整,并将验证结果整合到本地方法库中以供将来重用。通过这一重复的闭环过程,机器人逐步将执行派生和观察派生的经验转化为可重复使用的本地能力,同时减少未来对反复外部LLM交互的依赖。结果表明,所提框架在重复任务自执行和观察驱动设置中都减少了执行时间和对LLM的依赖,例如在重复任务自执行实验中,平均总执行时间从7.7772秒减少到6.7779秒,平均每个任务的LLM调用次数从1.0减少到0.2。
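Illustrative sketch (not from the paper): the closed loop in the abstract first looks up a local method library, falls back to the LLM only when nothing reusable is found, and then consolidates the result for later reuse. The class below captures just that lookup-then-consolidate pattern. The names `MethodLibrary` and `llm_solver` are hypothetical, and the training, validation, and observation-driven steps are left out.

```python
class MethodLibrary:
    """Local store of reusable methods with an LLM fallback (hypothetical interface)."""

    def __init__(self, llm_solver):
        self._methods = {}
        self._llm_solver = llm_solver
        self.llm_calls = 0

    def handle(self, task):
        # Step 1: reuse a local method if one already covers this task.
        if task in self._methods:
            return self._methods[task]
        # Step 2: otherwise ask the LLM once and consolidate the result locally.
        self.llm_calls += 1
        method = self._llm_solver(task)
        self._methods[task] = method
        return method

library = MethodLibrary(lambda task: f"plan-for-{task}")
for _ in range(5):
    library.handle("open-drawer")
print(library.llm_calls / 5)  # prints 0.2
```

With five repeats of one task, only the first attempt reaches the LLM, so the average is 0.2 calls per task. This shows how a figure of that size can arise from reuse; it is not a reproduction of the paper's experiment, whose repeat count is not given in the abstract.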
cs.RO / 9 / 2604.22235

Learning-augmented robotic automation for real-world manufacturing

面向真实制造的学习增强型机器人自动化
Kim, Yunho, Nguyen, Quan, Kim, Taewhan, Heo, Youngjin, Lee, Joonho
Abstract
Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes. Learning-based control offers a more adaptive alternative, but it remains unclear whether such methods, still mostly confined to laboratory demonstrations, can sustain hours of reliable operation, deliver consistent quality, and behave safely around people on a live production line. Here we present Learning-Augmented Robotic Automation, a hybrid system that integrates learned task controllers and a neural 3D safety monitor into conventional industrial workflows. We deployed the system on an electric-motor production line to automate deformable cable insertion and soldering under real manufacturing constraints, a step previously performed manually by human workers. With less than 20 min of real-world data per task, the system operated continuously for 5 h 10 min, producing 108 motors without physical fencing and achieving a 99.4% pass rate on product-level quality-control tests. It maintained near-human takt time while reducing variability in solder-joint quality and cycle time. These results establish a practical pathway for extending industrial automation with learning-based methods.
Chinese Translation
工业机器人在制造业得到了广泛应用,但大多数操作仍然依赖于固定的路径点脚本,这些脚本在环境变化时表现出脆弱性。基于学习的控制方法提供了更具适应性的替代方案,但尚不清楚这些方法是否能够在实际生产线上实现数小时的可靠操作,交付一致的质量,并在与人合作时保持安全。在此,我们提出了增强学习机器人自动化(Learning-Augmented Robotic Automation),这是一种混合系统,将学习型任务控制器与神经3D安全监控器集成到传统工业工作流程中。我们在电动机生产线上部署了该系统,以自动化在真实制造约束下的可变形电缆插入和焊接,这一步骤之前由人工完成。每项任务仅需不到20分钟的真实数据,该系统连续操作了5小时10分钟,生产了108个电动机,在没有物理围栏的情况下实现了99.4%的产品质量控制测试合格率。它保持了接近人类的节拍时间,同时减少了焊点质量和周期时间的变异性。这些结果为利用基于学习的方法扩展工业自动化建立了一个实用的途径。
cs.RO / 10 / 2604.22238

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

CodeGraphVLP: 将代码作为规划者与语义图状态结合用于非马尔可夫视-语言-行动模型
Vo, Khoa, Tran, Sieu, Hanyu, Taisei, Ikebe, Yuki, Nguyen, Duy, Nghi, Bui Duy Quoc, Vu, Minh, Gunderman, Anthony, Rainwater, Chase, Nguyen, Anh, Le, Ngan
Abstract
Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.
Chinese Translation
视-语言-行动(VLA)模型承诺实现通用机器人操作,但通常被训练和部署为短期策略,假设最新的观察结果足以进行行动推理。在非马尔可夫的长期任务中,这一假设失效,因为任务相关的证据可能会被遮挡或仅在轨迹的早期出现,而杂物和干扰物使得精细的视觉定位变得脆弱。我们提出了CodeGraphVLP,这是一种分层框架,通过将持久的语义图状态与可执行的基于代码的规划器和进度引导的视-语言提示结合,实现可靠的长期操作。语义图在部分可观察条件下维护任务相关实体和关系。合成的规划器在该语义图上执行,以进行有效的进度检查,并输出子任务指令以及与子任务相关的对象。我们使用这些输出构建抑制杂物的观察结果,以使VLA执行器集中于关键证据。在真实世界的非马尔可夫任务中,CodeGraphVLP在任务完成率上超越了强大的VLA基线和历史支持的变体,同时相比于循环中的VLM规划显著降低了规划延迟。我们还进行了广泛的消融研究,以确认每个组件的贡献。
cs.RO / 11 / 2604.22244

Learning Control Policies to Provably Satisfy Hard Affine Constraints for Black-Box Hybrid Dynamical Systems

学习控制策略以可靠满足黑箱混合动态系统的严格仿射约束
Shrivastava, Aayushi, Nagpal, Kartik, Jinkala, Sairam, Bouvier, Jean-Baptiste, Mehr, Negar
Abstract
Ensuring safety for black-box hybrid dynamical systems presents significant challenges due to their instantaneous state jumps and unknown explicit nonlinear dynamics. Existing solutions for strict safety constraint satisfaction, like control barrier functions (CBFs) and reachability analysis, rely on direct knowledge of the dynamics. Similarly, safe reinforcement learning (RL) approaches often rely on known system dynamics or merely discourage safety violations through reward shaping. In this work, we want to learn RL policies which provably satisfy affine state constraints in closed loop for black-box hybrid dynamical systems with affine reset maps. Our key insight is forcing the RL policy to be affine and repulsive near the constraint boundaries for the unknown nonlinear dynamics of the system, providing guarantees that the trajectories will not violate the constraint. We further account for constraint violation due to instantaneous state jumps that occur due to impacts or reset maps in the hybrid system by introducing a second repulsive affine region before the reset that prevents post-reset states from violating the constraint. We derive sufficient conditions under which these policies satisfy safety constraints in closed loop. We also compare our approach with state-of-the-art reward shaping and learned-CBF methods on hybrid dynamical systems like the constrained pendulum and paddle juggler environments. In both scenarios, we show that our methodology learns higher quality policies while always satisfying the safety constraints.
Chinese Translation
确保黑箱混合动态系统的安全性面临重大挑战,原因在于其瞬时状态跳跃和未知的显式非线性动态。现有的严格安全约束满足方案,如控制障碍函数(Control Barrier Functions, CBFs)和可达性分析,依赖于对动态的直接知识。同样,安全强化学习(Reinforcement Learning, RL)方法往往依赖于已知的系统动态,或者仅通过奖励塑形来抑制安全违例。在本工作中,我们希望学习强化学习策略,能够在闭环中可靠地满足具有仿射重置映射的黑箱混合动态系统的仿射状态约束。我们的关键见解是强制强化学习策略在约束边界附近为仿射且具有排斥性,从而提供轨迹不会违反约束的保证。我们进一步考虑由于冲击或混合系统中的重置映射导致的瞬时状态跳跃所引发的约束违例,通过引入第二个排斥性仿射区域,在重置之前防止后重置状态违反约束。我们推导出这些策略在闭环中满足安全约束的充分条件。我们还将我们的方法与现有的奖励塑形和学习控制障碍函数(Learned CBF)方法在受限摆和拍球者环境等混合动态系统上进行比较。在这两种情况下,我们都表明我们的方法学习到的策略质量更高,同时始终满足安全约束。
cs.RO / 12 / 2604.22251

False Feasibility in Variable Impedance MPC for Legged Locomotion

腿部运动中变阻抗模型预测控制的虚假可行性
Ramesh, Vishal
Abstract
Variable impedance model predictive control (MPC) formulations that treat joint stiffness as an instantaneous decision variable operate on a feasible set strictly larger than the physically realizable set under first-order actuator dynamics. We identify this as a formulation error rather than a modeling approximation, formalize the distinction between the parameter-based feasible set F_param and the realizable set F_real, and characterize the regime of mismatch via the dimensionless parameter alpha = omega_s T (actuator bandwidth times task timescale). For the 1D hopping monoped, we prove that below an analytical threshold alpha_crit derived in closed form from task physics, no admissible stiffness command realizes the parameter-based prediction. Numerical validation in 1D shows monotonic deviation growth as alpha decreases, with the predicted scaling holding across ten parameter combinations (log-log R^2 = 0.99). Mechanism transfer to planar spring-loaded inverted pendulum dynamics confirms center-of-mass and stance-timing deviation as the primary consequence, with regime-dependent friction effects as a tertiary observable. A second threshold alpha_infeas < alpha_crit establishes a floor below which restricting the admissible stiffness range cannot repair realizability, closing the conservative-tuning objection on structural grounds. Augmenting the prediction state with stiffness closes the mismatch by construction.
Chinese Translation
将关节刚度视为瞬时决策变量的变阻抗模型预测控制(MPC)公式操作于一个严格大于在一阶驱动器动态下物理可实现集合的可行集合。我们将其识别为一种公式错误,而非建模近似,明确参数基础可行集合 F_param 与可实现集合 F_real 之间的区别,并通过无量纲参数 alpha = omega_s T(驱动器带宽与任务时间尺度的乘积)表征不匹配的状态。对于一维跳跃单足机器人,我们证明在从任务物理中推导出的解析阈值 alpha_crit 以下,没有可接受的刚度指令能够实现基于参数的预测。数值验证显示,当 alpha 减小时,单维跳跃时的偏差单调增长,且在十种参数组合下预测的尺度保持一致(对数-对数 R^2 = 0.99)。机制转移到平面弹簧加载的倒立摆动态上证实,重心和支撑时机的偏差是主要后果,而与状态相关的摩擦效应则为三级可观察量。第二个阈值 alpha_infeas < alpha_crit 确立了一个底线,在该底线以下限制可接受的刚度范围无法修复可实现性,从结构前提上消除了保守调优的异议。通过将刚度增加到预测状态中,从构造上消除了不匹配。
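Illustrative sketch (not from the paper): the abstract's core point is that a stiffness command cannot take effect instantly when the actuator has first-order dynamics, and that the mismatch is governed by alpha = omega_s T. Assuming the standard first-order lag dk/dt = omega_s (k_cmd - k), which is an assumption here since the abstract does not write the model out, a step command reaches only a fraction 1 - exp(-alpha) of the requested change within the task timescale T. The snippet checks that closed form against a simple Euler integration. It does not compute the paper's alpha_crit or alpha_infeas thresholds.

```python
import math

def realized_fraction(alpha):
    """Fraction of a commanded stiffness step realized after time T, with alpha = omega_s * T."""
    return 1.0 - math.exp(-alpha)

def simulate_first_order(k0, k_cmd, omega_s, T, steps=10000):
    """Euler-integrate dk/dt = omega_s * (k_cmd - k) from k0 over duration T."""
    dt = T / steps
    k = k0
    for _ in range(steps):
        k += dt * omega_s * (k_cmd - k)
    return k

# Hypothetical numbers: bandwidth 10 rad/s and task timescale 0.1 s give alpha = 1.
k_end = simulate_first_order(k0=100.0, k_cmd=200.0, omega_s=10.0, T=0.1)
print(round(realized_fraction(1.0), 3))  # prints 0.632
print(round(k_end, 1))                   # prints 163.2
```

An MPC that treats stiffness as instantaneous would plan as if the joint were at 200 for the whole interval, while under this model it ends the interval near 163, and the gap widens as alpha shrinks.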
cs.RO / 13 / 2604.22283

A Kinematic Analysis of Palm Degrees of Freedom for Enhancing Thumb Opposability in Robotic Hands

提高机器人手拇指对立性的掌部自由度运动学分析
Kang, HyoJae, Park, Yeong Jae, Jung, Hyunmok, Lee, Joonho, Park, Dong Il
Abstract
This study investigates the kinematic role of palm degrees of freedom (DoF) in enhancing thumb opposability in a five-finger robotic hand. A hand model consisting of a five DoF thumb and four fingers with three to four DoF is analyzed, where palm motion is introduced between adjacent fingers. To quantitatively evaluate thumb-finger interaction, the overlap workspace volume is defined based on voxelized fingertip reachable regions. Seven cases are considered, including configurations with increased total DoF and configurations in which the total DoF is maintained by redistributing DoF from the fingers to the palm. The results show that palm DoF significantly improves opposability, particularly for the ring and little fingers, by repositioning their base locations rather than simply extending their reachable range. However, when the total DoF is constrained, redistributing DoF to the palm leads to trade-offs between overlap workspace expansion and kinematic redundancy. These findings indicate that palm DoF and finger DoF play distinct roles in hand kinematics and should be considered jointly in design. This study provides a quantitative framework for evaluating palm-induced opposability without relying on object or contact models and offers practical design guidelines for incorporating palm motion in robotic hands.
Chinese Translation
本研究探讨了掌部自由度(DoF)在增强五指机器人手拇指对立性方面的运动学作用。分析了一个包含五个自由度拇指和四个具有三到四个自由度手指的手模型,在相邻手指之间引入掌部运动。为了定量评估拇指与手指的相互作用,定义了基于体素化的指尖可达区域的重叠工作空间体积。考虑了七种情况,包括总自由度增加的配置以及通过将手指的自由度重新分配到掌部以维持总自由度的配置。结果表明,掌部自由度显著改善了对立性,特别是对于无名指和小指,通过重新定位它们的基底位置而不仅仅是扩展其可达范围。然而,当总自由度受到限制时,将自由度重新分配到掌部会导致重叠工作空间扩展与运动学冗余之间的权衡。这些发现表明,掌部自由度和手指自由度在手的运动学中发挥着不同的作用,应在设计中共同考虑。本研究提供了一种定量框架,用于评估掌部引起的对立性,而无需依赖于物体或接触模型,并为在机器人手中整合掌部运动提供了实际设计指导。
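Illustrative sketch (not from the paper): the opposability metric in the abstract is an overlap workspace volume computed from voxelized fingertip reachable regions. One simple way to compute such a quantity is to bin sampled fingertip positions into a voxel grid and count the voxels reached by both the thumb and a finger. The function names, the sample points, and the voxel size below are assumptions for illustration; the paper's kinematic model and sampling scheme are not reproduced.

```python
import numpy as np

def voxel_set(points, voxel_size):
    """Map 3D points to the set of integer voxel indices they occupy."""
    idx = np.floor(np.asarray(points, dtype=float) / voxel_size).astype(int)
    return {tuple(int(v) for v in row) for row in idx}

def overlap_volume(points_a, points_b, voxel_size):
    """Volume of the voxels reached by both point sets."""
    shared = voxel_set(points_a, voxel_size) & voxel_set(points_b, voxel_size)
    return len(shared) * voxel_size ** 3

# Hypothetical reachable fingertip samples (arbitrary units).
thumb = [[0.5, 0.5, 0.5], [1.5, 0.5, 0.5]]
finger = [[1.2, 0.1, 0.9], [2.5, 0.5, 0.5]]
print(overlap_volume(thumb, finger, voxel_size=1.0))  # prints 1.0
```

In practice the point sets would come from sweeping each finger's joint angles through forward kinematics. Adding a palm joint shifts the finger's base and therefore its whole voxel set, which is the repositioning effect the abstract reports for the ring and little fingers.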
cs.RO / 14 / 2604.22363

LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

LeHome:家庭场景下可变形物体操作的模拟环境
Li, Zeyi, Yang, Yushi, Xie, Shawn, Xu, Kyle, Chen, Tianxing, Wang, Yuran, Shen, Zhenhao, Shen, Yan, Chen, Yue, Li, Wenjun, Zheng, Yukun, Zhang, Chaorui, Lin, Siyi, Teng, Fei, Yang, Hongjun, Chen, Ming, Xie, Steve, Wu, Ruihai
Abstract
Household environments present one of the most common, impactful yet challenging application domains for robotics. Within household scenarios, manipulating deformable objects is particularly difficult, both in simulation and real-world execution, due to varied categories and shapes, complex dynamics, and diverse material properties, as well as the lack of reliable deformable-object support in existing simulations. We introduce LeHome, a comprehensive simulation environment designed for deformable object manipulation in household scenarios. LeHome covers a wide spectrum of deformable objects, such as garments and food items, offering high-fidelity dynamics and realistic interactions that existing simulators struggle to simulate accurately. Moreover, LeHome supports multiple robotic embodiments and emphasizes low-cost robots as a core focus, enabling end-to-end evaluation of household tasks on resource-constrained hardware. By bridging the gap between realistic deformable object simulation and practical robotic platforms, LeHome provides a scalable testbed for advancing household robotics. Webpage: https://lehome-web.github.io/ .
Chinese Translation
家庭环境是机器人技术应用中最常见、影响深远但又富有挑战性的领域之一。在家庭场景中,由于物体类别与形状多样、动态复杂、材料属性各异,以及现有模拟环境中缺乏可靠的可变形物体支持,操控可变形物体尤为困难。我们介绍了LeHome,一个专为家庭场景下可变形物体操作设计的综合模拟环境。LeHome涵盖了广泛的可变形物体,如服装和食品,提供高保真度的动态模拟和现实互动,这是现有模拟器难以准确模拟的。此外,LeHome支持多种机器人形态,并将低成本机器人作为核心重点,使得在资源受限的硬件上能够对家庭任务进行端到端的评估。通过弥合现实可变形物体模拟与实用机器人平台之间的差距,LeHome为推动家庭机器人技术提供了一个可扩展的测试平台。网页链接:https://lehome-web.github.io/
cs.RO / 15 / 2604.22378

Adaptive vs. Static Robot-to-Human Handover: A Study on Orientation and Approach Direction

自适应与静态人机交互传递:关于朝向和接近方向的研究
Biagi, Federico, Onfiani, Dario, Silenzi, Simone, Iani, Cristina, Biagiotti, Luigi
Abstract
Robot-to-human handovers often rely on static, open-loop strategies (or, at best, approaches that adapt only the position), which generally do not consider how the object will be grasped by the human, thus requiring the user to adapt. This work presents a novel adaptive framework that dynamically adjusts the object's delivery pose in real time based on the user's hand pose and the intended downstream task. By integrating AI-based hand pose estimation with smooth, kinematically constrained trajectories, the system ensures a safe approach and an optimal handover orientation. A comprehensive user study compares the proposed adaptive approach against a static baseline across multiple tasks, evaluating both subjective metrics (NASA-TLX, Human-Robot Trust Scale) and objective physiological data (blink rate measured via wearable eye-trackers). The results demonstrate that dynamic alignment significantly reduces users' cognitive workload and physiological stress, while increasing perceived trust in the robot's reliability. These findings highlight the potential of task- and pose-aware systems for enabling fluid and ergonomic human-robot collaboration.
Chinese Translation
人机交互传递通常依赖于静态的开环策略(或者说,仅在位置上进行微调的策略),这一般不会考虑人类如何抓取物体,从而要求用户进行适应。本研究提出了一种新颖的自适应框架,该框架基于用户的手部姿态和预期的下游任务动态地实时调整物体的递送姿态。通过将基于人工智能的手部姿态估计与平滑的运动学约束轨迹相结合,该系统确保了安全的接近和最佳的传递朝向。一项全面的用户研究将所提出的自适应方法与多个任务中的静态基线进行了比较,评估了主观指标(NASA-TLX、人与机器人信任量表)和客观生理数据(通过可穿戴眼动仪测量的眨眼频率)。结果表明,动态对齐显著降低了用户的认知工作负担和生理压力,同时增加了对机器人可靠性的感知信任。这些发现突显了任务和姿态感知系统在促进流畅和符合人体工程学的人机协作中的潜力。
cs.RO / 16 / 2604.22526

Information-Theoretic Geometry Optimization and Physics-Aware Learning for Calibration-Free Magnetic Localization

无标定磁定位的信息论几何优化与物理感知学习
Xie, Wenxuan, Zhang, Yuelin, Ding, Qingpeng, Chen, Jianghua, Tan, Jiewen, Shan, Jiwei, Cheng, Shing Shin
Abstract
Wireless localization of permanent magnets enables occlusion-free guidance for medical interventions, yet its practical accuracy is fundamentally limited by two coupled challenges: the poor observability of conventional planar sensor arrays and the simulation-to-reality (Sim-to-Real) gap of learning-based estimators. To address these issues, this article presents a unified framework that combines information-theoretic sensor geometry optimization with physics-aware deep learning. First, a rigorous Fisher Information Matrix (FIM)-based evaluation framework is established to quantify geometry-induced observability limitations. The results show that a staggered split-array topology provides a substantially stronger observability foundation for localization while remaining compatible with practical external deployment. Second, building on this optimized sensing configuration, we propose Phy-GAANet, a calibration-free estimator trained entirely on hardware-aware synthetic data. By incorporating Physics-Informed Features (PIF) for saturation modeling and Geometry-Aware Attention (GAA) for preserving cross-layer vector structure, the network effectively bridges the Sim-to-Real gap. Extensive real-world experiments demonstrate state-of-the-art performance, achieving a position error of 1.84 mm and an orientation error of 3.18 degrees at a refresh rate exceeding 270 Hz. The proposed method consistently outperforms classical Levenberg-Marquardt solvers and generic convolutional baselines, particularly in suppressing catastrophic outliers and maintaining robustness in challenging near-field boundary regions. Beyond the proposed network, the FIM-guided analysis also provides a framework for sensor geometry design in magnetic localization systems under practical deployment constraints.
Chinese Translation
无线定位永久磁体能够实现无遮挡的医疗干预引导,但其实际精度受到两个耦合挑战的根本限制:传统平面传感器阵列的可观测性差和基于学习的估计器的仿真与现实(Sim-to-Real)差距。为了解决这些问题,本文提出了一个统一框架,将信息论传感器几何优化与物理感知深度学习相结合。首先,建立了一个严谨的基于费舍尔信息矩阵(FIM)的评估框架,用以量化几何引起的可观测性限制。结果表明,错列分割阵列拓扑为定位提供了显著更强的可观测性基础,同时与实际外部部署兼容。其次,在此优化传感配置的基础上,我们提出了Phy-GAANet,这是一种完全基于硬件感知的合成数据训练的无标定估计器。通过引入物理信号特征(PIF)以进行饱和建模,并利用几何感知注意力(GAA)来保留跨层向量结构,网络有效地弥合了Sim-to-Real差距。大规模的真实世界实验展示了其最先进的性能,实现了1.84毫米的位置误差和3.18度的方向误差,刷新率超过270赫兹。所提方法在抑制严重离群值及维持在挑战性近场边界区域的鲁棒性方面,始终优于经典的列文伯格-马夸特求解器和通用卷积基准。超越所提出的网络,基于FIM的分析还为在实际部署约束下的磁定位系统中的传感器几何设计提供了框架。
cs.RO / 17 / 2604.22551

QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation

QDTraj: 多样化轨迹原语在关节物体机器人操控中的探索
Kappel, Mathilde, Khoramshahi, Mahdi, Annabi, Louis, Amar, Faiz Ben, Doncieux, Stéphane
Abstract
Thanks to the latest advances in learning and robotics, domestic robots are beginning to enter homes, aiming to execute household chores autonomously. However, robots still struggle to perform autonomous manipulation tasks in open-ended environments. In this context, this paper presents a method that enables a robot to manipulate a wide spectrum of articulated objects by automatically generating different low-level trajectory primitives for given object articulations. A key consideration when generating expert trajectories is the diversity of solutions that achieve the same goal. Indeed, knowing diverse low-level primitives for the same task enables the robot to choose the best solution in its real-world environment, under live constraints and unexpected changes. To do so, we propose a method based on Quality-Diversity algorithms that leverages sparse reward exploration in order to generate a set of diverse and high-performing trajectory primitives for a given manipulation task. We validated our method, QDTraj, by generating diverse trajectories in simulation and deploying them in the real world. QDTraj generates at least 5 times more diverse trajectories for both hinge and slider activation tasks, outperforming the other methods we compared against. We assessed the generalization of our method over 30 articulations of the PartNetMobility articulated object dataset, with an average of 704 different trajectories per task. Code is publicly available at: https://kappel.web.isir.upmc.fr/trajectory_primitive_website
Chinese Translation
得益于学习与机器人技术的最新进展,家庭机器人正逐渐走入家庭,旨在自主执行家务。然而,机器人在开放环境中进行自主操控任务仍然面临挑战。在此背景下,本文提出了一种方法,能够使机器人操控广泛的关节物体。我们自动生成不同的机器人低级轨迹原语,以操控给定物体的关节。在生成专家轨迹时,考虑实现同一目标的解决方案多样性是一个非常重要的因素。实际上,了解完成同一任务的多样化低级原语使机器人能够在真实环境中选择最佳解决方案,应对实时约束和意外变化。为此,我们提出了一种基于质量-多样性(Quality-Diversity)算法的方法,该方法利用稀疏奖励探索,以生成特定操控任务的一组多样化且高效的轨迹原语。我们通过在模拟中生成多样轨迹并在现实世界中部署它们来验证我们的方法QDTraj。QDTraj为铰链和滑块激活任务生成的多样轨迹至少比其他比较方法多出五倍。我们在PartNetMobility关节物体数据集的30种关节上评估了我们方法的泛化性能,每个任务的平均轨迹数量达到704条。代码可在以下网址公开获取:https://kappel.web.isir.upmc.fr/trajectory_primitive_website
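The Quality-Diversity idea described in the QDTraj abstract can be illustrated with a minimal MAP-Elites-style loop. This is only a sketch under stated assumptions, not the QDTraj implementation: the abstract does not say which Quality-Diversity algorithm is used, the sparse reward exploration is not modeled here, and the toy `evaluate` function, the two-parameter "trajectory", and the grid size are all hypothetical.

```python
import random


def evaluate(params):
    """Hypothetical stand-in for a simulated rollout.

    Returns (quality, descriptor). Quality is higher (closer to 0) when the
    parameters lie near the unit circle; the descriptor is the first
    parameter clipped to [0, 1).
    """
    x, y = params
    quality = -abs(x * x + y * y - 1.0)
    descriptor = min(max(x, 0.0), 0.999999)
    return quality, descriptor


def map_elites(iterations=2000, cells=10, seed=0):
    """Keep the best candidate found so far in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell index -> (quality, params)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite from the archive.
            _, parent = archive[rng.choice(sorted(archive))]
            child = tuple(p + rng.gauss(0.0, 0.1) for p in parent)
        else:
            # Otherwise sample a fresh random candidate.
            child = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
        quality, descriptor = evaluate(child)
        cell = int(descriptor * cells)
        # Store the child only if its cell is empty or it beats the elite.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, child)
    return archive


archive = map_elites()
```

The resulting archive holds one high-quality candidate per occupied cell, which is the "diverse and high-performing" set that Quality-Diversity methods aim for; a robot could then pick among the stored elites according to its current constraints.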
cs.RO / 18 / 2604.22591

RedVLA: Physical Red Teaming for Vision-Language-Action Models

RedVLA:针对视觉-语言-行动模型的物理红队测试
Zhang, Yuhao, Zhang, Borong, Fan, Jiaming, Shen, Jiachen, Cai, Yishuai, Yang, Yaodong, Ji, Jiaming
Abstract
The real-world deployment of Vision-Language-Action (VLA) models remains limited by the risk of unpredictable and irreversible physical harm. However, we currently lack effective mechanisms to proactively detect these physical safety risks before deployment. To address this gap, we propose RedVLA, the first red teaming framework for physical safety in VLA models. We systematically uncover unsafe behaviors through a two-stage process: (I) Risk Scenario Synthesis constructs a valid and task-feasible initial risk scene. Specifically, it identifies critical interaction regions from benign trajectories and positions the risk factor within these regions, aiming to entangle it with the VLA's execution flow and elicit a target unsafe behavior. (II) Risk Amplification ensures stable elicitation across heterogeneous models. It iteratively refines the risk factor state through gradient-free optimization guided by trajectory features. Experiments on six representative VLA models show that RedVLA uncovers diverse unsafe behaviors and achieves an attack success rate (ASR) of up to 95.5% within 10 optimization iterations. To mitigate these risks, we further propose SimpleVLA-Guard, a lightweight safety guard built from RedVLA-generated data. Our data, assets, and code are available at https://redvla.github.io.
Chinese Translation
视觉-语言-行动(VLA)模型在现实世界中的部署仍然受到不可预测和不可逆的物理伤害风险的限制。然而,目前我们缺乏有效的机制可以在部署前主动检测这些物理安全风险。为了解决这一空白,我们提出了RedVLA,这是第一个针对VLA模型物理安全的红队框架。我们通过两个阶段的过程系统地揭示不安全行为:(I) 风险场景合成构建有效且任务可行的初始风险场景。具体而言,它从良性轨迹中识别关键交互区域,并将风险因素放置在这些区域内,旨在将其与VLA的执行流程纠缠在一起,从而引发目标不安全行为。(II) 风险放大确保在异构模型中稳定引发不安全行为。它通过不依赖梯度的优化,在轨迹特征的引导下迭代精炼风险因素状态。对六个代表性VLA模型的实验表明,RedVLA揭示了多种不安全行为,并在10次优化迭代内达到了高达95.5%的攻击成功率(ASR)。为了减轻这些风险,我们进一步提出了SimpleVLA-Guard,一种基于RedVLA生成数据构建的轻量级安全保护机制。我们的数据、资产和代码可在 https://redvla.github.io 获取。
cs.RO / 19 / 2604.22615

GazeVLA: Learning Human Intention for Robotic Manipulation

GazeVLA:学习人类意图以实现机器人操控
Li, Chengyang, Xiong, Kaiyi, Xu, Yuan, Qian, Lei, Wang, Yizhou, Zhu, Wentao
Abstract
Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.
Chinese Translation
具身基础模型在机器人操控方面取得了重大突破,但仍然严重依赖大规模的机器人示范。尽管最近的研究探讨了利用人类数据来减轻这种依赖,但由于人类与机器人之间固有的具身差距,有效提取可转移知识仍然是一项重大挑战。我们认为,人类行为背后的意图可以作为一种强大的中介表征,用于弥合这一差距。在本文中,我们提出了一种新颖的框架,明确学习并转移人类意图,以促进机器人操控。具体而言,我们通过注视(gaze)对意图进行建模,因为注视自然先于身体动作,并且可以作为人类意图的可观察代理。我们的模型首先在一个大规模的自我中心人类数据集中进行预训练,以捕捉人类意图及其与动作的协同,然后在一小部分机器人和人类数据上进行微调。在推理过程中,模型采用链式思维(Chain-of-Thought)推理范式,顺序预测意图,然后再执行动作。在模拟和现实世界设置下,针对长时间跨度和细致任务,以及在少量样本和鲁棒性基准测试中进行的广泛评估显示,我们的方法在性能上始终优于强基线,更好地泛化,并取得了最先进的表现。
cs.RO / 20 / 2604.22715

ATRS: Adaptive Trajectory Re-splitting via a Shared Neural Policy for Parallel Optimization

ATRS:通过共享神经策略进行自适应轨迹重分割的并行优化
Yu, Jiajun, Liu, Guodong, Wang, Li, Zhou, Pengxiang, Liu, Wentao, He, Yin, Xu, Chao, Gao, Fei, Cao, Yanjun
Abstract
Parallel trajectory optimization via the Alternating Direction Method of Multipliers (ADMM) has emerged as a scalable approach to long-horizon motion planning. However, existing frameworks typically decompose the problem into parallel subproblems based on a predefined fixed structure. Such structural rigidity often causes optimization stagnation in highly constrained regions, where a few lagging subproblems delay global convergence. A natural remedy is to adaptively re-split these stagnating segments online. Yet, deciding when, where, and how to split exceeds the capability of rule-based heuristics. To this end, we propose ATRS, a novel framework that embeds a shared Deep Reinforcement Learning policy into the parallel ADMM loop. We formulate this adaptive adjustment as a Multi-Agent Shared-Policy Markov Decision Process, where all trajectory segments act as homogeneous agents and share a unified neural policy network. This parameter-sharing architecture endows the system with size invariance, enabling it to handle dynamically changing segment counts during re-splitting and generalize to arbitrary trajectory lengths. Furthermore, our formulation inherently supports zero-shot generalization to unseen environments, as our network relies solely on the internal states of the numerical solver rather than on the geometric features of the environment. To ensure solver stability, a Confidence-Based Election mechanism selects only the most stagnating segment for re-splitting at each step. Extensive simulations demonstrate that ATRS accelerates convergence, reducing the number of iterations by up to 26.0% and the computation time by up to 19.1%. Real-world experiments further confirm its applicability to both large-scale offline global planning and real-time onboard replanning within 35 ms per cycle, with no sim-to-real degradation.
Chinese Translation
通过交替方向乘子法(ADMM)进行的并行轨迹优化已成为一种可扩展的长期运动规划方法。然而,现有框架通常基于预定义的固定结构将问题分解为并行子问题。这种结构的刚性常常导致在高度约束的区域内优化停滞,其中一些滞后的子问题延迟了全局收敛。一个自然的解决办法是在线自适应地重新分割这些停滞的片段。然而,决定何时、何地以及如何分割超出了基于规则的启发式方法的能力。为此,我们提出了ATRS,一个将共享深度强化学习策略嵌入到并行ADMM循环中的新框架。我们将这种自适应调整形式化为一个多智能体共享策略马尔可夫决策过程,其中所有轨迹片段作为同质智能体并共享一个统一的神经策略网络。这种参数共享架构赋予系统大小不变性,使其能够在重分割过程中处理动态变化的片段数量,并且可以推广到任意轨迹长度。此外,我们的构造本质上支持对未见环境的零样本泛化,因为我们的网络仅依赖于数值求解器的内部状态,而不依赖于环境的几何特征。为了确保求解器的稳定性,一个基于置信度的选择机制在每一步仅选择最为停滞的片段进行重分割。大量模拟表明,ATRS加速了收敛,将迭代次数减少了高达26.0%,计算时间减少了高达19.1%。现实世界实验进一步证实其在大规模离线全局规划和每周期35毫秒的实时机载重新规划中的适用性,且没有模拟到现实的降级。
cs.RO / 21 / 2604.22724

GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories

GCImOpt:通过模仿最优轨迹学习高效的目标条件策略
Goikoetxea, Jon, Palacián, Jesús F.
Abstract
Imitation learning is a well-established approach for machine-learning-based control. However, its applicability depends on having access to demonstrations, which are often expensive to collect and/or suboptimal for solving the task. In this work, we present GCImOpt, an approach to learn efficient goal-conditioned policies by training on datasets generated by trajectory optimization. Our approach for dataset generation is computationally efficient, can generate thousands of optimal trajectories in minutes on a laptop computer, and produces high-quality demonstrations. Further, by means of a data augmentation scheme that treats intermediate states as goals, we are able to increase the training dataset size by an order of magnitude. Using our generated datasets, we train goal-conditioned neural network policies that can control the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and then train policies for various control tasks, namely cart-pole stabilization, planar and three-dimensional quadcopter stabilization, and point reaching using a 6-DoF robot arm. We show that our trained policies can achieve high success rates and near-optimal control profiles, all while being small (less than 80,000 neural network parameters) and fast enough (up to more than 6,000 times faster than a trajectory optimization solver) that they could be deployed onboard resource-constrained controllers. We provide videos, code, datasets and pre-trained policies under a free software license; see our project website https://jongoiko.github.io/gcimopt/.
Chinese Translation
模仿学习是一种成熟的基于机器学习的控制方法。然而,它的适用性依赖于对演示样本的获取,而这些样本通常收集成本高且/或在解决任务时表现不佳。在本研究中,我们提出了GCImOpt,一种通过在轨迹优化生成的数据集上进行训练来学习高效目标条件策略的方法。我们的数据集生成方法在计算上高效,可以在笔记本电脑上在几分钟内生成数千条最优轨迹,并产生高质量的演示。进一步地,通过一种将中间状态视为目标的数据增强方案,我们能够将训练数据集的规模提升一个数量级。使用我们生成的数据集,我们训练了能够朝向任意目标控制系统的目标条件神经网络策略。为了展示我们方法的通用性,我们生成数据集并训练用于各种控制任务的策略,即小车-杆稳定、平面及三维四旋翼稳定,以及使用6自由度机器人手臂进行点位到达。我们展示了我们的训练策略可以实现高成功率和接近最优的控制特征,并且它们体积小(少于80,000个神经网络参数)且足够快速(比轨迹优化求解器快超过6,000倍),能够在资源受限的控制器上部署。我们提供视频、代码、数据集和预训练策略,使用自由软件许可证;请访问我们项目网站 https://jongoiko.github.io/gcimopt/。
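The data augmentation step mentioned in the GCImOpt abstract, treating intermediate states as goals, can be sketched as a simple relabeling of one trajectory into many training samples. This is a minimal illustration rather than the authors' code: the function name, the scalar toy states, and the choice to use every later state as a goal are assumptions, and whether a sub-trajectory remains optimal toward an intermediate state depends on how the optimization problem is posed.

```python
def relabel_with_intermediate_goals(states, controls):
    """Turn one trajectory into many (state, goal, control) samples.

    `states` has one more element than `controls`: controls[i] is the
    action applied in states[i] to reach states[i + 1].
    """
    if len(states) != len(controls) + 1:
        raise ValueError("expected one more state than controls")
    samples = []
    for i in range(len(controls)):
        # Every later state on the same trajectory can serve as a goal.
        for j in range(i + 1, len(states)):
            samples.append((states[i], states[j], controls[i]))
    return samples


# Hypothetical 1-D trajectory: four states, three controls.
states = [0.0, 0.5, 0.9, 1.0]
controls = [1.0, 0.8, 0.2]
samples = relabel_with_intermediate_goals(states, controls)
```

With only the final state as a goal, this trajectory yields 3 samples; with relabeling it yields 6. For a trajectory of N steps the count grows from N to N(N+1)/2, which is in line with the order-of-magnitude increase the abstract reports for longer trajectories.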
计算机视觉 (Computer Vision)
61
cs.CV / 1 / 2604.21982

Forecasting Solar Energy Using a Single Image

利用单幅图像预测太阳能
Klotz, Jeremy, Nayar, Shree K.
Abstract
Solar panels are increasingly deployed in cities on rooftops, walls, and urban infrastructure. Although the panel costs have fallen in recent years, the soft costs of installing them have not. These soft costs include assessing the illumination (irradiance) of a panel, which is typically performed using a 3D model that fails to capture small nearby structures that impact the irradiance. Our approach uses a single image taken at the panel's location to forecast its irradiance at any time in the future. We use visual cues in the image to find the camera's orientation and the portion of the sky visible to the panel in order to forecast the irradiance due to the sun and the sky. In addition, we show that the irradiance due to reflections from nearby buildings varies smoothly over time and can be forecasted from the image. This approach enables assessing the solar energy potential of any surface and forecasting the temporal variation of a panel's irradiance. We validate our approach using real irradiance measurements in urban canyons. We show that our approach often yields more accurate irradiance forecasts compared to conventional irradiance-based transposition methods and 3D model-based simulations. We also show that a single spherical image can be used to find the best fixed orientation of a panel. Finally, we present Solaris, a device to capture the image seen by a panel in a variety of urban settings.
Chinese Translation
太阳能电池板越来越多地被部署在城市的屋顶、墙壁及城市基础设施上。尽管近年来电池板的成本有所降低,但其安装的软成本并未下降。这些软成本包括评估电池板的照明(辐照度),这通常使用3D模型进行评估,而该模型难以捕捉到影响辐照度的小型邻近结构。我们的方法利用在电池板位置拍摄的单幅图像,预测其未来任何时刻的辐照度。我们使用图像中的视觉线索来确定相机的朝向以及电池板可见的天空部分,以预测来自太阳和天空的辐照度。此外,我们展示了来自周围建筑物的反射辐照度随时间平滑变化,可以通过图像进行预测。这种方法使得可以评估任何表面的太阳能潜力,并预测电池板辐照度的时间变化。我们利用城市峡谷中的实际辐照度测量验证了我们的方法。结果表明,与传统的基于辐照度的转置方法和基于3D模型的模拟相比,我们的方法通常能产生更准确的辐照度预测。我们还展示了一幅单独的球形图像可以用来找到电池板的最佳固定朝向。最后,我们介绍了Solaris,一种用于在各种城市环境中捕捉电池板视角图像的设备。
cs.CV / 2 / 2604.21984

Soft Anisotropic Diagrams for Differentiable Image Representation

柔性各向异性图用于可微图像表示
Iinbor, Laki, Dou, Zhiyang, Matusik, Wojciech
Abstract
We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership. Such a formulation enables efficient rendering by maintaining a per-query top-K map that approximates nearest neighbors under the same shading score, allowing GPU-friendly, fixed-size local computation. We update this list using our top-K propagation scheme inspired by jump flooding, augmented with stochastic injection to provide probabilistic global coverage. Training follows a GPU-first pipeline with gradient-weighted initialization, Adam optimization, and adaptive budget control through densification and pruning. Across standard benchmarks, SAD consistently outperforms Image-GS and Instant-NGP at matched bitrate. On Kodak, SAD reaches 46.0 dB PSNR with 2.2 s encoding time (vs. 28 s for Image-GS), and delivers 4-19 times end-to-end training speedups over state-of-the-art baselines. We demonstrate the effectiveness of SAD by showcasing the seamless integration with differentiable pipelines for forward and inverse problems, efficiency of fast random access, and compact storage.
Chinese Translation
我们提出了一种柔性各向异性图(Soft Anisotropic Diagrams,SAD),这是一种通过图像平面中一组自适应点参数化的显式且可微的图像表示。在SAD中,每个点指定一个各向异性度量和一个加权的距离分数,我们通过对一个小的每像素的前K个点的softmax融合来计算像素颜色。我们诱导了一个带有可学习每个点温度的柔性各向异性加权Voronoi划分(即阿波罗尼乌斯图),保留了信息丰富的梯度,同时允许清晰的、与内容对齐的边界和显式的归属。这种公式化通过维护每次查询的前K个映射,实现了高效渲染,该映射在相同的阴影分数下近似最近邻,从而允许GPU友好且固定大小的局部计算。我们使用受跳跃泛洪启发的前K传播方案更新此列表,并通过随机注入增强,以提供概率全球覆盖。训练遵循一种GPU优先的管道,采用梯度加权初始化、Adam优化,以及通过稠密化和修剪进行自适应预算控制。在标准基准测试中,SAD在匹配比特率下持续超越Image-GS和Instant-NGP。在Kodak数据集上,SAD达到了46.0 dB的PSNR,编码时间为2.2秒(相比Image-GS的28秒),并在端到端训练速度上比最先进的基线快4到19倍。我们通过展示SAD与可微管道在正向和逆向问题上的无缝集成、快速随机访问的效率,以及紧凑存储的有效性,证明了SAD的有效性。
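The per-pixel blending described in the SAD abstract (an anisotropic metric, an additive weight, and a softmax over a top-K subset of sites) can be sketched in plain Python. This is an illustration under assumptions, not the paper's renderer: the exact score, the way the per-site temperature enters the softmax, and the dictionary field names are hypothetical, and the top-K set is found by brute-force sorting instead of the propagation scheme the abstract mentions.

```python
import math


def site_score(pixel, site):
    """Additively weighted anisotropic distance (lower means closer)."""
    dx = pixel[0] - site["pos"][0]
    dy = pixel[1] - site["pos"][1]
    (a, b), (_, c) = site["metric"]  # symmetric 2x2 matrix [[a, b], [b, c]]
    quad = a * dx * dx + 2.0 * b * dx * dy + c * dy * dy
    return math.sqrt(max(quad, 0.0)) - site["weight"]


def blend_pixel(pixel, sites, k=2):
    """Softmax-blend the colors of the k best-scoring sites."""
    scored = sorted(((site_score(pixel, s), s) for s in sites),
                    key=lambda t: t[0])[:k]
    logits = [-score / s["temperature"] for score, s in scored]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    color = tuple(
        sum(w * s["color"][ch] for w, (_, s) in zip(weights, scored))
        for ch in range(3)
    )
    return color, weights


identity = ((1.0, 0.0), (0.0, 1.0))
sites = [
    {"pos": (0.0, 0.0), "metric": identity, "weight": 0.0,
     "temperature": 0.1, "color": (1.0, 0.0, 0.0)},
    {"pos": (10.0, 0.0), "metric": identity, "weight": 0.0,
     "temperature": 0.1, "color": (0.0, 0.0, 1.0)},
]
near_color, near_weights = blend_pixel((0.0, 0.0), sites)
mid_color, mid_weights = blend_pixel((5.0, 0.0), sites)
```

A pixel sitting on the red site comes out almost purely red, while a pixel halfway between the two sites gets an even mix. Lower temperatures sharpen the boundary between sites, and higher ones soften it.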
cs.CV / 3 / 2604.22036

EgoMAGIC: An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

EgoMAGIC——一个用于训练感知算法的自我中心视频现场医疗数据集
VanVoorst, Brian, Walczak, Nicholas, Gilleo, Christopher, Meissner, Charles, Felix, Fabio, Roman, Iran, Steers, Bea, Silva, Claudio, Shen, Yuhan, Lu, Zijia, Lee, Shih-Po, Elhamifar, Ehsan
Abstract
This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).
Chinese Translation
本文介绍了EgoMAGIC(医疗辅助、指导、指令与纠正),该数据集是在DARPA的感知任务指导(PTG)项目下收集的自我中心医疗活动数据集。该数据集包含3,355个视频,涵盖50个医疗任务,每个任务至少有50个标注视频。PTG项目的主要目标是开发集成在增强现实头戴设备中的虚拟助手,以帮助用户执行复杂任务。为了鼓励利用该数据集进行探索和研究,医疗训练数据与一个以八个医疗任务为重点的动作检测挑战一同发布。大多数视频是使用集成音频的头戴立体相机录制的。基于该数据集,使用195万个标签训练了40个YOLO模型,以检测124个医疗对象,为从事医疗人工智能应用的开发者提供了一个可靠的起点。除了介绍该数据集,本文还呈现了三个模型在八个选定医疗任务上的动作检测基准结果,表现最佳的方法达到了平均mAP 0.526。尽管本文主要讨论了动作检测作为基准,EgoMAGIC数据集同样适用于动作识别、物体识别与检测、错误检测以及其他具有挑战性的计算机视觉任务。该数据集可通过zenodo.org获取(DOI: 10.5281/zenodo.19239154)。
cs.CV / 4 / 2604.22045

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

H-集:赫希矩阵引导下的图像分类器集合级特征交互发现
Mehrotra, Ayushi, Bhusal, Dipkamal, Clifford, Michael, Rastogi, Nidhi
Abstract
Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generates more interpretable and faithful saliency maps compared to existing methods.
Chinese Translation
特征归因方法通过为单个输入特征分配重要性分数,来解释深度神经网络的预测。然而,大多数现有的方法仅关注边际效应,而忽视了特征间的交互作用,即特征组共同影响模型输出。这种交互作用在图像分类任务中尤为重要,因为语义意义往往源于像素之间的相互依赖,而非孤立特征。现有基于交互的图像方法要么过于粗糙(例如,仅仅使用超像素),要么未能满足核心可解释性公理。在本研究中,我们提出了H-集,这是一种新颖的两阶段框架,用于发现和归因图像分类器中的高阶特征交互。首先,我们通过输入赫希矩阵检测局部相互作用对,并将它们递归合并成语义一致的集合;这里使用的分割来自于Segment Anything (SAM),作为一种空间分组先验,但可以被其他分割方法替代。其次,我们使用IDG-Vis对每个集合进行归因,IDG-Vis是集水平扩展的综合方向梯度,它沿着像素空间路径整合方向梯度,并利用Harsanyi红利进行聚合。尽管赫希矩阵在检测阶段引入了额外的计算成本,但这种有针对性的开销始终产生稀疏且更真实的显著性图。对ImageNet和CUB数据集上VGG、ResNet、DenseNet和MobileNet模型的评估表明,H-集生成的显著性图相比现有方法更具可解释性和忠实度。
cs.CV / 5 / 2604.22093

FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision

FLARE-BO:通过贝叶斯优化进行融合亮度和自适应Retinex增强的低光照机器人视觉
Shankar, Nathan, Ladosz, Pawel, Yin, Hujun
Abstract
Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training-free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means (NLM) denoising, which tends to over-smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post-smoothing. The search engine employs unit-hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. The proposed method is benchmarked on the Low Light paired dataset (LOL), and the results show marked improvements over existing methods that were not specifically trained on this dataset.
Chinese Translation
在低照度环境下,可靠的视觉感知仍然是自主机器人系统面临的核心挑战,图像质量的下降直接影响到导航、检查和各种操作。最近一种无训练的方法表明,使用高斯过程的贝叶斯优化能够在每张图像的基础上自适应选择亮度、对比度和去噪参数,实现在没有任何学习模型的情况下实现竞争性的增强。然而,该框架仅限于三个参数,未进行照明分解或白平衡校正,并依赖于非局部均值(Non-Local Means)去噪,这在嘈杂条件下容易使边缘过于平滑。本文提出FLARE-BO(通过贝叶斯优化进行融合亮度和自适应Retinex增强),一个扩展框架,联合优化八个参数,包括伽马校正、LIME样式的照明归一化、色度去噪、双边滤波、NLM去噪、灰色世界自动白平衡和自适应后平滑。搜索引擎采用单位超立方体参数规范化、目标标准化、Sobol准随机初始化和对数期望改进获取,原则性地探索扩展空间。通过低光配对数据集(Low Light paired dataset,LOL)对所提方法的性能进行基准测试,结果显示该方法在未专门使用该数据集训练的现有方法中明显提升。
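Two of the stages named in the FLARE-BO abstract, gamma correction and Grey-World automatic white balance, are standard operations that can be shown on a tiny list of RGB pixels. This is a minimal sketch assuming values in [0, 1] and the convention `out = in ** gamma`. It is not the FLARE-BO pipeline, and the gamma value below is hand-picked for illustration rather than chosen by Bayesian optimisation.

```python
def gamma_correct(pixels, gamma):
    """Apply out = in ** gamma to RGB values in [0, 1]; gamma < 1 brightens."""
    return [tuple(c ** gamma for c in px) for px in pixels]


def grey_world(pixels):
    """Scale each channel so that the three channel means become equal.

    Results are clipped to 1.0, so the means match exactly only when no
    value saturates.
    """
    n = len(pixels)
    means = [sum(px[ch] for px in pixels) / n for ch in range(3)]
    grey = sum(means) / 3.0
    gains = [grey / m if m > 0 else 1.0 for m in means]
    return [tuple(min(px[ch] * gains[ch], 1.0) for ch in range(3))
            for px in pixels]


# Hypothetical dark image with a reddish cast, two pixels only.
dark = [(0.04, 0.02, 0.01), (0.16, 0.08, 0.04)]
bright = gamma_correct(dark, 0.5)   # square root brightens dark values
balanced = grey_world(bright)
channel_means = [sum(px[ch] for px in balanced) / len(balanced)
                 for ch in range(3)]
```

After the Grey-World step the red, green, and blue means coincide, which removes the colour cast under the assumption that the scene averages to grey. In a per-image optimisation setting, `gamma` would be one of the parameters being searched.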
cs.CV / 6 / 2604.22118

Robust Camera-to-Mocap Calibration and Verification for Large-Scale Multi-Camera Data Capture

大规模多摄像头数据捕捉的鲁棒摄像头与运动捕捉系统标定与验证
Liu, Tianyi, Twigg, Christopher, Grady, Patrick, Harris, Kevin, Han, Shangchen, He, Kun
Abstract
Optical motion capture (mocap) systems are widely used for ground-truth capture in AR/VR, SLAM and robotics datasets. These datasets require extrinsic calibration to align mocap coordinates to external camera frames. This step is subject to multiple sources of error in practice, and failures often go undetected until they corrupt downstream data. These issues are compounded for fisheye cameras, where spatially non-uniform distortion makes both calibration and verification more challenging. We present a calibration and verification system designed for this setting. Concretely, we target robustness to board-to-marker attachment variation, optimization initialization ambiguity, and session-to-session calibration drift after deployment. The calibration jointly estimates camera extrinsics and the board-to-marker transform, and uses a staged solver to improve convergence reliability under ambiguous initialization. The verification component, lollypop, provides fast, operator-independent assessment through a measurement chain entirely independent of the calibration data. In experiments on a Meta Quest 3 headset with fisheye cameras, our calibration outperforms existing benchmarks, and lollypop reliably detects calibration degradation over time. The system has been deployed in production data collection pipelines.
Chinese Translation
光学运动捕捉(mocap)系统在增强现实/虚拟现实(AR/VR)、同时定位与地图构建(SLAM)和机器人数据集的真实情况捕捉中被广泛采用。这些数据集需要进行外部标定,以将运动捕捉坐标与外部摄像头框架对齐——这一过程在实践中容易受到多个误差来源的影响,且失败往往在数据下游被破坏之前未被检测到。对于鱼眼摄像头,这些问题更加复杂,因为空间上不均匀的畸变使得标定和验证变得更加具有挑战性。我们提出了一种为此设定设计的标定与验证系统。具体而言,我们的目标是针对板与标记附着变化、优化初始化的模糊性以及在部署后会话间标定漂移的鲁棒性。标定过程同时估计摄像头的外部参数和板与标记的变换,并使用分阶段求解器来改善在模糊初始化下的收敛可靠性。验证组件lollypop通过一个完全独立于标定数据的测量链提供快速的、操作员无关的评估。在使用鱼眼摄像头的Meta Quest 3耳机的实验中,我们的标定方法优于现有基准,而lollypop则能够可靠地检测到标定随时间的退化。该系统已在生产数据收集流程中投入使用。
cs.CV / 7 / 2604.22129

PAGaS: Pixel-Aligned 1DoF Gaussian Splatting for Depth Refinement

PAGaS:像素对齐的一维自由度高斯溅射用于深度细化
Recasens, David, Maier, Robert, Bozic, Aljaz, Grabli, Stephane, Civera, Javier, Tung, Tony, Boyer, Edmond
Abstract
Gaussian Splatting (GS) has emerged as an efficient approach for high-quality novel view synthesis. While early GS variants struggled to accurately model the scene's geometry, recent advancements constraining the Gaussians' spread and shapes, such as 2D Gaussian Splatting, have significantly improved geometric fidelity. In this paper, we present Pixel-Aligned 1DoF Gaussian Splatting (PAGaS) that adapts the GS representation from novel view synthesis to the multi-view stereo depth task. Our key contribution is modeling a pixel's depth using one-degree-of-freedom (1DoF) Gaussians that remain tightly constrained during optimization. Unlike existing approaches, our Gaussians' positions and sizes are restricted by the back-projected pixel volumes, leaving depth as the sole degree of freedom to optimize. PAGaS produces highly detailed depths, as illustrated in Figure 1. We quantitatively validate these improvements on top of reference geometric and learning-based multi-view stereo baselines on challenging 3D reconstruction benchmarks. Code: davidrecasens.github.io/pagas
Chinese Translation
高斯溅射(Gaussian Splatting, GS)已成为高质量新视角合成的一种有效方法。尽管早期的 GS 变体在准确建模场景几何结构方面面临挑战,但最近对高斯的扩展和形状的约束,例如二维高斯溅射(2D Gaussian Splatting),显著提高了几何真度。在本文中,我们提出了像素对齐的一维自由度高斯溅射(Pixel-Aligned 1DoF Gaussian Splatting, PAGaS),该方法将 GS 表征从新视角合成适应于多视角立体深度任务。我们主要的贡献在于使用一维自由度(1DoF)高斯模型捕捉像素的深度,并在优化过程中保持紧密约束。与现有方法不同,我们的高斯位置和大小受到反投影像素体积的限制,将深度作为唯一的自由度进行优化。PAGaS 生成的深度高度细致,如图 1 所示。在挑战性的三维重建基准上,我们在参考几何和基于学习的多视角立体基准上定量验证了这些改进。代码链接:davidrecasens.github.io/pagas
cs.CV / 8 / 2604.22139

Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography

解剖学意识的无监督视网膜异常检测与定位在光学相干断层成像中的应用
Haghighi, Tania, Gholami, Sina, Tabkhi, Hamed, Alam, Minhaj Nur
Abstract
Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.
Chinese Translation
光学相干断层成像(Optical Coherence Tomography, OCT)图像的可靠自动分析对诊断视网膜疾病至关重要,但面临着一个关键障碍:对昂贵且劳动密集型的专家标注的需求。由于监督深度学习模型在注释的异常词汇有限,这使得它们在不同病理、成像设备和患者人群之间的泛化能力受到限制。我们提出了一种无监督异常检测框架,能够在没有病变注释的情况下学习健康视网膜解剖的规范分布,直接解决临床应用中的注释效率挑战。我们的方法利用在正常B扫描图像上训练的离散潜在模型,以捕捉特定于OCT的结构模式。为了增强临床稳健性,我们结合了视网膜层意识的监督和结构三元组学习,以区分健康和病理表征,提高在不同成像条件下模型的可靠性。在推理过程中,异常通过重建差异来检测和定位,使得能够在不需要特定疾病标签的情况下进行图像和像素级的识别。在Kermany数据集上(AUROC: 0.799),我们的方法显著优于VAE、VQVAE、VQGAN和f-AnoGAN基线。重要的是,在Srinivasan的数据集上的跨数据集评估取得了AUROC 0.884,并表现出优越的泛化能力,展示了强大的领域适应性。在外部RETOUCH基准上,无监督异常分割实现了具有竞争力的Dice(0.200)和mIoU(0.117)得分,验证了跨机构的可重复性。
cs.CV / 9 / 2604.22160

GenMatter: Perceiving Physical Objects with Generative Matter Models

生成物质模型:感知物理对象
Li, Eric, Dasgupta, Arijit, Friedman, Yoni, Huot, Mathieu, Mansinghka, Vikash, O'Connell, Thomas, Freeman, William T., Tenenbaum, Joshua B.
Abstract
Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
Chinese Translation
人类视觉感知为理解基于运动的场景解释的计算原理提供了宝贵的见解。人类能够稳定地检测并分割移动实体,这些实体构成了可以独立移动的物质块,无论是观察稀疏的运动点、纹理表面,还是自然场景。相比之下,现有的计算机视觉系统缺乏在这些多样环境中统一的方法。受到人类感知原理的启发,我们提出了一种生成模型,该模型将低级的运动线索和高级的外观特征进行分层组合,形成粒子(代表局部物质的小高斯分布),并将粒子聚合为捕捉到一致且独立移动的物理实体的簇。我们开发了一种基于并行化块吉布斯采样的硬件加速推理算法,以恢复稳定的粒子运动和聚类。我们的模型可以处理不同类型的输入(随机点、风格化纹理或自然RGB视频),使其能够在生物视觉成功但现有计算机视觉方法不成功的环境中工作。我们在三个领域验证了这一统一框架:在2D随机点运动图上,我们的方法捕捉到人类物体感知,包括在模糊条件下的逐级不确定性;在一个受格式塔启发的伪装旋转物体数据集上,我们的方法从运动中恢复正确的3D结构,从而实现准确的2D物体分割;在自然RGB视频上,我们的模型跟踪构成变形物体的移动3D物质,从而实现稳健的物体级场景理解。因此,这项工作建立了一个基于运动的感知的一般框架,该框架扎根于人类视觉的原理。
cs.CV / 10 / 2604.22162

SAMIDARE: Advanced Tracking-by-Segmentation for Dense Scenarios

SAMIDARE:用于密集场景的先进基于分割的多目标跟踪
Hirano, Shozaburo, Ukita, Norimichi
Abstract
Automated sports analysis demands robust multi-object tracking (MOT), yet segmentation-based methods often struggle with mask errors and ID switches in dense scenes. We propose SAMIDARE, a framework that enhances SAM2MOT for crowded scenes through three key components: (1) density-aware mask re-generation and (2) selective memory updates, both for adaptive mask control to preserve target feature integrity, and (3) state-aware association and new track initialization, which improves robustness under mutual occlusions and frequent frame-out events. Evaluated on the SportsMOT dataset, SAMIDARE achieves state-of-the-art performance, outperforming the baseline by 2.5 HOTA and 4.2 IDF1 points on the validation set. These results demonstrate that adaptive feature management using mask control and state-aware association provide a robust and efficient solution for dense sports tracking. Code is available at https://github.com/ZabuZabuZabu/SAMIDARE
Chinese Translation
自动化运动分析需要强大的多目标跟踪(MOT)能力,但基于分割的方法常常在密集场景中遭遇掩膜错误和身份切换问题。我们提出了SAMIDARE,一个通过三个关键组件增强SAM2MOT以应对拥挤场景的框架:(1)密度感知的掩膜再生成,(2)选择性记忆更新,用于自适应掩膜控制以保持目标特征的完整性,以及(3)状态感知的关联与新轨道初始化,旨在提高在相互遮挡和频繁掉帧事件下的鲁棒性。在SportsMOT数据集上的评估显示,SAMIDARE实现了当前的最优性能,在验证集上比基线提高了2.5个HOTA和4.2个IDF1点。这些结果表明,利用掩膜控制和状态感知关联进行适应性特征管理为密集运动跟踪提供了一个鲁棒且高效的解决方案。代码可在https://github.com/ZabuZabuZabu/SAMIDARE获取。
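The IDF1 figure quoted in the SAMIDARE abstract is an identity-level F1 score. As a reminder of what a gain on that metric measures, here is a minimal sketch using the standard definition IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN). The function name and the counts are made up for illustration and are not taken from the paper.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical identity-level counts, for illustration only.
score = idf1(idtp=800, idfp=100, idfn=100)
print(round(score, 3))
```

Because IDF1 counts a detection as correct only when it carries the right identity over the whole sequence, it is the metric most directly affected by the ID switches the abstract describes.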
cs.CV / 11 / 2604.22164

Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models

基于变压器模型的配对交互数据学习反应性人类动作生成
Soga, Masato, Takebayashi, Ryuki
Abstract
Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person's motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.
Chinese Translation
近年来,深度学习的进步使得从文本描述生成视频,以及从输入视频预测未来序列成为可能。同样,在人类动作建模中,可以通过文本生成动作或从单一个体的动作序列中进行预测。然而,这些方法主要集中在单一代理的动作生成上。相反,本研究解决了在交互场景中基于一个人的动作生成另一个人动作的问题,其中两者动作是相互依赖的。我们构建了一个从拳击比赛视频中提取的配对动作-反应动作序列的数据集,并研究了基于变压器模型在该任务中的有效性。具体而言,我们实现并比较了三个模型:简单变压器(Simple Transformer)、iTransformer和Crossformer。此外,我们引入了个体ID嵌入,以明确区分不同个体,使得模型能够维持结构一致性并更好地捕捉交互动态。实验结果表明,简单变压器能够生成可信的交互感知动作,而不会出现姿态崩溃,而iTransformer和Crossformer则随着时间的推移累积错误,导致动作生成不稳定。此外,所提出的个体ID嵌入有助于防止结构崩溃并提高动作一致性。这些结果强调了在交互感知动作生成中显式建模个体身份的重要性。
cs.CV / 12 / 2604.22174

Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery

解锁光学先验:基于谱引导的知识迁移用于合成孔径雷达广义类别发现
Xia, Jingyuan, Hu, Ruikang, Li, Ye, Yang, Zhixiong, Lan, Xu, Lu, Zhejun
Abstract
Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.
Chinese Translation
广义类别发现(GCD)在标签稀缺的合成孔径雷达(SAR)领域具有重要的潜力,但其有效性受到大型视觉模型(LVMs)固有光学先验与SAR图像之间跨模态不兼容的严重限制。现有的领域适应方法往往缺乏反映成像特征的归纳偏见,因此未能有效地将光学先验迁移到SAR领域。为了解决这一问题,提出了模态差异曲线(MDC),将跨模态差异建模为一种源自光谱能量分布的结构化频域描述符。基于这一框架,我们提出了MDC引导的跨模态先验迁移(MCPT)框架,这是一种在成对光学-SAR数据上进行预训练的范式。在该框架中,自适应频率标记化(AFT)将MDC转换为可学习的标记,而频率感知专家细化(FER)则使用这些标记执行带宽差异感知的特征细化。基于细化的表示,对比学习跨模态对齐细化的嵌入,并内化适应模式。最终,在成对预训练中学到的卓越SAR特征表示能力被应用于下游单模态SAR-GCD任务。大量实验证明,在多个主流数据集上取得了最先进的性能,表明频域差异建模能够更有效地将光学先验适应于SAR图像。
cs.CV / 13 / 2604.22177

Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

单编码器与多编码器的结合:缺失模态下大脑肿瘤分割的融合前表征方法
Song, Peibo, Xue, Xiaotian, Zhang, Jinshuo, Wang, Zihao, Liu, Jinhua, Fu, Shujun, Bao, Fangxun, Yeo, Si Yong
Abstract
Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME
Chinese Translation
多模态MRI提供了用于大脑肿瘤分割的互补信息,但临床扫描通常缺少一项或多项模态,从而降低分割性能。在本文中,我们提出了UniME(单编码器与多编码器的结合),这是一种针对缺失模态的大脑肿瘤分割的两阶段异构方法,旨在平衡细致结构捕获、跨模态互补建模和可用模态的利用。其核心思想是通过两阶段的异构结构将表征学习与分割解耦。第一阶段利用掩码图像建模对单个ViT(视觉转置)单编码器进行预训练,以建立一个对缺失模态鲁棒的统一表征。第二阶段则添加特定模态的CNN(卷积神经网络)多编码器,以提取高分辨率、多尺度、细粒度特征。我们将这些特征与全局表征进行融合,以实现精确分割。对BraTS 2023和BraTS 2024的实验结果表明,在不完整的多模态场景下,UniME的表现优于以往的方法。代码可在https://github.com/Hooorace-S/UniME获得。
cs.CV / 14 / 2604.22183

EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting

EvFlow-GS:基于光流的事件增强运动去模糊三维高斯点云重建
An, Feiyu, Deng, Yufei, Zhang, Zihui, Xiao, Rong
Abstract
Sharp 3D reconstruction from motion-blurred images alone is challenging, which has motivated recent methods to incorporate event cameras and exploit their microsecond temporal resolution. However, these methods suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to jointly optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) on the fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, so that the two optimizations reinforce each other. Experiments demonstrate the leading performance of our EvFlow-GS.
Chinese Translation
仅通过运动模糊图像实现清晰的三维重建变得具有挑战性,这促使最近的方法结合事件相机,利用微秒级的时间分辨率。然而,由于来自不准确的事件双重积分先验和噪声、模糊事件的误导性监督,它们在残余伪影和模糊纹理细节方面存在问题。在本研究中,我们提出了EvFlow-GS,这是一个统一的框架,利用事件流和光流优化端到端可学习的双重积分(LDI)、相机姿态和三维高斯点云重建(3DGS),并在运行时共同优化。具体而言,我们首先利用光流从事件中提取边缘信息,然后为不同模块单独制定了一种新型基于事件的损失。此外,我们利用一种新颖的事件残差先验来强化来自3DGS渲染图像之间强度变化的监督。最后,我们将3DGS和LDI的输出整合为一个联合损失,使得它们的优化能够相互促进。实验结果表明,EvFlow-GS的性能处于领先地位。
cs.CV / 15 / 2604.22190

From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification

从全局到局部:重新思考用于行人重识别的CLIP特征聚合
Zheng, Aotian, Sun, Winston, Alattar, Bahaa, Ablavsky, Vitaly, Hwang, Jenq-Neng
Abstract
CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space -- emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions -- synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal -- with SAGA's advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID.
Chinese Translation
基于CLIP的行人重识别(ReID)方法将空间特征聚合为一个单一的全局[CLS]标记,这一标记是针对图像-文本对齐而优化的,而非空间选择性,从而使得表示在遮挡和跨摄像头变化下变得脆弱。我们提出了SAGA-ReID,通过将中间补丁标记与在CLIP文本嵌入空间中参数化的锚向量对齐,重构身份表示——强调空间稳定的证据,同时压制受损或缺失的区域,而无需针对单个图像的文本描述。控制实验在两种定性不同的条件下分离聚合机制——合成掩蔽,在这种情况下身份信号缺失,以及现实的人为干扰者,在这种情况下,重叠的人引入语义上混淆的信号——在两种条件下,SAGA相较于全局池化的优势在遮挡增加时显著提升。基准评估结果确认,在标准和遮挡设置下,SAGA均在CLIP-ReID上获得了一致的提升,其中在全局池化最不可靠的情况下,提升幅度最大:在遮挡基准上最高可达+10.6的Rank-1。SAGA的聚合在更强大的骨干网络上超越了专用的顺序补丁聚合,确认了结构化重构解决了仅凭骨干网络质量和架构复杂性无法解决的瓶颈。代码可在https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID获得。
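The SAGA-ReID abstract contrasts a single global token with aggregation of patch tokens guided by anchor vectors. The abstract does not give the exact formulation, so the following is only a rough sketch of one plausible reading: each anchor pools the patch tokens with softmax weights over cosine similarity, so a patch that carries no signal receives almost no weight. The names `cosine` and `anchor_pool`, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def anchor_pool(patches, anchor, temperature=0.1):
    """Average patch tokens, weighted by a softmax over similarity to one anchor."""
    sims = [cosine(p, anchor) / temperature for p in patches]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * p[i] for w, p in zip(weights, patches))
              for i in range(len(anchor))]
    return pooled, weights


# Toy 2-D tokens; the third patch is all zeros, standing in for a masked region.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.0]]
pooled, weights = anchor_pool(patches, anchor=[1.0, 0.0])
print([round(w, 4) for w in weights])
```

In this toy setting the masked patch contributes a negligible weight, which is the qualitative behavior the abstract attributes to anchor-guided aggregation under synthetic masking.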
cs.CV / 16 / 2604.22192

CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

CharTide:通过三视角调优和询问驱动演化的数据中心图表到代码生成
Zheng, Xiangxi, He, Kuang, Hu, Jiayi, Yu, Ping, Yan, Rui, Yao, Yuan, Hou, Peng, Zeng, Anxiang, Wang, Alex Jinpeng
Abstract
Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
Chinese Translation
图表到代码生成要求视觉精度和语法正确性来自视觉-语言模型(VLMs)的严格把控。然而,现有的方法在根本上受到以数据为中心的限制:尽管图表到代码的数据集日渐增多,但仅仅扩大同质的图表-代码对将视觉感知与程序逻辑混淆,阻碍模型充分利用多模态监督的丰富性。我们提出了CharTide,一个新的数据中心框架,系统性地重新设计了图表到代码生成的训练和对齐数据。首先,我们通过三视角调优策略构建了一个包含200万样本的数据集,明确将训练分解为视觉感知、纯文本代码逻辑和模态融合流,使得一个70亿参数的模型能仅利用监督数据超越专门的基准。其次,我们将对齐重新表述为数据验证问题,而非启发式评分任务。为此,我们引入了一个基于信息不变性原则的询问驱动强化学习框架:下游模型应该对相同的视觉查询在原始和生成的图表上给出一致的答案。超越僵化的规则匹配或VLM评分,我们采用一个冷冻的检查器通过原子问答任务客观验证生成的图表,提供基于答案准确性的可验证奖励信号。在ChartMimic、Plot2Code和ChartX上的实验表明,CharTide-7B/8B显著超越开源基准,超过GPT-4o,并与GPT-5具有竞争力。
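The information-invariance principle in the CharTide abstract says a downstream model should answer the same visual questions identically on the original chart and on the chart rendered from generated code. A minimal sketch of a reward built on that idea follows; it assumes the answers are short strings that can be compared after simple normalization. The function name, the normalization, and the sample answers are hypothetical and do not come from the paper.

```python
def invariance_reward(answers_original, answers_generated):
    """Fraction of questions answered the same way on both charts."""
    if len(answers_original) != len(answers_generated):
        raise ValueError("both answer lists must cover the same questions")
    if not answers_original:
        return 0.0
    matches = sum(
        a.strip().lower() == b.strip().lower()
        for a, b in zip(answers_original, answers_generated)
    )
    return matches / len(answers_original)


# Hypothetical answers to three atomic questions about one chart.
on_original = ["12", "Q3", "blue"]
on_generated = ["12", "Q3", "red"]
print(invariance_reward(on_original, on_generated))
```

A reward of this shape is checkable per question, which is what distinguishes it from a single holistic score assigned by a VLM judge.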
cs.CV / 17 / 2604.22202

ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild

ArchSym:在真实场景中检测3D基础建筑对称性
Chen, Hanyu, Cai, Ruojin, Marschner, Steve, Snavely, Noah
Abstract
Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.
Chinese Translation
对称性检测是计算机视觉中的一个基本问题,而对称性作为强有力的先验知识对下游任务至关重要。然而,现有基于学习的方法在从单幅图像中检测3D对称性时,几乎完全依赖于以物体为中心的或合成的数据集进行训练和评估,因此无法泛化到真实场景。此外,由于单目输入固有的尺度模糊性,使得对3D平面的定位成为一个不适定问题,许多现有研究仅能预测平面的方向。本文通过提出首个从单幅真实RGB图像中检测3D基础反射对称性的框架,解决了这些局限,重点关注建筑地标。我们引入了两个关键创新:(1)一个可扩展的数据注释管道,通过利用跨视图图像匹配,自动整理大型建筑对称性数据集ArchSym,基于SfM重建;(2)在该数据集基础上,开发了一个单视图对称性检测器,通过将对称性参数化为相对于预测场景几何的有符号距离图,准确定位3D中的对称性。我们将对称性注释管道与基于几何的替代方案进行了验证,结果表明我们的对称性检测器在新的基准数据集上显著超越了最先进的基线模型。
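The ArchSym abstract describes localizing a reflection symmetry plane in 3D through signed distances. The underlying geometry is standard: for a plane n . x = d, the signed distance of a point p is (n . p - d) / |n|, and its mirror image is p minus twice that distance along the unit normal. The sketch below shows only this geometry; the function names are illustrative, and it does not reproduce the paper's per-pixel parameterization.

```python
import math


def signed_distance(point, normal, offset):
    """Signed distance from `point` to the plane normal . x = offset."""
    norm = math.sqrt(sum(c * c for c in normal))
    return (sum(p * c for p, c in zip(point, normal)) - offset) / norm


def reflect(point, normal, offset):
    """Mirror `point` across the plane normal . x = offset."""
    norm = math.sqrt(sum(c * c for c in normal))
    dist = signed_distance(point, normal, offset)
    return [p - 2.0 * dist * c / norm for p, c in zip(point, normal)]


# Plane x = 1; the point lies 2 units in front of it.
p = [3.0, 1.0, 2.0]
print(signed_distance(p, [1.0, 0.0, 0.0], 1.0))
print(reflect(p, [1.0, 0.0, 0.0], 1.0))
```

Knowing only the plane orientation fixes `normal` but not `offset`, which is the scale ambiguity the abstract points to for monocular input.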
cs.CV / 18 / 2604.22220

Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework

频域中的水印破解:一种调制扩散攻击框架
Wang, Chunpeng, Qu, Binyan, Wang, Xiaoyu, Xia, Zhiqiu, Zhang, Shanshan, Liu, Yunan, Li, Qi
Abstract
Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance and hindered further advances in the field. In this paper, we propose FMDiffWA, a frequency-domain modulated diffusion framework for watermark attacks. Specifically, we introduce a frequency-domain watermark modulation (FWM) module and incorporate it into the sampling stages of both the forward and reverse diffusion processes. This mechanism enables selective modulation of watermark-related frequency components, thereby allowing FMDiffWA to effectively neutralize the invisible watermark signals while preserving the perceptual quality of the attacked watermarked images. To achieve a better trade-off between attack efficacy and visual fidelity, we reformulate the training strategy of conventional diffusion models by augmenting the canonical noise estimation objective with an auxiliary refinement constraint. Comprehensive experiments demonstrate that FMDiffWA achieves superior visual fidelity compared to existing watermark attacks, while exhibiting strong generalization across diverse watermarking schemes.
Chinese Translation
数字图像水印技术在生成性人工智能版权保护方面发展迅速,但水印攻击技术的相对有限进展打破了攻击防御平衡,阻碍了该领域的进一步发展。本文提出了 FMDiffWA,一种用于水印攻击的频域调制扩散框架。具体而言,我们引入了频域水印调制(FWM)模块,并将其融入到前向和反向扩散过程的采样阶段。该机制能够选择性地调制与水印相关的频率分量,从而使 FMDiffWA 在有效中和隐形水印信号的同时,保持被攻击水印图像的感知质量。为了在攻击效果和视觉保真度之间实现更好的平衡,我们通过增添辅助精炼约束对传统扩散模型的训练策略进行了重构,强化了经典噪声估计目标。全面的实验表明,FMDiffWA 在视觉保真度上优于现有水印攻击,并在不同水印方案中表现出强大的泛化能力。
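The idea of selectively modulating watermark-related frequency components, as described in the FMDiffWA abstract, can be illustrated on a one-dimensional toy signal. The sketch below is not the FWM module: it uses a plain discrete Fourier transform, assumes the watermark occupies a known frequency bin, and simply scales that bin, whereas the paper embeds the modulation inside diffusion sampling. All names and signal values are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a toy signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def modulate_band(signal, bins, gain):
    """Scale the chosen frequency bins and their conjugate mirrors by `gain`."""
    n = len(signal)
    spectrum = dft(signal)
    targets = {k % n for k in bins} | {(n - k) % n for k in bins}
    for k in targets:
        spectrum[k] *= gain
    return idft(spectrum)


n = 8
host = [math.cos(2 * math.pi * 1 * t / n) for t in range(n)]        # content
mark = [0.1 * math.cos(2 * math.pi * 3 * t / n) for t in range(n)]  # toy watermark
marked = [h + m for h, m in zip(host, mark)]

cleaned = modulate_band(marked, bins=[3], gain=0.0)
residual = max(abs(c - h) for c, h in zip(cleaned, host))
print(residual < 1e-9)
```

In this idealized case the host and the watermark live in disjoint bins, so removal is exact. Real watermarks overlap with image content in frequency, which is why the abstract frames the problem as a trade-off between attack efficacy and visual fidelity.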
cs.CV / 19 / 2604.22226

Towards Temporal Compositional Reasoning in Long-Form Sports Videos

面向长格式体育视频的时间组合推理
Cao, Siyu, Zhang, Lu, Zeng, Ruizhe, Liu, Zhi-yong
Abstract
Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
Chinese Translation
体育视频是多模态理解中的一个具有挑战性的领域,因为它们涉及复杂且动态的人类活动。尽管多模态大型语言模型(MLLMs)取得了快速进展,但在体育视频中进行长时间推理仍然困难,因为回答问题需要定位时间上稀疏的证据并将其整合到推理中。我们将这一限制归因于两个密切相关的因素:对时间上分散证据的监督不足,以及缺乏要求模型识别、定位和证明时间证据的方法。为了解决这些问题,我们提出了SportsTime,这是一个用于长格式体育视频理解的大规模基准,其中包含超过14,000个开放式问答对和超过50,000个逐步时间证据注释。在SportsTime的基础上,我们提出了时间链推理(Chain-of-Time Reasoning,CoTR),将推理视为时间上有据的证据组合过程。具体而言,在训练过程中,CoTR引入了一种时效奖励的GRPO,以鼓励时间上有据的推理。在推理过程中,它采用了锚定-观察-推断的证据寻求循环,以迭代地定位、验证和组合时间证据,然后生成最终答案。实验表明,SportsTime作为基准的有效性和CoTR的有效性在强大的MLLM基线之上不断提高了时间组合推理和逐步基础质量。
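The CoTR abstract mentions a temporal reward that encourages temporally grounded reasoning but does not state its form. A common way to score how well a predicted time span matches an annotated one is temporal intersection-over-union, sketched below as an assumption rather than as the paper's reward. The function name and the example spans are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and annotated evidence spans.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
print(temporal_iou((0.0, 5.0), (30.0, 40.0)))
```

A score like this could be averaged over the step-wise evidence annotations of a question, so that an answer reached from the wrong moments of the video earns little credit.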
cs.CV / 20 / 2604.22240

OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

OccDirector:基于语言的行为和交互生成在4D占用空间中的应用
Liang, Zhuding, Yan, Tianyi, Chen, Dubing, Zheng, Jiasen, Zheng, Huan, Xu, Cheng-zhong, Wang, Yida, Zhan, Kun, Shen, Jianbing
Abstract
Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions, ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
Chinese Translation
生成性世界模型越来越依赖于4D占用数据以实现逼真的自主驾驶模拟。然而,现有的生成框架依赖于刚性几何条件(例如显式轨迹)或简单属性级文本,未能有效协调复杂的、顺序的多智能体交互。为了解决这一语义时空差距,我们提出了OccDirector,这是一个开创性的框架,仅基于自然语言生成4D占用动态。作为一个“场景导演”,OccDirector将自然语言脚本映射为物理上合理的体素动态,无需几何先验。技术上,它采用VLM驱动的时空多模态动态转换(Spatio-Temporal MMDiT),并配备历史前缀锚定策略,以确保长时间范围内的交互一致性。此外,我们引入了OccInteract-85k,这是一个新颖的数据集,独特地标注了多层级语言指令:从静态布局到复杂的多智能体行为,以及一个新的基于VLM的评估基准。大量实验证明,OccDirector在生成质量和指令遵循能力上取得了最先进的成果,成功地将范式从外观合成转变为基于语言的行为协调。
cs.CV / 21 / 2604.22260

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

朝向安全出行:通过开放式视觉语言数据集构建的统一交通基础模型
Huang, Wenhui, Zhang, Songyan, Chua, Collister, Liang, Yang, Mao, Zhiqi, Yang, Heng, Lv, Chen
Abstract
Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
Chinese Translation
城市交通系统面临日益增长的安全挑战,需要具有可扩展智能的应对新兴智能出行基础设施。尽管近年来基础模型和大规模多模态数据集的发展增强了智能交通系统(ITS)的感知和推理能力,但现有研究仍主要集中于微观自动驾驶(AD),对城市规模的交通分析关注有限。特别是,面向安全的开放式视觉问答(VQA)及其相应的基础模型在异构路侧摄像头观测中的推理仍未得到充分探索。为了填补这一空白,我们引入了陆地交通数据集(LTD),这是一个针对城市交通环境的开放式推理的大规模开源视觉语言数据集。LTD包含从异构路侧摄像头收集的11600对高质量VQA,涵盖多样的道路几何、交通参与者、光照条件和恶劣天气。该数据集整合了三项互补任务:细粒度多目标标注、多图像摄像头选择和多图像风险分析,要求在最小相关视角上进行联合推理,以推断危险物体、影响因素和风险道路方向。为了确保标注的准确性,我们结合多模型视觉语言生成与交叉验证及人机协作的优化。在LTD的基础上,我们进一步提出了UniVLT,这是一种通过基于课程的知识迁移训练的交通基础模型,旨在将微观AD推理和宏观交通分析统一于单一架构中。在LTD和多个AD基准上的广泛实验表明,UniVLT在各类领域的开放式推理任务中实现了SOTA(最先进)性能,同时揭示了现有基础模型在复杂多视角交通场景中的局限性。
cs.CV / 22 / 2604.22274

CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

CAGE-SGG:用于开放词汇场景图生成的反事实主动图证据
Guang, Suiyang, Liu, Chenyu, Zhang, Ruohan, Chen, Siyuan
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
Chinese Translation
开放词汇场景图生成(SGG)旨在使用灵活且细致的关系短语来描述视觉场景,超越固定的谓词词汇。尽管近期的视觉-语言模型显著扩展了SGG的语义覆盖范围,但也带来了一项关键的可靠性问题:预测的关系可能受到语言先验或对象共现的驱动,而非基于扎根的视觉证据。本文提出了一种基于反事实关系验证的证据基础开放词汇SGG框架。我们的方法不是直接接受合理的关系提议,而是验证每个候选关系是否得到特定关系的视觉、几何和上下文证据的支持。具体而言,我们首先通过视觉-语言提议者生成开放词汇关系候选,然后将谓词短语分解为支持、接触、包含、深度、运动和状态等软证据基础。关系条件证据编码器提取与谓词相关的线索,而反事实验证器则测试在必要证据被移除时关系分数是否下降,并在不相关扰动下保持稳定。我们进一步引入矛盾意识谓词学习和图级偏好优化,以提高细粒度的区分能力和全局图一致性。对传统开放词汇和全景SGG基准的实验表明,我们的方法一致地提升了基于标准召回的指标、未见谓词的泛化能力以及反事实基础质量。这些结果证明,从关系生成转向关系验证能够带来更可靠、可解释且基于证据的场景图。
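The counterfactual test described in the CAGE-SGG abstract has two parts: the relation score should fall when necessary evidence is removed, and it should stay roughly the same under irrelevant perturbations. The following sketch turns that description into a simple acceptance rule. The function name, the two thresholds, and the example scores are assumptions chosen for illustration; the abstract does not specify them.

```python
def passes_counterfactual_check(score_full, score_without_evidence,
                                score_irrelevant_perturbed,
                                min_drop=0.2, max_drift=0.05):
    """Accept a relation only if its score depends on the right evidence.

    min_drop and max_drift are hypothetical thresholds.
    """
    drops = (score_full - score_without_evidence) >= min_drop
    stable = abs(score_full - score_irrelevant_perturbed) <= max_drift
    return drops and stable


# Hypothetical scores for a candidate relation such as "cup on table".
print(passes_counterfactual_check(0.90, 0.30, 0.88))  # grounded in evidence
print(passes_counterfactual_check(0.90, 0.85, 0.90))  # likely a language prior
```

The second call is the case the abstract warns about: a relation whose score barely changes when the supporting evidence is removed is probably driven by object co-occurrence rather than by what is visible.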
cs.CV / 23 / 2604.22280

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

超越思维链:重写作为生成性多模态嵌入的通用接口
Wu, Peixi, Mei, Ke, Ma, Feipeng, Chai, Bosong, Lan, Zhibin, Zhao, Chenxi, Yan, Shannan, Chen, Jie, Hu, Zhangchi, Peng, Yansong, Lin, Bo, Zhou, Junjie, Yin, Dacheng, Wang, Tianyi, Rao, Fengyun, Lyu, Jing, Li, Hebei, Sun, Xiaoyan
Abstract
Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
Chinese Translation
多模态大型语言模型(MLLMs)已成为通用多模态嵌入的有希望的基础。近期研究表明,以推理驱动的生成性多模态嵌入在多个嵌入任务中优于判别性嵌入。然而,思维链(Chain-of-Thought, CoT)推理往往会产生冗余的思维步骤,并在更广泛的检索场景中引入摘要答案的语义模糊。为了解决这一局限性,我们提出了基于重写驱动的多模态嵌入(Rewrite-driven Multimodal Embedding, RIME),这是一个通过检索友好的重写共同优化生成和嵌入的统一框架。同时,我们提出了跨模式对齐(Cross-Mode Alignment, CMA)来连接生成性和判别性嵌入空间,使得灵活的相互检索可以在效率和准确性之间进行权衡。在此基础上,我们还引入了精细强化学习(Refine Reinforcement Learning, Refine-RL),将判别性嵌入视为稳定的语义锚点,以指导重写优化。在MMEB-V2、MRMR和UVRB上的大量实验证明,RIME在显著减少思维长度的同时,远远优于之前的生成性嵌入模型。
cs.CV / 24 / 2604.22281

DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

DocPrune:通过背景、问题和理解感知的高效文档问答中的令牌修剪
Choi, Joonmyung, Lee, Sanghyeok, Kim, Jongha, Kim, Sehyung, Ko, Dohwan, Kil, Jihyung, Kim, Hyunwoo J.
Abstract
Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while improving the F1 score by 1.0, achieving both higher accuracy and efficiency without any additional training.
Chinese Translation
近期视觉语言模型的进展在多种多模态任务中展现了显著的性能,包括利用文本、表格和图形的结构化视觉线索进行的文档问答。然而,与自然图像不同,文档图像包含大量背景信息,并且只有稀疏的支持证据,这导致在处理长文档时大量计算资源的低效消耗。我们观察到,现有的自然图像和视频中的令牌缩减方法未能有效利用文档特有的结构稀疏性。为了解决这个问题,我们提出了DocPrune,一种无需训练的逐步文档令牌修剪框架,旨在实现高效的长文档理解。该方法仅保留任务所需的关键令牌,同时去除不必要的令牌,如背景或与问题无关的令牌。此外,它会根据模型的理解水平自动选择合适的层级来启动令牌修剪。我们在M3DocRAG上的实验表明,DocPrune在编码器和解码器中分别提高了3.0倍和3.3倍的吞吐量,同时将F1分数提升了+1.0,实现了更高的准确性和效率,而无须额外的训练。
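The DocPrune abstract describes removing background tokens and question-irrelevant tokens without training. A minimal sketch of such a two-step filter is given below, assuming each token already carries a pixel-variance value (low variance taken to mean blank background) and a question-relevance score. The field names, the threshold, and the keep ratio are hypothetical, and the comprehension-aware choice of layer mentioned in the abstract is not modeled.

```python
def prune_tokens(tokens, bg_threshold=0.05, keep_ratio=0.5):
    """Return the ids of kept tokens, in their original order.

    Step 1 drops tokens whose variance marks them as background.
    Step 2 keeps the most question-relevant share of what remains.
    """
    foreground = [t for t in tokens if t["variance"] > bg_threshold]
    if not foreground:
        return []
    k = max(1, int(len(foreground) * keep_ratio))
    ranked = sorted(foreground, key=lambda t: t["relevance"], reverse=True)[:k]
    kept = {t["id"] for t in ranked}
    return [t["id"] for t in tokens if t["id"] in kept]


# Hypothetical per-token statistics for six image patches.
variances = [0.0, 0.3, 0.2, 0.01, 0.4, 0.25]
relevances = [0.9, 0.1, 0.8, 0.7, 0.6, 0.2]
tokens = [{"id": i, "variance": v, "relevance": r}
          for i, (v, r) in enumerate(zip(variances, relevances))]
print(prune_tokens(tokens))
```

Tokens 0 and 3 score high on relevance but are dropped first as background, which shows why the order of the two filters matters in this sketch.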
cs.CV / 25 / 2604.22296

Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment

评估用于月球环境合成图像模拟的开源图像模拟解决方案
Singla, Jai G, Patel, Hinal B, Dube, Nitant
Abstract
Synthetic image generation is a crucial input for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems, and plan exploration activities in a virtual environment before actual deployment. Image simulation is essential for assessing landing sites, detecting hazards, and validating navigation systems in a mission. This study offers a detailed evaluation of various image simulation approaches for the lunar environment, with particular emphasis on the effects of different camera models and illumination conditions on the quality of synthetic lunar images. These images are produced using real Digital Elevation Models (DEMs) and terrain data derived from instruments such as the Chandrayaan-2 Orbiter High Resolution Camera (OHRC) and NASA's Wide Angle Camera (WAC) and Narrow Angle Camera (NAC). This research aims to improve the reliability of synthetic imagery in supporting autonomous navigation and decision-making systems in lunar exploration. This work contributes to the development of more effective tools for generating important information for future lunar missions and enhances the understanding of the Moon's surface environment.
Chinese Translation
合成图像生成是行星任务的重要输入之一。它使研究人员和工程师能够在实际部署之前在虚拟环境中可视化规划的行星任务、测试成像系统和规划探测活动。图像模拟对于评估着陆点、检测危害以及验证导航系统在任务中的有效性至关重要。本研究对多种月球环境下的图像模拟方法进行了详细评估,特别强调了不同相机模型和光照条件对合成月球图像质量的影响。这些图像是利用真实的数字高程模型(Digital Elevation Models, DEM)以及分别来自于恆星号探测器(Chandrayaan-2 Orbiter High Resolution Camera, OHRC)和美国国家航空航天局(NASA)的广角相机(Wide Angle Camera, WAC)、窄角相机(Narrow Angle Camera, NAC)的地形数据生成的。本研究旨在提高合成图像在支持月球探索中自主导航和决策系统的可靠性。这项工作为生成未来月球任务所需的重要信息开发更有效的工具做出了贡献,并增强了对月球表面环境的理解。
cs.CV / 26 / 2604.22302

Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation

知识可视化:知识密集型文本到图像生成的基准和方法
Zhao, Ran, Jin, Sheng, Wu, Size, Liao, Kang, Gong, Zerui, Guo, Zujin, Xiao, Yang, Li, Wei
Abstract
Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and codes are publicly available at https://github.com/zhaoran66/KVBench.
Chinese Translation
近期的文本到图像(T2I)模型在照片真实感合成和指令遵循方面展现了令人印象深刻的能力。然而,它们在知识密集型环境下的可靠性仍然基本未得到探索。与自然图像生成不同,知识可视化不仅需要语义对齐,还需要严格遵守领域知识、结构性约束和符号惯例,这揭示了视觉可信度与科学正确性之间的关键差距。为系统性地研究这一问题,我们引入了KVBench,这是一个基于课程的基准,用于评估知识密集型T2I生成。KVBench涵盖了六个高中科目:生物、化学、地理、历史、数学和物理。该基准由1800个经过专家审定的提示构成,来源于30多本权威教科书。使用这一基准,我们评估了14个最尖端的开源和闭源模型,揭示了在逻辑推理、符号精准度和多语言鲁棒性方面存在重大缺陷,其中开源模型的表现始终低于专有系统。为解决这些局限性,我们进一步提出了KE-Check,这是一个两阶段框架,通过(1)知识细化进行结构化提示丰富,以及(2)清单指导的精细化通过违反识别和约束指导编辑进行明确的约束执行,从而提高科学忠实度。KE-Check有效缓解了科学幻觉,缩小了开源和领先闭源模型之间的性能差距。数据和代码已在https://github.com/zhaoran66/KVBench上公开。
cs.CV / 27 / 2604.22310

Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization

基于双趋同线的几何混淆技术在隐私保护图像查询中的再探讨
Kim, Jeonggon, Moon, Heejoon, Hong, Je Hyeong
Abstract
Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attacks. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.
Chinese Translation
隐私保护图像查询(PPIQ)是一种新兴的云端视觉定位机制,它通过混淆特征而非私有图像或原始关键点来实现姿态估计。然而,目前PPIQ的主要方法,主要是基于几何和分割的混淆,均存在易受近期隐私攻击的漏洞。尤其是,基于几何的混淆的一个根本限制在于,混淆邻近线的空间分布仍然有效地围绕着原始关键点位置,使得恢复原始点的线索可供利用。我们重新审视这一几何范式,并引入双趋同线(Dual Convergent Lines, DCL),一种新颖的关键点混淆方法,展示了对这类攻击的强大抗性。DCL在一条中心分割线上放置两个固定的锚点,并将每个关键点提升到源自其中一个锚点的线,其中活动锚点由关键点的位置决定。这种安排通过使攻击优化变得不适定来使几何恢复攻击失效:邻近线要么误导性地趋于一个锚点,导致一个平凡解,要么在分割边界接近平行,导致一个不稳定的高方差解。这两种结果都阻碍了点的恢复。DCL还兼容现有的基于线的求解器,能够在传统定位管道中部署。在室内和大规模户外数据集上进行的实验表明,DCL在隐私攻击下的稳健性、效率和可扩展性,同时实现了实用的定位性能。
cs.CV / 28 / 2604.22331

Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation

深度感知探测器:边缘人工智能与单目视觉在现实世界中的应用研究
Relia, Lomash, Singla, Jai G, Amitabh, Dube, Nitant
Abstract
This study analyses simulated and real-world implementations of depth-aware rover navigation, highlighting the transition from stereo vision to monocular depth estimation using edge AI. A Unity-based lunar terrain simulator with stereo cameras and OpenCV's StereoSGBM was used to generate disparity maps. A physical rover built on Raspberry Pi 4 employed UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection. While stereo vision yielded higher accuracy in simulation, the monocular approach proved more robust and cost-effective in real-world deployment, achieving 0.1 FPS for depth and 10 FPS for detection.
Chinese Translation
本研究分析了深度感知探测器导航在仿真与现实世界中的实现,重点讨论了借助边缘人工智能从立体视觉向单目深度估计的过渡。研究使用基于Unity的月球地形模拟器,结合立体相机和OpenCV的StereoSGBM生成视差图。基于Raspberry Pi 4构建的实体探测器采用UniDepthV2进行单目度量深度估计,并使用YOLO12n进行实时目标检测。尽管立体视觉在仿真中精度更高,但单目方法在现实世界部署中更鲁棒、成本效益更高,深度估计达到0.1 FPS,目标检测达到10 FPS。
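The stereo branch described above produces disparity maps; turning a disparity into metric depth uses the standard pinhole stereo relation Z = f · B / d. The sketch below is only an illustration of that relation, not the authors' code. The focal length and baseline values are made-up assumptions, and `disparity_to_depth` is a hypothetical helper name.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Return metric depth Z = f * B / d, or None when the disparity is invalid.

    disparity_px: disparity in pixels (already converted to float pixels;
                  OpenCV's StereoSGBM returns fixed-point values scaled by 16,
                  so those would need dividing by 16 first).
    focal_px:     focal length in pixels (assumed value in this sketch).
    baseline_m:   distance between the two cameras in metres (assumed value).
    """
    if disparity_px <= 0:
        return None  # no valid match, depth is undefined
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters, chosen only to keep the arithmetic simple.
FOCAL_PX = 800.0
BASELINE_M = 0.125

disparities = [50.0, 25.0, 0.0]
depths = [disparity_to_depth(d, FOCAL_PX, BASELINE_M) for d in disparities]
print(depths)
```

Halving the disparity doubles the estimated depth, which is why stereo depth error grows quickly for distant terrain.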
cs.CV / 29 / 2604.22333

ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding

ChangeQuery:推进自然及人类诱发灾害的遥感变化分析,从视觉检测到语义理解
Sun, Dongwei, Yao, Jing, Wei, Kan, Cao, Xiangyong, Wu, Chen, Zhao, Zhenghui, Ghamisi, Pedram, Zhou, Jun, Benediktsson, Jón Atli
Abstract
Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a "statistics-first, generation-later" paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state of the art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at https://sundongwei.github.io/changequery/.
Chinese Translation
在灾后响应中,快速的情境意识至关重要。尽管遥感损伤评估正在从像素级变化检测向高级语义分析发展,但现有的视觉-语言方法仍然难以为复杂的战略查询提供可操作的情报。它们受到单模态光学依赖、对自然灾害的偏见以及缺乏基于情境的交互性等限制。为了解决这些问题,我们提出了ChangeQuery,这是一个统一的多模态框架,旨在实现全面的全天候灾害情境意识。为了克服模态限制和场景偏见,我们构建了灾害诱发变化查询(Disaster-Induced Change Query, DICQ)数据集,这是一个大规模基准,结合了事件前的光学语义与事件后SAR(合成孔径雷达)结构特征,涵盖自然灾害与武装冲突的平衡分布。此外,为了提供进行交互推理所需的高质量监督,我们提出了一种新颖的自动化语义注释管道。该系统遵循“优先统计,后生成”的范式,自动将原始分割掩膜转化为基于情境的层级指令集,有效地为模型赋予了细粒度的空间和定量感知。基于这类结构化数据训练的ChangeQuery架构充当了一个交互式灾害分析师,支持多任务推理,满足多样的用户查询,提供精确的损害量化、区域特定描述及全面的灾后摘要。广泛的实验表明,ChangeQuery建立了新的最先进水平,为复杂的灾害监测提供了强大且可解释的解决方案。代码可在 https://sundongwei.github.io/changequery/ 获取。
cs.CV / 30 / 2604.22334

FILTR: Extracting Topological Features from Pretrained 3D Models

FILTR:从预训练三维模型中提取拓扑特征
Martinez, Louis, Ovsjanikov, Maks
Abstract
Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.
Chinese Translation
近年来,三维点云编码器(例如Point-BERT、Point-MAE)的预训练取得了显著进展,产生了强大的模型,其能力通常在几何或语义任务上进行评估。同时,拓扑描述符已被证明能够为形状的多尺度结构提供信息丰富的概括。本文提出的问题是:能否从3D编码器生成的特征中导出拓扑信息。为回答这一问题,我们首先引入了DONUT,这是一个拓扑复杂度可控的合成基准,并提出了FILTR(Filtration Transformer),一种可学习的框架,用于直接从冻结的编码器预测持久性图。FILTR改造了Transformer解码器,将持久性图的生成视为集合预测任务。我们在DONUT上的分析表明,现有编码器仅保留了有限的全局拓扑信号,但FILTR仍能成功利用这些编码器产生的信息来逼近持久性图。我们的方法首次通过高效的可学习前馈机制,以数据驱动的方式从原始点云中提取持久性图。
cs.CV / 31 / 2604.22339

Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM

Flow4DGS-SLAM:光流引导的4D高斯泼溅(Gaussian Splatting)SLAM
Wang, Yunsong, Lee, Gim Hee
Abstract
Handling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency.
Chinese Translation
在视觉同步定位与地图构建(SLAM)中,处理动态环境是一个重要的研究挑战。最近的研究将3D高斯泼溅(3D Gaussian Splatting, 3DGS)与SLAM结合,以同时实现稳健的相机位姿估计和逼真的渲染。然而,利用SLAM高效重建静态和动态区域仍然具有挑战性。在本研究中,我们提出了一种由光流引导的高效动态3DGS SLAM框架。利用输入深度和先验光流,我们首先提出了一种类别无关的运动掩码生成策略,通过拟合相机自运动模型来分解光流。该模块将动态高斯与静态高斯分开,同时提供光流引导的相机位姿初始化。我们通过在关键帧处显式建模动态高斯的时间中心来提高动态3DGS的训练速度。这些中心利用3D场景流先验进行传播,并采用自适应插入策略进行动态初始化。此外,我们使用高斯混合模型(GMM)对随时间变化的不透明度和旋转进行建模,以自适应地学习复杂的动态。实验结果表明,我们的方法在跟踪、动态重建和训练效率方面达到了最先进的性能。
cs.CV / 32 / 2604.22350

PoseFM: Relative Camera Pose Estimation Through Flow Matching

PoseFM:通过流匹配实现相对相机姿态估计
Kuczkowski, Dominik, Ruotsalainen, Laura
Abstract
Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at https://github.com/helsinki-sda-group/posefm.
Chinese Translation
单目视觉里程计(VO)是计算机视觉中的一个基础问题,广泛应用于自主导航、增强现实等领域。尽管基于深度学习的方法在准确性上已相较于传统几何流程表现出优势,特别是在手工特征由于结构不佳或光照条件差而难以发挥作用的环境中,大多数方法仍依赖于确定性回归,这缺乏对鲁棒应用所需的不确定性意识。我们提出了PoseFM,首个将单目帧间VO重新表述为生成任务的框架,采用流匹配(Flow Matching,FM)技术。通过利用FM,我们将相机运动建模为一个分布,而非点估计,学习通过连续时间常微分方程(ODE)将噪声转化为真实的姿态预测。这种方法为不确定性估计提供了一个有原则的机制,并在具有挑战性的视觉条件下实现鲁棒的运动推断。在我们的评估中,PoseFM在TartanAir、KITTI和TUM-RGBD基准测试中表现出色,在某些轨迹上实现了最低的绝对轨迹误差(ATE),并且整体上与最好的帧间单目VO方法具有竞争力。代码和模型检查点将发布在 https://github.com/helsinki-sda-group/posefm。
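The abstract describes learning to transform noise into pose predictions via a continuous-time ODE. The sketch below illustrates that idea with the commonly used linear interpolation path, which is an assumption here since the abstract does not state which path PoseFM uses. `TRUE_POSE` and `oracle_velocity` are hypothetical stand-ins for a trained velocity network, included only so the sampler can be run end to end.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field under the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical relative translation; a real system would predict this.
TRUE_POSE = [0.1, -0.2, 1.5]


def oracle_velocity(x, t):
    """Stand-in for a trained network: exact field that transports any
    starting point to TRUE_POSE along a straight line."""
    return [(p - xi) / (1.0 - t) for p, xi in zip(TRUE_POSE, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TRUE_POSE]
sample = euler_sample(oracle_velocity, noise)
print(sample)
```

With a learned field in place of the oracle, different noise draws would generally give different poses, and the spread of those samples is one way such a model could expose uncertainty.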
cs.CV / 33 / 2604.22354

One Shot Learning for Edge Detection on Point Clouds

点云边缘检测的一次性学习
Tu, Zhikun, Zhang, Yuhe, Jia, Yiou, Li, Kang, Cohen-Or, Daniel
Abstract
Each scanner possesses its unique characteristics and exhibits its distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less effective than training it on data specific to a single scanner. Therefore, we present a novel one-shot learning method allowing for edge extraction on point clouds, by learning the specific data distribution of the target point cloud, and thus achieve superior results compared to networks that were trained on general data distributions. More specifically, we present how to train a lightweight network named OSFENet (One-Shot edge Feature Extraction Network), by designing a filtered-KNN-based surface patch representation that supports a one-shot learning framework. Additionally, we introduce an RBF_DoS module, which integrates Radial Basis Function-based Descriptor of the Surface patch, highly beneficial for the edge extraction on point clouds. The advantage of the proposed OSFENet is demonstrated through comparative analyses against 7 baselines on the ABC dataset, and its practical utility is validated by results across diverse real-scanned datasets, including indoor scenes like S3DIS dataset, and outdoor scenes such as the Semantic3D dataset and UrbanBIS dataset.
Chinese Translation
每个扫描仪具有其独特的特性并表现出独特的采样误差分布。在包含来自不同扫描仪的数据的数据集上训练网络,其效果不如在特定于单一扫描仪的数据上进行训练。因此,我们提出了一种新颖的一次性学习方法,通过学习目标点云的特定数据分布来实现点云的边缘提取,从而获得优于在通用数据分布上训练的网络的效果。具体而言,我们展示了如何训练一个名为 OSFENet(一次性边缘特征提取网络)的轻量级网络,设计了一种基于过滤的 KNN 的表面补丁表示,支持一次性学习框架。此外,我们介绍了一个 RBF_DoS 模块,它集成了基于径向基函数的表面补丁描述符,对点云的边缘提取非常有利。通过与 ABC 数据集上的 7 个基准模型进行比较分析,证明了所提出的 OSFENet 的优势,并通过在多种真实扫描数据集(包括室内场景如 S3DIS 数据集,以及室外场景如 Semantic3D 数据集和 UrbanBIS 数据集)上的结果验证了其实际应用价值。
cs.CV / 34 / 2604.22379

Efficient Diffusion Distillation via Embedding Loss

通过嵌入损失进行高效的扩散蒸馏
Ying, Jincheng, Chen, Yitao, Wenlin, Li, Xu, Minghui, Xiao, Yinhao
Abstract
Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher's performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
Chinese Translation
最近在将昂贵的扩散模型蒸馏为高效的少步生成器方面取得了显著进展,显示出了良好的前景。然而,这些方法通常需要大量的计算资源和较长的训练周期,限制了资源有限的研究人员的可及性,而现有的辅助损失函数也存在明显的局限性。回归损失要求在训练之前生成大量的数据集,并将学生模型限制于教师的性能,而基于生成对抗网络(GAN)的损失则面临训练不稳定的问题,需要仔细的调优。在本文中,我们提出了嵌入损失(Embedding Loss, EL),这是一种新颖的辅助损失函数,旨在补充现有的扩散蒸馏方法,以提高生成质量并在较小批量下加速训练。通过利用一组多样化的随机初始化网络的特征嵌入,EL有效地对齐了蒸馏得到的少步生成器与原始数据之间的特征分布。通过在嵌入特征空间中计算最大均值差异(Maximum Mean Discrepancy, MMD),EL确保了稳健的分布匹配,从而在蒸馏过程中保持样本的保真度和多样性。在分布匹配蒸馏框架中,EL在单步生成器上表现出强大的实证性能。在CIFAR-10数据集上,我们的方法在无条件生成任务中达到了1.475的最先进FID值,在条件生成任务中达到了1.380。除了CIFAR-10外,我们还在多个基准和蒸馏方法上进一步验证了EL,包括ImageNet、AFHQ-v2和FFHQ数据集,以及DMD、DI和CM蒸馏框架,相较于现有的单步蒸馏方法取得了一致的提升。我们的方法还将训练迭代次数减少了多达80%,为在资源有限的环境中部署基于扩散的生成模型提供了更为实用和可扩展的解决方案。
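The abstract describes computing Maximum Mean Discrepancy (MMD) between generated and real samples in the feature space of randomly initialized networks. The following is a minimal sketch of that idea under stated assumptions: a single random linear layer with ReLU stands in for the "randomly initialized networks", the kernel is a Gaussian RBF with a hand-picked bandwidth, and the estimator is the simple biased one. None of these choices are confirmed by the abstract, and all function names are hypothetical.

```python
import math
import random


def random_relu_embedding(dim_in, dim_out, rng):
    """Build a fixed, untrained embedding: random linear map followed by ReLU."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(dim_in)) for _ in range(dim_in)]
               for _ in range(dim_out)]

    def embed(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return embed


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel between two feature vectors."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
embed = random_relu_embedding(dim_in=4, dim_out=8, rng=rng)

# Toy "real" data and two toy "generators": one matching, one shifted.
real = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]
close = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]
far = [[rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(64)]

real_feats = [embed(x) for x in real]
loss_close = mmd_squared(real_feats, [embed(x) for x in close], sigma=2.0)
loss_far = mmd_squared(real_feats, [embed(x) for x in far], sigma=2.0)
print(loss_close, loss_far)
```

A generator whose samples match the data distribution yields a small value, while a shifted generator yields a larger one, which is the property a distribution-matching loss needs.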
cs.CV / 35 / 2604.22388

HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos

HFS-TriNet:一种用于从经直肠超声视频中分类前列腺癌的三分支协作特征学习网络
Lu, Xu, Peng, Qianhong, Zhou, Qihao, Liu, Shaopeng, Ye, Xiuqin, Yang, Chuan, Yuan, Yuan
Abstract
Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.
Chinese Translation
经直肠超声(TRUS)成像是一种经济有效且无创的诊断前列腺癌的方式。近年来,基于TRUS图像的计算机辅助诊断(CAD)已被广泛研究。与静态图像相比,TRUS视频提供了更丰富的时空信息,这使其成为提高CAD系统准确性和鲁棒性的有希望的替代方案。然而,TRUS视频分析也带来了新的挑战,包括信息冗余,增加了计算成本;高的类内和类间相似性,复杂化了特征提取;以及低信噪比,妨碍临床相关信息的识别。为了解决这些问题,我们提出了一种启发式帧选择(HFS)和一种三分支协作特征学习网络(HFS-TriNet)用于从TRUS视频中分类前列腺癌。具体而言,以间隔选择视频帧片段进行训练可以减轻冗余。HFS策略动态初始化每个训练片段的起始点,确保所采样的片段覆盖整个视频序列。为了更好地进行特征提取,除了常规的ResNet50分支外,我们还利用了1)一个基于预训练医疗分割模型(Segment Anything Model, SAM)的大模型分支,以提取每帧的深层特征,并使用基于归一化的注意模块探索时间一致性;以及2)一个小波变换卷积残差(Wavelet Transform Convolutional Residual, WTCR)分支,该分支提取高频域中的病灶边缘信息,并在低频域中进行去噪。
cs.CV / 36 / 2604.22390

Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition

区域至关重要:高效可靠的区域感知视觉地点识别
Chen, Shunpeng, Song, Yukun, Wang, Changwei, Xu, Rongtao, Fu, Kexue, Gao, Longxiang, Guo, Li, Wang, Ruisheng, Xu, Shibiao
Abstract
Visual Place Recognition (VPR) determines a query image's geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at https://github.com/chenshunpeng/FoL.
Chinese Translation
视觉地点识别(Visual Place Recognition, VPR)通过将查询图像与带地理标记的数据库进行匹配来确定其地理位置。然而,现有方法难以应对无关区域造成的感知混叠,以及僵化的候选调度导致的低效重排序。为了解决这些问题,我们提出了FoL++,一种将稳健的判别性区域建模与自适应重排序相结合的方法。具体而言,我们提出了一个可靠性估计分支来生成空间可靠性图,显式建模对遮挡的抵抗能力。该表示通过两个空间对齐损失(SAL和SCEL)进一步优化,以有效对齐特征并突出显著区域。对于无需人工标注的弱监督学习,伪对应策略直接从聚合簇生成密集的局部特征监督。我们的自适应候选调度器根据全局相似性动态调整候选池的大小。通过按可靠性对局部匹配加权并自适应融合全局与局部证据,FoL++超越了传统的独立匹配系统。在七个基准上的大量实验表明,FoL++以轻量的内存占用实现了最先进的性能,推理速度比FoL提高了40%。代码和模型将发布在 https://github.com/chenshunpeng/FoL (并与FoL合并)。
cs.CV / 37 / 2604.22409

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

SpaMEM:通过感知-记忆融合在具身环境中基准测试动态空间推理
Liao, Chih-Ting, Xiao, Xi, Meng, Chunlei, Chen, Zhangquan, Qiao, Yitong, Zhou, Weilin, Wang, Tianyang, Zheng, Xu, Cao, Xin
Abstract
Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
Chinese Translation
多模态大语言模型(MLLMs)在静态视觉-空间推理方面取得了进展,但它们在具身环境中常常无法保持长期空间一致性,因为在环境变化下,信念必须不断从自我中心的观察中进行修正。我们提出了SpaMEM(基于行动序列的空间记忆),这是一个大规模的诊断基准,旨在通过基于行动的场景转变(生成、放置、移除)在长交互时间范围内隔离空间信念演变的机制。SpaMEM建立在一个物理基础的数据集之上,该数据集包含10,601,392幅高保真图像,涵盖四种模态(RGB、深度、实例、语义分割),这些图像来自于1,000个程序生成的房屋中的25,000多个交互序列。我们将具身空间推理形式化为一个三级层次结构,其中包含15个诊断任务:第一层测量来自单一观察的原子空间感知;第二层通过神谕文本状态历史探测时间推理,以消除感知噪声;第三层要求在相同任务维度下从原始视觉流中进行端到端的信念维护。我们进一步评估了短期(逐步)更新和长期(情节)重建。对代表性开源视觉语言模型(VLM)家族进行基准测试揭示了一个一致的堆叠瓶颈:坐标一致性基础仍然是一个硬性上限,而从第二层到第三层的急剧崩溃暴露了显著的符号搭建依赖性,模型在文本基础的记账中取得成功,但难以维持稳健的视觉记忆。SpaMEM提供了一个细粒度的诊断标准,并激励明确的状态表示、信念修正和长期情节集成机制。
cs.CV / 38 / 2604.22439

NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting

NRGS:用于鲁棒3D语义高斯泼溅(Gaussian Splatting)的神经正则化
Yang, Zaiyan, Liu, Xinpeng, Guo, Heng, Shi, Jinglei, Ma, Zhanyu, Okura, Fumio
Abstract
We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.
Chinese Translation
我们提出了一种神经正则化方法,用于细化由多视图不一致的二维特征提升到三维后产生的含噪3D语义场,从而获得准确且鲁棒的3D语义高斯泼溅(Gaussian Splatting)。由于缺乏跨视图约束,从视觉基础模型提取的二维特征存在多视图不一致性。直接将这些不一致的特征提升到3D高斯会产生含噪的语义场,从而降低下游任务的性能。以往的方法要么专注于在预处理阶段获取一致的多视图特征,要么试图通过改进优化策略来减轻噪声,通常以更长的预处理时间或高昂的计算开销为代价。相比之下,我们引入了一种方差感知的条件多层感知机(MLP),它直接作用于3D高斯,利用其几何和外观属性来纠正3D空间中的语义错误。在不同数据集上的实验表明,我们的方法提高了提升后语义的准确性,为鲁棒的3D语义高斯泼溅提供了一种高效且有效的方法。
cs.CV / 39 / 2604.22476

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

聚焦工作流程:自动化和高效的视频流事件发现
Pegoraro, Marco, Seng, Jonas, Heller, Dustin, van der Aalst, Wil M. P., Kersting, Kristian
Abstract
Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
Chinese Translation
业务流程管理和过程挖掘等学科通过发现录制事件数据中关于流程的见解来帮助组织。然而,影响流程分析的一个障碍是数据的多模态性:例如,视频形式的数据并不能直接被解释为事件。在本研究中,我们提出了SnapLog,一种通过使用图像嵌入将帧转换为特征向量,并通过逐帧相似性矩阵进行时间分割,从视频中提取事件数据的方法。接着,我们使用一种广义的少量样本分类方法为视频片段分配标签,从而生成可以解释为事件的带标签、带时间戳的帧子序列。传统的过程挖掘技术可以用于分析所产生的数据。我们展示了我们的方法生成的日志准确反映了视频中的过程。
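The abstract describes converting frames to embedding vectors and segmenting the video by frame-wise similarity. The sketch below is a simplified illustration, not SnapLog itself: it compares only consecutive frames rather than building a full similarity matrix, uses a fixed cosine threshold (an assumed value), and leaves out the few-shot labeling step. The toy two-dimensional embeddings stand in for real image embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment_frames(embeddings, timestamps, threshold=0.8):
    """Split frames into segments wherever consecutive similarity drops.

    Returns a list of (start_time, end_time, frame_indices) tuples, i.e.
    timestamped sub-sequences that a later step could label as events.
    """
    if not embeddings:
        return []
    segments = []
    current = [0]
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            # Similarity dropped: close the current segment and start a new one.
            segments.append((timestamps[current[0]], timestamps[current[-1]], current))
            current = []
        current.append(i)
    segments.append((timestamps[current[0]], timestamps[current[-1]], current))
    return segments


# Toy embeddings: two frames of one activity, then three of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
times = [0.0, 0.5, 1.0, 1.5, 2.0]
segments = segment_frames(frames, times)
print(segments)
```

Each resulting tuple already has the shape of an event-log row (start, end, evidence); attaching an activity label to it is what the classification stage in the abstract would supply.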
cs.CV / 40 / 2604.22477

Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples

对比语义投影:使用对比示例进行可信的神经元标记
Bouanani, Oussama, Berend, Jim, Samek, Wojciech, Lapuschkin, Sebastian, Dreyer, Maximilian
Abstract
Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples -- inputs that are semantically similar to activating examples but elicit low activations -- to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
Chinese Translation
神经元标记是将文本描述分配给深度网络的内部单元。现有的方法通常依赖于高激活示例,往往通过关注主导但偶然的视觉因素,导致模糊或误导性的标签。之前的工作如 FALCON 引入了对比示例——与激活示例在语义上相似,但激活值较低的输入——以增强解释能力,但它主要关注子空间级解释,而非可扩展的神经元级标记。我们对神经元级标记的对比解释进行了两阶段回顾:(1) 使用视觉语言模型(VLMs)生成候选标签;(2) 使用 CLIP 类编码器进行标签分配。首先,我们展示了向 VLMs 提供对比图像集能够产生更具体且更可信的候选标签。其次,我们引入了对比语义投影(Contrastive Semantic Projection,CSP),这是一种扩展 SemanticLens 的方法,直接将对比示例融入其基于 CLIP 的评分和选择流程。在大量实验和黑色素瘤检测的案例研究中,对比标记在可信度和语义细粒度上均优于最先进的方法。我们的结果表明,对比示例是神经元标记和分析流程中一种简单而强大且当前尚未充分利用的组成部分。
cs.CV / 41 / 2604.22479

Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification

通过个性化的眼睛纵横比(EAR)/嘴部纵横比(MAR)阈值和基于卷积神经网络(CNN)的分类改进驾驶员疲劳检测
Ersoy, Gökdeniz, Tatar, Mehmet Alper, Tonbul, Eray, Kırbız, Serap
Abstract
Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.
Chinese Translation
驾驶员疲劳是全球交通事故的主要原因,严重威胁公共安全。基于视觉的驾驶员监控系统通常依赖固定的眼睛纵横比(EAR)和嘴部纵横比(MAR)阈值;然而,由于面部结构、光照和驾驶条件的差异,这些固定值常常无法在不同个体间推广。本文提出了一种个性化的驾驶员疲劳检测系统,该系统实时监测眼睑运动、头部位置和打哈欠行为,并在检测到疲劳迹象时发出警告。该系统采用在驾驶前校准的特定于驾驶员的EAR和MAR阈值,以改进传统的基于度量的检测。此外,还整合了基于深度学习的卷积神经网络(CNN)模型,以提高在复杂场景中的准确性。该系统使用公开可用的数据集以及在多种光照条件、头部姿势和用户特征下收集的定制数据集进行评估。实验结果表明,与固定阈值相比,个性化阈值将检测准确率提高了2-3%,而基于CNN的分类在眼部状态检测中达到了99.1%的准确率,在打哈欠检测中达到了98.8%的准确率,表明将传统度量与深度学习相结合能够有效实现鲁棒的实时驾驶员监控。
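The personalized thresholding described in this abstract can be illustrated with a small sketch. The EAR formula below is the commonly used six-landmark definition. The calibration rule (a fixed fraction of the driver's mean open-eye EAR) and the names `personalized_threshold` and `ratio` are assumptions made for illustration, not the authors' procedure.

```python
import math
from statistics import mean


def eye_aspect_ratio(pts):
    """EAR from six (x, y) eye landmarks ordered p1..p6.

    p1 and p4 are the eye corners, p2 and p3 lie on the upper lid,
    p5 and p6 on the lower lid (p6 below p2, p5 below p3).
    """
    p1, p2, p3, p4, p5, p6 = pts
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = 2.0 * math.dist(p1, p4)
    return vertical / horizontal


def personalized_threshold(open_eye_ears, ratio=0.75):
    # Hypothetical calibration rule: a fixed fraction of the driver's
    # mean open-eye EAR measured before driving.
    return ratio * mean(open_eye_ears)


def is_eye_closed(ear, threshold):
    # The eye counts as closed when EAR drops below the driver's threshold.
    return ear < threshold
```

A driver with a naturally low open-eye EAR gets a lower closure threshold than a fixed global value would give, which is the motivation the abstract states for per-driver calibration. The same pattern would apply to MAR with mouth landmarks.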
cs.CV / 42 / 2604.22482

Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond

Holo360D:一个用于推进全景3D重建及其扩展的大规模真实世界数据集,具有连续轨迹
Ou, Jing, Cao, Zidong, Ren, Yinrui, Li, Zhuoxiao, Zhu, Jinjing, Hua, Tongyan, Zhang, Shuai, Xiong, Hui, Zhao, Wufan
Abstract
While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.
Chinese Translation
尽管前馈3D重建模型迅速发展,但由于球面失真,它们在处理全景图像时仍表现出性能衰退。此外,现有的全景3D数据集主要是在固定的离散位置使用360度相机收集的,导致轨迹不连续。这些限制严重阻碍了全景前馈3D重建的发展,特别是在多视角设置下。本文提出了Holo360D,一个综合性数据集,包含109,495个与注册点云、网格和对齐相机姿态配对的全景图。根据我们的知识,Holo360D是第一个提供连续全景序列,并且准确对齐高完整性深度图的大规模数据集。原始数据最初是使用与360度相机相结合的3D激光扫描仪收集的。随后,原始数据通过在线和离线SLAM系统进行处理。此外,为了提高3D数据质量,提出了一种针对360度数据集的后处理管道,包括几何去噪、网格孔填充和区域特定的重网格化。最后,我们通过对Holo360D上的3D重建模型进行微调,建立了一个新的基准,为有效的微调策略提供了重要见解。我们的结果表明,Holo360D提供了优越的训练信号,为推进全景3D重建模型提供了全面的基准。数据集和代码将公开发布。
cs.CV / 43 / 2604.22498

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

CGC:用于细粒度多图像理解的组合基准对比
Zheng, Lihao, Shao, Zhenwei, Zhou, Yu, Yang, Yan, Shen, Xintian, Chen, Jiawei, Ma, Hao, Wei, Tao
Abstract
Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
Chinese Translation
尽管多模态大型语言模型(MLLMs)迅速发展,但在细粒度多图像理解方面仍面临显著挑战,通常表现出空间幻觉、注意力泄漏以及对象恒常性的失败。此外,现有方法通常依赖昂贵的人类注释或大规模的思维链(CoT)数据生成。我们提出了组合基准对比(缩写为 CGC),这是一个低成本的完整框架,旨在提升 MLLMs 的细粒度多图像理解能力。CGC 基于现有的单图像基础注释,通过图像间对比(Inter-Image Contrast)和图像内对比(Intra-Image Contrast)构造组合多图像训练实例,分别引入语义解耦的干扰上下文以进行跨图像区分,以及相关的跨视角样本以实现对象恒常性。CGC 还在 GRPO 框架内引入了一种基于规则的空间奖励,以提高源图像的归属、空间对齐和结构输出的有效性,这一切均遵循“先思考再定位”的范式。实验表明,CGC 在细粒度多图像基准测试(包括 MIG-Bench 和 VLM2-Bench)上达到了最先进的结果。所学习的多图像理解能力还可以迁移到更广泛的多模态理解和推理任务上,使得相较于 Qwen3-VL-8B 基础模型,MathVista(+2.90)、MuirBench(+2.88)、MMStar(+1.93)、MMMU(+1.77)和 BLINK(+1.69)等任务中获得了一致性提升。
cs.CV / 44 / 2604.22506

ICPR 2026 Competition on Low-Resolution License Plate Recognition

2026年国际模式识别大会低分辨率车牌识别竞赛
Laroca, Rayson, Nascimento, Valfride, Kim, Donggun, Chung, Sanghyeok, Bae, Subin, Seo, Uihwan, Oh, Seungsang, Phung, Chi M., Vo, Minh G., Ye, Xingsong, Du, Yongkun, Su, Yuchen, Chen, Zhineng, Heo, Sunhee, Lee, Hyangwoo, Na, Kihyun, Nguyen, Khanh V. Vu, Pham, Sang T., Phung, Duc N. N., Le, Trong P., Tran, Vy N. Vo, Menotti, David
Abstract
Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions can severely degrade license plate legibility. To promote progress in this area, we organized the ICPR 2026 Competition on Low-Resolution License Plate Recognition, the first competition specifically dedicated to LRLPR using real low-quality data collected under operationally relevant conditions. The competition was based on the LRLPR-26 dataset, which comprises 20,000 training tracks and 3,000 test tracks; each training track contains five low-resolution and five high-resolution images of the same license plate. Notably, a total of 269 teams from 41 countries registered for the competition, and 99 teams submitted valid entries in the Blind Test Phase. The winning team achieved a Recognition Rate of 82.13%, and four teams surpassed the 80% mark, highlighting both the high level of competition at the top of the leaderboard and the continued difficulty of the task. In addition to presenting the competition design, evaluation protocol, and main results, this paper summarizes the methods adopted by the top-5 teams and discusses current trends and promising directions for future research on LRLPR. The competition webpage is available at https://icpr26lrlpr.github.io/
Chinese Translation
低分辨率车牌识别(Low-Resolution License Plate Recognition, LRLPR)在实际监控场景中仍然是一个具有挑战性的问题,长距离拍摄、压缩伪影以及不利的成像条件会严重影响车牌的可读性。为了促进这一领域的进展,我们组织了2026年国际模式识别大会低分辨率车牌识别竞赛,这是第一个专门针对LRLPR的竞赛,使用了在实际相关条件下收集的真实低质量数据。该竞赛基于LRLPR-26数据集,该数据集包含20,000条训练轨迹和3,000条测试轨迹;每条训练轨迹包含同一车牌的五张低分辨率图像和五张高分辨率图像。值得注意的是,共有来自41个国家的269个团队注册参加此次竞赛,99个团队在盲测阶段提交了有效作品。获胜团队的识别率达到82.13%,并且有四个团队超过了80%的门槛,突显了排行榜前端激烈的竞争水平以及该任务的持续困难。除了介绍竞赛设计、评估协议和主要结果外,本文还总结了前五名团队所采用的方法,并讨论了LRLPR研究的当前趋势和未来有希望的方向。竞赛网页地址为:https://icpr26lrlpr.github.io/
cs.CV / 45 / 2604.22507

Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain

铁路人工智能学习基准 (RAIL-BENCH): 铁路领域感知的基准套件
Bätz, Annika, Klasek, Pavel, Ham, Seo-Young, Neumaier, Philipp, Köppel, Martin, Lauer, Martin
Abstract
Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (https://www.mrt.kit.edu/railbench). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.
Chinese Translation
在现有铁路基础设施上实现列车自动运行需要鲁棒的基于摄像头的感知能力,但铁路领域缺乏具有标准化评估协议、能够支持方法间可重复比较的公共基准套件。我们提出了RAIL-BENCH,这是第一个用于铁路领域的感知基准套件。它包含五个挑战:铁路轨道检测、物体检测、植被分割、多目标跟踪和单目视觉里程计,每个挑战都针对铁路环境的特定特征进行了定制。RAIL-BENCH提供了从多样化的现实场景中提取的经过筛选的训练和测试数据集、评估指标,以及公开排行榜(https://www.mrt.kit.edu/railbench)。对于铁路轨道检测挑战,我们引入了LineAP,这是一种新颖的基于线段的平均精度指标,能够独立于实例级分组评估折线预测的几何准确性,解决了现有线检测指标的关键局限性。
cs.CV / 46 / 2604.22515

Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts

因人而异:历史阿拉伯手稿的作者识别
Abushahla, Hamza A., Panopio, Ariel Justine N., Al-Khairulla, Layth, AlHajri, Mohamed I.
Abstract
Handwritten Arabic manuscripts preserve the Arab world's intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset's labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
Chinese Translation
手写的阿拉伯手稿保留了阿拉伯世界的知识和文化遗产,而作者识别则支持来源、真实性验证和历史分析。利用历史阿拉伯手稿的Muharaf数据集,我们评估了单行图像中的作者识别,并且据我们所知,提供了首次在行级和页面非重叠评估协议下报告的基准。由于该数据集仅部分标注了作者识别,我们手动验证并扩展了公共部分的作者标签,从6,858行(28.00%)扩展到24,495行中的21,249行(86.75%),纠正了不一致性,并移除了非手写文本。经过进一步筛选,我们保留了18,987行(77.51%)。我们提出了一种基于卷积神经网络(CNN)的模型,结合注意力机制用于封闭集作者识别,包括将稀有的两位作者行建模为复合作者对类。我们基准测试了十四种配置,并在不同的特征提取器和训练模式之间进行了消融实验。为了评估对未见页面的泛化能力,页面非重叠协议将每页中的所有行分配给单一的分割。在行级协议下,经过微调的DenseNet201结合注意力机制实现了99.05%的Top-1准确率、99.73%的Top-5准确率和97.44%的F1-score。在更具挑战性的页面非重叠协议下,观察到的最佳结果为78.61%的Top-1准确率、87.79%的Top-5准确率和66.55%的F1-score,从而量化了页面级线索的影响。通过扩展Muharaf数据集中标记子集并报告两个协议,我们提供了更清晰的基准,并为参与文化和历史重要文件的历史学家和语言学家提供了实用资源。代码和实现细节可在GitHub上获取。
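The page-disjoint protocol mentioned in this abstract assigns every line of a page to a single split. A minimal sketch of such a split follows; the function name, the `(line_id, page_id)` record layout, and the 20% test fraction are assumptions for illustration and are not taken from the paper.

```python
import random


def page_disjoint_split(lines, test_fraction=0.2, seed=0):
    """Split (line_id, page_id) records so that no page spans both splits."""
    pages = sorted({page for _, page in lines})
    rng = random.Random(seed)
    rng.shuffle(pages)
    # Hold out whole pages rather than individual lines.
    n_test = max(1, round(len(pages) * test_fraction))
    test_pages = set(pages[:n_test])
    train = [rec for rec in lines if rec[1] not in test_pages]
    test = [rec for rec in lines if rec[1] in test_pages]
    return train, test
```

Splitting at the page level removes page-specific cues (paper texture, ink, scan conditions) shared between training and test lines, which is the effect the gap between the line-level and page-disjoint results quantifies.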
cs.CV / 47 / 2604.22518

Non-Minimal Sampling and Consensus for Prohibitively Large Datasets

非最小采样与共识算法用于极大数据集
Lee, Seong Hun, Vandewalle, Patrick, Civera, Javier
Abstract
We introduce NONSAC (Non-Minimal Sampling and Consensus), a general framework for robust and scalable model estimation from arbitrarily large datasets contaminated with noise and outliers. NONSAC repeatedly samples non-minimal subsets of data and generates model hypotheses using a robust estimator, producing multiple candidate models. The final model is selected based on a predefined scoring rule that evaluates hypothesis quality. Our framework is estimator-agnostic and can be integrated with existing geometric fitting algorithms such as RANSAC to improve both scalability and robustness to outliers. We propose and evaluate various scoring rules for NONSAC on relative camera pose estimation, Perspective-n-Point, and point cloud registration. Furthermore, we showcase the applicability of NONSAC to correspondence-free point cloud registration by hypothesizing all-to-all correspondences.
Chinese Translation
我们介绍了非最小采样与共识算法(NONSAC),这是一个用于从被噪声和离群值污染的任意大数据集中进行鲁棒且可扩展模型估计的通用框架。NONSAC 重复地对非最小数据子集进行采样,并使用鲁棒估计器生成模型假设,产生多个候选模型。最终模型的选择基于预定义的评分规则,该规则用于评估假设的质量。我们的框架与估计器无关,并且可以与现有的几何拟合算法(如 RANSAC)集成,以提高可扩展性和对离群值的鲁棒性。我们提出并评估了多种用于 NONSAC 的评分规则,应用于相对相机位姿估计、透视-n-点和点云配准。此外,我们通过假设全到全的对应关系,展示了 NONSAC 在无对应点云配准中的适用性。
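The sample, estimate, and score loop described in this abstract can be sketched on a toy problem: estimating a scalar location from data with outliers. The median stands in for the robust estimator and inlier counting on the sampled subset stands in for the scoring rule. Both choices, and all parameter names, are assumptions for illustration; the paper evaluates its own scoring rules on geometric problems.

```python
import random
from statistics import median


def nonsac_location(data, subset_size=50, iterations=20, inlier_tol=0.5, seed=0):
    """Toy non-minimal sample-and-consensus loop for a 1-D location model."""
    rng = random.Random(seed)
    best_model, best_score = None, -1
    for _ in range(iterations):
        # Draw a non-minimal subset (far more points than the model needs).
        subset = rng.sample(data, subset_size)
        # Robust estimator applied to the subset (stand-in: the median).
        model = median(subset)
        # Assumed scoring rule: inlier count on the subset only, so the
        # full dataset is never scanned.
        score = sum(abs(x - model) <= inlier_tol for x in subset)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```

Because each hypothesis only touches `subset_size` points, the cost per iteration does not grow with the size of the dataset, which is the scalability argument the abstract makes.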
cs.CV / 48 / 2604.22529

Distilling Vision Transformers for Distortion-Robust Representation Learning

用于抗失真表征学习的视觉变换器蒸馏
Alexis, Konstantinos, Giannopoulos, Giorgos, Gunopulos, Dimitrios
Abstract
Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.
Chinese Translation
自监督学习在从干净数据中学习视觉表征方面取得了显著成功,但在干净观察稀缺或完全不可用时仍然面临挑战。本文展示了如何利用预训练的视觉模型来学习抗失真的表征,这些表征可以有效应用于处理失真观察的下游任务。特别地,我们提出了一个不对称知识蒸馏框架,其中教师和学生均从同一预训练的视觉变换器(Vision Transformer)初始化,但接收每幅图像的不同视图:教师处理干净图像,而学生则看到其失真版本。我们引入了多层次蒸馏,旨在对齐全局嵌入、块级特征和注意力图,并展示学生能够逼近干净图像的表征,尽管从未直接访问干净数据。我们在多个数据集和各种失真条件下评估了我们的方法,在图像分类任务中持续超越了现有的同类方法,其人类监督量相同。
cs.CV / 49 / 2604.22539

Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals

学术制图中的主题地图设计演变:基于多语言期刊的三十年研究
Wei, Zhiwei, Song, Chenxi, Wang, Tazhu, Wu, Fan, Liao, Hua, Ding, Su, Yang, Nai
Abstract
Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.
Chinese Translation
主题地图在学术交流中发挥着核心作用,但其大规模设计演变鲜有实证研究。本文呈现了1990年至2020年间学术制图中主题地图设计实践的纵向和多语言分析。我们汇编了来自16本权威中英文期刊的45,732篇研究文章,并利用计算机视觉和基于大模型的文档解析技术提取了23,928张地图,以构建结构化数据集。地图设计特征在三个维度上进行了量化:地图元素、颜色设计和布局结构。结果表明,中英文学术地图在结构规范上高度相似,通常采用中性色调为主的克制色彩搭配,饱和度低、亮度高、色相多样性有限,且布局集中,主地图的占比率高。不同之处在于英文地图在色相丰富性和紧凑性上略胜一筹,而中文地图历史上则更依赖于中性色调和综合布局。时间分析揭示了两组之间平行的演变趋势,包括元素丰富性、图例使用以及色相多样性的增加,同时布局结构保持稳定。总体而言,研究结果表明,学术地图设计的演变更受制度趋同而非文化差异的影响。
cs.CV / 50 / 2604.22546

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

ReLIC-SGG:开放词汇场景图生成中的关系格补全
Hosseini, Amir, Farahani, Sara, Li, Xinyi, Guang, Suiyang
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., \textit{on}, \textit{standing on}, \textit{resting on}, and \textit{supported by}. This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose \textbf{ReLIC-SGG}, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
Chinese Translation
开放词汇场景图生成(SGG)旨在以灵活的关系短语描述视觉场景,超越固定的谓词集合。现有方法通常将标注的三元组视为正例,而将所有未标注的对象对关系视为负例。然而,场景图的标注本质上是不完整的:许多有效的关系缺失,同一交互可以以不同的粒度进行描述,例如 \textit{on}(在……上)、\textit{standing on}(站在……上)、\textit{resting on}(靠在……上)和 \textit{supported by}(由……支撑)。在开放词汇SGG中,由于关系空间大幅扩大,这一问题变得更加严重。我们提出了 \textbf{ReLIC-SGG},一个关注关系不完整性的框架,认为未标注的关系是潜在变量,而非确定的负例。ReLIC-SGG构建了一个语义关系格,以建模开放词汇谓词之间的相似性、蕴含和矛盾,并利用它从视觉-语言兼容性、图上下文和语义一致性中推断缺失的正关系。一个正例-未标记图学习目标进一步减少了假阴性监督,而基于格的解码产生了紧凑且语义一致的场景图。在常规、开放词汇和全景SGG基准上的实验表明,ReLIC-SGG在识别稀有和未见谓词方面有所提升,并能够更好地恢复缺失的关系。
cs.CV / 51 / 2604.22552

Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models

针对行人检测模型的可迁移物理世界对抗补丁
Yan, Shihui, Zhou, Ziqi, Song, Yufei, Hu, Yifan, Li, Minghui, Hu, Shengshan
Abstract
Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.
Chinese Translation
物理对抗补丁攻击对行人检测构成严重威胁,使监控和自动驾驶系统无法识别行人,从而带来严重的安全风险。尽管在受控环境中效果显著,现有的物理攻击在实际应用中面临两个主要局限:一是缺乏对多阶段决策流程的系统性干扰,使得残余模块能够抵消扰动;二是未能有效模拟复杂的物理变化,导致鲁棒性不足。为了克服这些局限,我们提出了一种新颖的行人对抗补丁生成方法,结合多阶段协同攻击与在物理多样性下的鲁棒性增强,称为TriPatch。具体而言,我们设计了一种三元组损失,包括检测置信度抑制、边框偏移放大和非极大值抑制(NMS)干扰,这些都在检测流程的不同阶段共同作用。此外,我们引入了外观一致性损失,以约束补丁的颜色分布,从而提高其在不同成像条件下的适应性,并结合数据增强进一步增强对复杂物理扰动的鲁棒性。大量实验表明,与现有方法相比,TriPatch在多个检测器模型上实现了更高的攻击成功率。
cs.CV / 52 / 2604.22554

Video Analysis and Generation via a Semantic Progress Function

通过语义进展函数进行视频分析与生成
Metzer, Gal, Polaczek, Sagi, Mahdavi-Amiri, Ali, Giryes, Raja, Cohen-Or, Daniel
Abstract
Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
Chinese Translation
图像和视频生成模型产生的变换通常以高度非线性的方式演变:内容几乎不变的长时间段之后往往出现突然而剧烈的语义跳跃。为了分析和纠正这种行为,我们引入了一种语义进展函数(Semantic Progress Function),这是一种一维表示,捕捉给定序列的意义随时间演变的过程。对于每一帧,我们计算语义嵌入之间的距离,并拟合一条平滑曲线,反映序列中累积的语义变化。这条曲线偏离直线的程度揭示了语义节奏的不均匀。基于这一洞察,我们提出了一种语义线性化过程,对序列进行重新参数化(或重新定时),使得语义变化以恒定速率展开,从而产生更平滑和更连贯的过渡。除了线性化,我们的框架还提供了一个与模型无关的基础,用于识别时间不规律性,比较不同生成器之间的语义节奏,以及引导生成的和真实世界的视频序列朝向任意的目标节奏。
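The idea of a cumulative semantic progress curve and its linearization can be sketched as follows. This is a simplified illustration under stated assumptions: embeddings are plain tuples of floats, the distance is Euclidean, the smooth curve fit mentioned in the abstract is omitted, and retiming picks existing frame indices instead of synthesizing new frames. The function names are hypothetical.

```python
import math
from bisect import bisect_left


def semantic_progress(embeddings):
    """Normalized cumulative semantic change, one value in [0, 1] per frame."""
    steps = [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]
    total = sum(steps)
    if total == 0:
        # No semantic change at all: fall back to uniform progress.
        n = len(embeddings)
        return [i / max(n - 1, 1) for i in range(n)]
    progress = [0.0]
    for step in steps:
        progress.append(progress[-1] + step / total)
    return progress


def linearize(progress, num_out):
    """Pick frame indices whose progress values are evenly spaced (num_out >= 2)."""
    targets = [i / (num_out - 1) for i in range(num_out)]
    last = len(progress) - 1
    # First frame whose cumulative progress reaches each target.
    return [min(bisect_left(progress, t), last) for t in targets]
```

On a sequence with one abrupt jump, several evenly spaced targets map to the same frame right after the jump. That repetition marks the place where a generator would have to produce intermediate content for the semantic change to unfold at a constant rate.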
cs.CV / 53 / 2604.22560

Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

层次驾驶视觉问答中的跨阶段一致性:显式基线与学习的门控上下文投影器
Jain, Gautam Kumar, Markgraf, Carsten, Stähler, Julian
Abstract
Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
Chinese Translation
图形视觉问答(GVQA)针对自主驾驶将推理组织为有序的阶段,包括感知、预测和规划,其中规划决策应与模型自身的感知保持一致。我们在DriveLM-nuScenes上进行了跨阶段上下文传递的比较研究,使用两种互补机制。显式变体评估了基于提示的三种条件策略,这些策略在经过域适应的4B视觉语言模型(Mini-InternVL2-4B-DA-DriveLM)上运行,无需额外训练,减小了NLI(自然语言推理)矛盾达42.6%,并建立了一个强有力的零训练基线。隐式变体引入了门控上下文投影器,这些投影器从一个阶段提取隐藏状态向量,并将归一化的门控投影注入到下一个阶段的输入嵌入中。这些投影器与阶段特定的QLoRA适配器在一个通用的8B视觉语言模型(InternVL3-8B-Instruct)上进行联合训练,同时仅更新约0.5%的参数。隐式变体在规划阶段NLI矛盾上实现了统计显著的34%减少(引导自助法95%置信区间,p < 0.05),并且跨阶段的蕴含性提高了50%,这一结果是通过多语言NLI分类器评估的,以考虑混合语言输出。规划语言的质量也有所提高(CIDEr +30.3%),但由于缺乏驾驶领域的预训练,词汇重叠和结构一致性有所下降。由于这两种变体使用不同的基础模型,我们将它们呈现为互补的案例研究:显式上下文传递为表面一致性提供了强大的无训练基线,而隐式门控投射则在规划阶段带来了显著的语义增益,这表明域适应可能是全面改进的下一步关键。
cs.CV / 54 / 2604.22586

FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

FlowAnchor:稳定无反演视频编辑信号
Chen, Ze, Chen, Lan, Li, Yuanhang, Mao, Qi
Abstract
We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at https://cuc-mipg.github.io/FlowAnchor.github.io/.
Chinese Translation
我们提出了FlowAnchor,这是一个无训练的框架,用于稳定和高效的无反演、基于流的视频编辑。无反演编辑方法最近在图像处理上展示了令人印象深刻的效率和结构保留,通过直接操控采样轨迹与编辑信号进行编辑。然而,将这一范式扩展到视频仍然面临挑战,常常在多物体场景或增加帧数的情况下失败。我们认为其根本原因在于高维视频潜在空间中编辑信号的不稳定性,这源于不精确的空间定位和长度引起的幅度衰减。为了克服这一挑战,FlowAnchor 明确地锚定了编辑位置及编辑强度。它引入了空间注意力细化(Spatial-aware Attention Refinement),该方法强制文本指导与空间区域之间的一致对齐,并采用自适应幅度调制(Adaptive Magnitude Modulation),以自适应地保持足够的编辑强度。这些机制共同稳定了编辑信号,引导基于流的演变朝向所需的目标分布。大量实验表明,FlowAnchor在多物体和快速运动场景中实现了更真实、时间一致且计算高效的视频编辑。项目页面可访问:https://cuc-mipg.github.io/FlowAnchor.github.io/
cs.CV / 55 / 2604.22595

EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges

EV-CLIP:在视觉挑战下进行少样本动作识别的 CLIP 高效视觉提示适应
Jon, Hyo Jin, Jin, Longbin, Kim, Eun Yi
Abstract
CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.
Chinese Translation
CLIP 在视觉领域通过自然语言监督展现出强大的泛化能力,甚至在视频动作识别方面也如此。然而,现有大多数适应 CLIP 进行动作识别的方法主要集中于时间建模,往往忽视了空间感知。在现实场景中,低光环境或自我中心视角等视觉挑战可能严重影响空间理解,而这对于有效的时间推理至关重要。为了解决这一局限性,我们提出了针对 CLIP 的高效视觉提示(EV-CLIP),这是一个专为多样化场景和视角下的少样本视频动作识别设计的高效适应框架。EV-CLIP 引入了两种视觉提示:掩码提示,通过重新加权像素来引导模型的注意力集中在与动作相关的区域,以及上下文提示,通过将逐帧特征压缩为紧凑表示来执行轻量级的时间建模。为了进行全面评估,我们整理了五个基准数据集,并分析领域转移,以量化多样化视觉和语义因素对动作识别的影响。实验结果表明,EV-CLIP 在整体性能上超过了现有的参数高效方法。此外,其效率不依赖于主干网络的规模,使其非常适合在现实世界资源受限的场景中部署。代码可在 https://github.com/AI-CV-Lab/EV-CLIP 获得。
cs.CV / 56 / 2604.22657

A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock

一种非侵入式的RFID替代方案:自给自足的群体养殖牲畜三维识别
Paudel, Shiva, Tsai, TsungCheng, Wang, Dongyi
Abstract
Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group-housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.
Chinese Translation
在群体养殖环境中准确识别个体农场动物是精准畜牧管理的基石。然而,现行行业标准过于依赖无线射频识别(RFID)耳标,这种方法具有侵入性、易于丢失,并受到天线场空间限制的影响。本文提出了一种非侵入式的基于视觉的识别系统,利用在商业电子饲喂站(EFS)中捕获的三维点云数据。与传统的监督帧级推断方法不同,我们引入了时序自适应识别架构(Temporal Adaptive Recognition Architecture, TARA),这是一种自给自足的半监督框架,旨在随时间保持身份的一致性。TARA采用动态重校准机制,更新个体身份档案,以适应牲畜的形态变化。为了在标签稀缺的环境中促进训练,我们利用访问级多数投票策略从原始时间序列生成高保真度的伪标签。我们在一个实际运营的商业养殖场收集的群体养殖母猪数据集上的实验结果表明,我们的方法在访问级别达到了100%的识别准确率。这些结果表明,基于视觉的三维点云分析提供了一种鲁棒且优越的RFID系统替代方案,为完全自主的个体动物监测铺平了道路。
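The visit-level majority voting mentioned in this abstract can be sketched in a few lines. The agreement cutoff `min_agreement` and the choice to discard low-agreement visits are assumptions for illustration; the abstract does not state how ties or ambiguous visits are handled.

```python
from collections import Counter


def visit_pseudo_label(frame_predictions, min_agreement=0.6):
    """Aggregate per-frame identity predictions from one feeder visit.

    Returns the majority identity, or None when the visit is empty or the
    majority share falls below the (hypothetical) agreement cutoff.
    """
    if not frame_predictions:
        return None
    label, votes = Counter(frame_predictions).most_common(1)[0]
    if votes / len(frame_predictions) < min_agreement:
        return None
    return label
```

Voting over a whole visit lets occasional frame-level errors be outvoted, so the resulting pseudo-labels can be cleaner than the per-frame predictions they are built from.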
cs.CV / 57 / 2604.22658

PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views

PASR: 基于姿势感知的遮挡单视图3D形状检索
Shi, Jiaxin, Zhang, Guofeng, Ma, Wufei, Liang, Naifu, Kortylewski, Adam, Vuile, Alan
Abstract
Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
Chinese Translation
单视图3D形状检索是一项基本但具有挑战性的任务,随着可用3D数据的增长,其重要性日益增强。现有方法主要分为两类:一类是使用对比学习将点云特征映射到现有的视觉-语言空间,另一类是为2D图像和3D形状学习一个通用嵌入空间。然而,这些前馈的整体对齐方法通常难以解释,从而限制了它们在实际应用中的鲁棒性和泛化能力。为了解决这一问题,我们提出了基于姿势感知的3D形状检索框架(PASR),该框架将检索问题表述为一个通过综合分析进行特征级分析的问题,通过将来自2D基础模型(DINOv3)的知识提炼到3D编码器中。通过将姿势条件的3D投影与2D特征图对齐,我们的方法架起了现实世界图像与合成网格之间的桥梁。在推理过程中,PASR通过分析-综合过程执行测试时优化,联合搜索最佳重构输入图像的块级特征图的形状和姿势。这种基于综合的优化对部分遮挡本质上具有鲁棒性,并且对细粒度几何细节敏感。PASR在干净和遮挡的3D形状检索数据集上均显著超越现有方法。此外,PASR还展示出强大的多任务能力,在单一框架内实现了鲁棒的形状检索、具有竞争力的姿势估计和准确的类别分类。
cs.CV / 58 / 2604.22686

SS3D: End2End Self-Supervised 3D from Web Videos

SS3D:基于网络视频的端到端自监督三维估计
Hariat, Marwane, Franchi, Gianni, Filliat, David, Manzanera, Antoine
Abstract
We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
Chinese Translation
我们提出了SS3D,一种网络规模的、基于运动恢复结构(SfM)的自监督预训练管道,用于从单目视频中进行前馈三维估计。我们的模型在单次前向传递中共同预测深度、自身运动和内部参数,并作为一个一致的端到端三维估计器进行训练和评估。为了稳定联合学习,我们采用了以内部参数为首的两阶段计划和统一的单检查点评估协议。将SfM自监督扩展到无约束的网络视频面临挑战,因为存在较弱的多视图可观察性和强烈的语料库异质性;我们通过使用多视图信号代理(MVS)进行过滤和课程采样来解决这些问题,并将专家训练提炼为一个单一的学生。针对YouTube-8M(经过过滤后约1亿帧)进行的预训练实现了强大的跨领域零样本迁移和优于前期自监督基准的改进微调性能。我们发布了预训练的检查点和代码。
cs.CV / 59 / 2604.22700

Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model

基于4D纵向扩散模型的神经退行性脑解剖生成建模
Jayakumar, Nivetha, Deb, Swakshar, Jafrasteh, Bahram, Zhao, Qingyu, Zhang, Miaomiao
Abstract
Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are temporally sparse with a few follow-up scans per subject. This scarcity of temporal data limits our ability to model and accurately capture the continuous anatomical changes related to disease progression in individual subjects. To address this problem, we propose a novel 4D (3DxT) diffusion-based generative framework that effectively models and synthesizes longitudinal brain anatomy over time, conditioned on available clinical variables such as health status, age, sex, and other relevant factors. Moreover, while most current approaches focus on manipulating image intensity or texture, our method explicitly learns the data distribution of topology-preserving spatiotemporal deformations to effectively capture the geometric changes of brain structures over time. This design enables the realistic generation of future anatomical states and the reconstruction of anatomically consistent disease trajectories, providing a more faithful representation of longitudinal brain changes. We validate our model through both synthetic sequence generation and downstream longitudinal disease classification, as well as brain segmentation. Experiments on two large-scale longitudinal neuroimage datasets demonstrate that our method outperforms state-of-the-art baselines in generating anatomically accurate, temporally consistent, and clinically meaningful brain trajectories. Our code is available on Github.
Chinese Translation
理解和预测神经退行性疾病的发展仍然是医学人工智能面临的重大挑战,这对于早期诊断、疾病监测和治疗规划具有重要意义。然而,现有的大多数纵向神经影像数据集时间上稀疏,每个受试者只有少量的随访扫描。这种时间数据的稀缺限制了我们对与疾病进展相关的个体解剖变化的建模和准确捕捉。为了解决这个问题,我们提出了一种新颖的基于4D(3D × 时间)扩散的生成框架,该框架有效地建模和合成随着时间变化的纵向脑解剖,条件是可用的临床变量,如健康状态、年龄、性别及其他相关因素。此外,尽管当前大多数方法集中在处理图像强度或纹理上,我们的方法明确学习了保持拓扑结构的时空变形的数据分布,以有效捕捉脑结构随时间变化的几何变化。这一设计使得对未来解剖状态的真实生成以及解剖一致的疾病轨迹重建成为可能,从而提供了对纵向脑变化更真实的表征。我们通过合成序列生成、下游纵向疾病分类以及脑分割验证了我们的模型。在两个大规模纵向神经影像数据集上的实验表明,我们的方法在生成解剖准确、时间一致且临床有意义的脑轨迹方面优于最先进的基线方法。我们的代码已在Github上发布。
cs.CV / 60 / 2604.22714

Long-tail Internet photo reconstruction

长尾互联网照片重建
Li, Yuan, Xiangli, Yuanbo, Averbuch-Elor, Hadar, Snavely, Noah, Cai, Ruojin
Abstract
Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.
Chinese Translation
互联网照片集合表现出极其长尾的分布:少数著名地标被密集拍摄且易于进行三维重建,而大多数现实场景则由稀疏、噪声和不均匀的图像表示,超出了经典和学习型三维方法的能力。我们认为,解决这一长尾现象代表了三维基础模型的下一个前沿。尽管从稀疏场景中获取可靠的真实三维监督是具有挑战性的,但我们观察到,可以通过从重构良好的互联网地标中抽样稀疏子集来有效模拟。为此,我们引入了MegaDepth-X,一个包含干净、密集深度的三维重建的大型数据集,以及一种抽样训练图像集的策略,该策略模拟长尾场景中的相机分布。通过这些组件对三维基础模型进行微调,能够在极端稀疏的情况下获得稳健的重建,并且在对称和重复场景中的重建也更加可靠,同时保持对标准的、密集的三维基准数据集的良好泛化能力。
cs.CV / 61 / 2604.22739

Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis

跨态度:用于对话态度分析的二人多模态语料库
Zhang, Xiang, Li, Xiaotian, Wang, Taoyue, Bi, Nan, Zhou, Xin, Zhou, Cody, Wang, Zoie, Yang, Andrew, Su, Yuming, Cohn, Jeff, Ji, Qiang, Yin, Lijun
Abstract
Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other's postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, physiology (PPG, EDA, heart-rate, blood pressure, and respiration)), and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will enable multimodal modeling of social interaction that was not possible before. The dataset includes 20TB of multimodal data to share with the research community.
Chinese Translation
社会互动支配着我们对世界的认知,并通过将社会意义附加到简单和自发的行为上,如手势、面部表情、声音和语言,塑造我们的日常行为。人们模仿和回应彼此的姿态、面部表情、举止以及其他言语和非言语行为,并在此过程中形成评价或评估。然而,目前没有公开可用的数据集包含多重个体在社会互动中的多模态录音和自我报告测量。缺乏二人录音和注释。我们展示了一个新的多模态二人互动数据集(45对,90人),该数据集包括同步的多模态行为(2D面部视频、3D面部几何、热谱动态、声音和语言行为、身体生理(PPG、EDA、心率、血压和呼吸)以及所有参与者在交流互动场景中的自我报告情感。该数据集包含两种类型的二人对:具有共同历史的人和陌生人。注释包括社会信号、同意、不同意和中立态度。通过强有力的情感诱导,这些多模态数据将使对多模态人际行为的新模型建立成为可能。我们进行了广泛的实验,以评估有无人际历史的二人之间的多模态交流及其情感。这个新数据库将使社会互动的多模态建模成为前所未有的可能。该数据集包含20TB的多模态数据,供研究社区共享。
人工智能 (Artificial Intelligence)
18
cs.AI / 1 / 2604.21935

Math Takes Two: A test for emergent mathematical reasoning in communication

数学需要双人合作:一种针对交流中涌现数学推理的测试
Cooper, Michael, Cooper, Samuel
Abstract
Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existing evaluations rely on symbolic problems grounded in established mathematical conventions, limiting insight into the models' ability to construct abstract concepts from first principles. In this work, we propose Math Takes Two, a new benchmark designed to assess the emergence of mathematical reasoning through communication. Motivated by the hypothesis that mathematical cognition in humans co-evolved with the need for precise communication, our benchmark tests whether two agents, without prior mathematical knowledge, can develop a shared symbolic protocol to solve a visually grounded task where the use of a numerical system facilitates extrapolation. Unlike many current datasets, our benchmark eschews predefined mathematical language, instead requiring agents to discover latent structure and representations from scratch. Math Takes Two thus provides a novel lens through which to develop and evaluate models with emergent numerical reasoning capabilities.
Chinese Translation
尽管语言模型在数学基准测试中展现出卓越的能力,但尚不清楚这是否反映了真正的数学推理能力,或者仅仅是对学习的正式语法进行的统计模式匹配。大多数现有评估依赖于基于既定数学惯例的符号问题,这限制了对模型从第一原理构建抽象概念能力的洞察。在本研究中,我们提出了 Math Takes Two,一个新基准,旨在通过交流评估数学推理的涌现。我们提出的假设是,人类的数学认知与精确沟通的需求共同演化,我们的基准测试两个没有先前数学知识的参与者是否能够开发共享的符号协议,以解决一个视觉基础任务,在此任务中,数值系统的使用有助于外推。与许多当前数据集不同,我们的基准避免了预定义的数学语言,而是要求参与者从零开始发现潜在结构和表示。因此,Math Takes Two 提供了一个新颖的视角,通过该视角发展和评估具有涌现数值推理能力的模型。
cs.AI / 2 / 2604.21936

An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing

基于工件的代理框架用于适应性和可重复的医学图像处理
Zuo, Lianrui, Liu, Yihao, Rudravaram, Gaurav, Ramadass, Karthik, Krishnan, Aravind R., Phillips, Michael D., Bodien, Yelena G., Patel, Mayur B., Trujillo, Paula, Martinez, Yency Forero, Deppen, Stephen A., Grogan, Eric L., Maldonado, Fabien, McGann, Kevin, Holmes, Hudson M., Cutting, Laurie E., Huo, Yuankai, Landman, Bennett A.
Abstract
Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and provenance tracking. Two requirements therefore become central: \textbf{adaptability}, the ability to configure workflows according to dataset-specific conditions and evolving analytical goals; and \textbf{reproducibility}, the guarantee that all transformations and decisions are explicitly recorded and re-executable. Here, we present an artifact-based agent framework that introduces a semantic layer to augment medical image processing. The framework formalizes intermediate and final outputs through an artifact contract, enabling structured interrogation of workflow state and goal-conditioned assembly of configurations from a modular rule library. Execution is delegated to a workflow executor to preserve deterministic computational graph construction and provenance tracking, while the agent operates locally to comply with most privacy constraints. We evaluate the framework on real-world clinical CT and MRI cohorts, demonstrating adaptive configuration synthesis, deterministic reproducibility across repeated executions, and artifact-grounded semantic querying. These results show that adaptive workflow configuration can be achieved without compromising reproducibility in heterogeneous clinical environments.
Chinese Translation
医学影像研究正日益从受控基准评估转向现实世界的临床应用。在这种背景下,应用分析方法不仅限于模型设计,还需要根据数据集进行工作流程配置和溯源追踪。因此,有两个要求变得至关重要:\textbf{适应性},即根据数据集特定条件和不断发展的分析目标配置工作流程的能力;以及\textbf{可重复性},即确保所有转换和决策被明确记录且可重新执行。在此,我们提出了一种基于工件的代理框架,该框架引入了一个语义层以增强医学图像处理。该框架通过工件合同形式化中间和最终输出,使得可以结构化地查询工作流程状态,并从模块化规则库中按照目标条件组装配置。执行任务被委托给工作流程执行器,以保存确定性的计算图构建和溯源追踪,同时代理在本地操作以遵守大多数隐私约束。我们在现实世界的临床 CT 和 MRI 队列上评估了该框架,展示了适应性配置合成、重复执行中的确定性可重复性和基于工件的语义查询。这些结果表明,在异构临床环境中,适应性工作流程配置可以在不妥协可重复性的情况下实现。
cs.AI / 3 / 2604.21937

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

MolClaw:一种具有层次技能的自主智能体,用于药物分子的评估、筛选和优化
Zhang, Lisheng, Wang, Lilong, Sun, Xiangyu, Tang, Wei, Su, Haoyang, Qian, Yuehui, Yang, Qikui, Li, Qingsong, Tang, Zhenyu, Sun, Haoran, Han, Yingnan, Jiang, Yankai, Lou, Wenjie, Zhou, Bowen, Wang, Xiaosong, Bai, Lei, Xie, Zhengwei
Abstract
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
Chinese Translation
计算药物发现,尤其是药物分子筛选和优化的复杂工作流程,需要在多步骤工作流程中协调数十种专业工具,而当前的人工智能智能体在这些高复杂度场景中难以保持稳健性能,并且表现不稳定。在此,我们推出了MolClaw,一种能够主导药物分子评估、筛选和优化的自主智能体。它通过三层层次技能架构(共70项技能)统一了30多种专业领域资源,促进了智能体在运行时的长期互动:工具级技能标准化原子操作,工作流程级技能将其组合成带有质量检查和反思的验证管道,而学科级技能则提供了管理规划和验证的科学原则,适用于该领域的所有场景。此外,我们还介绍了MolBench,一个基准测试,包括分子筛选、优化和端到端发现挑战,涵盖8到50多个顺序工具调用。MolClaw在所有指标上都达到了最先进的性能,消融研究证实,性能提升主要集中在需要结构化工作流程的任务上,而在那些可以用临时脚本解决的任务上则消失,确立了工作流程协调能力作为人工智能驱动药物发现的主要能力瓶颈。
cs.AI / 4 / 2604.21965

Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

阅读论文,编写代码:社会科学成果的代理性再现
Kohler, Benjamin, Zollikofer, David, Einsiedler, Johanna, Hoyle, Alexander, Ash, Elliott
Abstract
Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.
Chinese Translation
近期的研究利用大型语言模型(LLM)代理,在可同时访问数据和代码的条件下重现实证社会科学结果。我们拓宽了这一研究范围,提出以下问题:仅凭论文中的方法描述和原始数据,它们能否重现结果?我们开发了一种代理性再现系统,该系统从论文中提取结构化的方法描述,在严格的信息隔离下运行重实现(代理从未见过原始代码、结果或论文),并允许对再现输出与原始结果进行确定性、单元格级别的比较。一个错误归因步骤通过系统链追踪不一致,识别根本原因。通过评估四种代理架构和四种LLM在48篇经过人工验证的可重现性论文的表现,我们发现代理可以在很大程度上恢复已发布的结果,但不同模型、架构和论文之间表现差异显著。根本原因分析表明,失败的产生既源于代理错误,也源于论文本身的描述不足。
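The deterministic, cell-level comparison mentioned in this abstract might look roughly like the sketch below. It assumes that both the published and the reproduced tables have been flattened into dictionaries keyed by `(row, column)`. The function name `compare_cells`, the relative tolerance, and the sample numbers are illustrative assumptions, not details from the paper.

```python
import math


def compare_cells(original, reproduced, rel_tol=1e-3):
    """Return {cell_key: (original_value, reproduced_value)} for disagreeing cells.

    A cell disagrees when it is missing on either side or when the two
    numbers differ by more than the (assumed) relative tolerance.
    """
    mismatches = {}
    # Sorting the union of keys keeps the report order deterministic.
    for key in sorted(set(original) | set(reproduced)):
        a, b = original.get(key), reproduced.get(key)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            mismatches[key] = (a, b)
    return mismatches


published = {("treatment", "coef"): 0.152, ("treatment", "se"): 0.031}
rerun = {("treatment", "coef"): 0.1521, ("treatment", "se"): 0.045}
report = compare_cells(published, rerun)
```

A report keyed by cell gives a later error-attribution step a concrete starting point, since each discrepancy is tied to one table entry.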
cs.AI / 5 / 2604.22026

Rethinking Publication: A Certification Framework for AI-Enabled Research

重新思考出版:一个针对人工智能驱动研究的认证框架
Lu, Yang, Karanjai, Rabimba, Xu, Lei, Shi, Weidong
Abstract
AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases spanning key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research as both a transparent publication track and a calibration instrument for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human made it. AI pipelines separate these functions for the first time. The framework is implementable within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than unverifiable claims of human origin.
Chinese Translation
人工智能(AI)研究流程现在产生越来越多的可发表学术成果,包括符合现有同行评审标准的高质量和新颖性作品。然而,出版系统的建立是假设所有作品均由人类作者创作,并且缺乏一种原则性方法来评估通过自动化流程产生的知识。本文提出了一个两层认证框架,独立评估知识质量与人类贡献的评分,从而使出版系统能够一致且透明地处理流程生成的作品,而无需创建新的机构。本文使用规范-概念分析、在四个明确约束下的框架设计,以及针对两种代表性提交案例的干运行验证,涵盖关键归属场景。该框架将贡献分为三类:A类(可通过流程获得)、B类(在可识别阶段需要人类指导)和C类(在概念阶段超出当前流程能力的)。它还引入了完全披露的自动化研究的基准位置,作为透明出版轨道和审稿人判断的校准工具。贡献评分是基于提交时的流程能力,同时进行。干运行验证表明,该框架能够适当地认证知识,同时容忍不可减少的归属不确定性。本文认为,出版一直在认证知识的有效性以及人类的创作。人工智能流程首次将这两个功能分离。该框架可以在现有编辑基础设施中实施,并将前沿人类贡献的认可基础放在认知成就上,而非不可验证的人源声明之上。
cs.AI / 6 / 2604.22080

Sound Agentic Science Requires Adversarial Experiments

可靠的智能体科学需要对抗性实验
Fa, Dionizije, Culjak, Marko
Abstract
LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.
Chinese Translation
基于大型语言模型(LLM)的代理正在迅速被用于科学数据分析,自动化曾经受限于人类时间和专业知识的任务。这种能力常被认为是发现的加速,但它也加速了一个熟悉的失败模式,即快速产生看似合理、可以不断修订的分析,这些分析容易生成,实际上将假设空间转变为通过选择性选择的分析支持的候选主张,优化为可发表的积极结果。与软件不同,科学知识并不是通过代码的迭代积累和事后统计支持来验证的。流利的解释或在单一数据集上显著的结果并不构成验证。由于缺失的证据是一个负空间,从未进行或未发表验证此主张的实验和分析。因此,我们建议将使用代理助理生成的非实验性主张按照先驳斥标准进行评估:代理不应主要用于编写最令人信服的叙述,而应积极搜索此主张可能失败的方式。
cs.AI / 7 / 2604.22085

Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

Memanto:具有信息理论检索的类型化语义记忆用于长视域智能体
Abtahi, Seyed Moein, Rahnema, Rasa, Patel, Hetkumar, Patel, Neel, Fekri, Majid, Khani, Tara
Abstract
The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large-language-model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
Chinese Translation
从无状态语言模型推理到持久的多会话自主智能体的过渡揭示了记忆在生产级智能系统部署中的主要架构瓶颈。现有的方法主要依赖混合语义图架构,这在数据摄取和检索过程中都会造成相当大的计算开销。这些系统通常需要大型语言模型介导的实体提取、显式图模式维护和多查询检索管道。本文介绍了Memanto,一个用于智能人工智能的通用记忆层,挑战了要实现高保真智能体记忆这一普遍假设,即知识图复杂性是必要的。Memanto集成了一个包含十三类预定义记忆的类型化语义记忆schema,一个自动冲突解决机制,以及时间版本控制。这些组件得益于Moorcheh的信息理论检索引擎,这是一个无需索引的语义数据库,在低于90毫秒的延迟内提供确定性检索,同时消除了摄取延迟。通过在LongMemEval和LoCoMo评估套件上的系统基准测试,Memanto分别实现了89.8%和87.1%的最新准确率。这些结果超越了所有评估的混合图和基于向量的系统,仅需一个检索查询,且没有摄取成本,并保持显著较低的操作复杂性。文中呈现了一个五阶段的逐步消融研究,以量化每个架构组件的贡献,并讨论了智能记忆系统可扩展部署的影响。
cs.AI / 8 / 2604.22119

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

人工智能中的新兴战略推理风险:基于分类的评估框架
Kumarage, Tharindu, Bauer, Lisa, Ma, Yao, Rosen, Dan, Guduri, Yashasvi Raghavendra, Rumshisky, Anna, Chang, Kai-Wei, Galstyan, Aram, Gupta, Rahul, Peris, Charith
Abstract
As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
Chinese Translation
随着推理能力和应用范围的同步增长,大型语言模型(LLMs)获得了进行服务自身目标行为的能力,这类风险被称为新兴战略推理风险(Emergent Strategic Reasoning Risks, ESRRs)。这些风险包括但不限于欺骗(故意误导用户或评估人员)、评估游戏(在安全测试中战略性地操控性能)和奖励破解(利用不当目标进行牟利)。系统性理解和基准评估这些风险仍然是一个开放的挑战。为了解决这一问题,我们提出了ESRRSim,这是一个基于分类的自主框架,用于自动化行为风险评估。我们构建了一个可扩展的风险分类,共包含7个类别,进一步细分为20个子类别。ESRRSim生成评估场景,以引发真实推理,并配备双重评分标准,分别评估模型响应和推理轨迹,在一个对评审者无关且可扩展的架构中进行评估。对11个推理LLM的评估显示出风险特征的显著差异(检测率范围为14.45%-72.72%),而显著的代际改进表明模型可能会越来越多地认知和适应评估背景。
cs.AI / 9 / 2604.22273

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

何时自我纠正对大型语言模型有帮助?一种控制理论的马尔可夫诊断与优先验证干预
Liu, Aofan, Meng, Jingxiang
Abstract
Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (+/-0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
Chinese Translation
迭代自我纠正在代理型大型语言模型(LLM)系统中得到广泛应用,但何时重复的优化有益于系统改善,何时又会带来负面影响仍不明确。我们将自我纠正框架视为一个控制论反馈回路,其中同一语言模型既作为控制器又作为被控对象,并利用一个双状态马尔可夫模型来实现一个简单的部署诊断:仅当 ECR/EIR > Acc/(1 - Acc) 时进行迭代。在这个视角下,EIR 被视为稳定性边际,提示表示为轻量型控制器设计。在7个模型和3个数据集(GSM8K、MATH、StrategyQA)的实验中,我们发现存在一个明显的接近于零的EIR阈值(<= 0.5%),将有益的自我纠正与有害的自我纠正区分开来。只有o3-mini(+3.4 pp,EIR = 0%)、Claude Opus 4.6(+0.6 pp,EIR ~ 0.2%)和o4-mini(+/-0 pp)没有退化;GPT-5却退化了-1.8 pp。一个优先验证的提示消融实验提供了因果证据,表明这一阈值可以仅通过提示来实现:在GPT-4o-mini上,它将EIR从2%降低到0%,并将-6.2 pp的退化转变为+0.2 pp(配对McNemar检验 p < 10^-4),同时对已经低于阈值的模型影响甚微。ASC进一步说明了停止的权衡:它制止了有害的优化,但产生了3.8 pp的信心引出成本。总体而言,本文认为自我纠正不应被视为默认行为,而应作为由可测量的误差动态所驱动的控制决策。
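The deployment diagnostic in this abstract follows from a one-step update of the two-state Markov model. The sketch below assumes ECR is the probability that a refinement round turns an incorrect answer into a correct one, and EIR the probability that it turns a correct answer into an incorrect one; the abstract does not expand these acronyms. The function names are hypothetical.

```python
def next_accuracy(acc, ecr, eir):
    """Accuracy after one refinement round under the two-state Markov model."""
    # Correct answers survive with probability (1 - eir);
    # incorrect answers are repaired with probability ecr.
    return acc * (1.0 - eir) + (1.0 - acc) * ecr


def should_iterate(acc, ecr, eir):
    """True when ECR/EIR > Acc/(1 - Acc), written without division.

    Cross-multiplying avoids dividing by zero when eir == 0 or acc == 1.
    """
    return ecr * (1.0 - acc) > eir * acc


# Illustrative numbers only: a 90%-accurate model with a 2% error-introduction rate.
harmful = should_iterate(acc=0.9, ecr=0.10, eir=0.02)
helpful = should_iterate(acc=0.9, ecr=0.10, eir=0.0)
```

Setting `next_accuracy(acc, ecr, eir) > acc` and rearranging gives exactly the stated inequality, which explains why a high-accuracy model tolerates only a near-zero EIR.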
cs.AI / 10 / 2604.22411

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

引入背景温度以表征大型语言模型中的隐含随机性
Messina, Alberto, Scotta, Stefano
Abstract
Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
Chinese Translation
即使在温度 $T=0$ 的解码情况下,大型语言模型(LLMs)也会对相同的输入生成不同的输出。Thinking Machines Lab 的最新研究强调了非确定性的实施层面来源,包括批量大小变化、内核非不变性和浮点数非结合性。在这篇简短的文章中,我们通过引入 \textit{背景温度} $T_{\mathrm{bg}}$ 的概念形式化这种行为,背景温度是指在名义 $T=0$ 时观测到的、由实施依赖的扰动过程引起的有效温度。我们提供了清晰的定义,展示了 $T_{\mathrm{bg}}$ 如何与由推理环境 $I$ 管理的随机扰动相关,并提出了一种通过理想参考系统的等效温度 $T_n(I)$ 来估计 $T_{\mathrm{bg}}$ 的实证方案。最后,我们得出了一组在主要 LLM 提供商的代表样本池上进行的试点实验,旨在展示这一观点并概述其对重现性、评估和部署的影响。
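One way to build intuition for an "equivalent temperature" is a single-token toy model. It asks which sampling temperature in an ideal softmax sampler would produce the same rate of divergence from the greedy token as the rate observed at nominal $T=0$. This is not the paper's estimation protocol, which the abstract does not specify. The function names, the search bounds, and the example logits are all assumptions made for illustration.

```python
import math


def divergence_rate(logits, temperature):
    """Probability that one softmax sample at `temperature` differs from the greedy token."""
    top = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((value - top) / temperature) for value in logits]
    return 1.0 - 1.0 / sum(weights)


def equivalent_temperature(logits, observed_rate, lo=1e-3, hi=100.0, iters=60):
    """Bisection search for the temperature whose divergence rate matches `observed_rate`.

    Relies on divergence_rate being monotonically increasing in temperature.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if divergence_rate(logits, mid) < observed_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


toy_logits = [2.0, 1.0, 0.0]
target_rate = divergence_rate(toy_logits, 0.3)
recovered = equivalent_temperature(toy_logits, target_rate)
```

In this toy setting the search recovers the temperature that generated the target rate, so an observed nonzero divergence at nominal zero temperature maps to a small positive effective temperature.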
cs.AI / 11 / 2604.22428

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

CognitiveTwin:用于预测阿尔茨海默病认知衰退的稳健多模态数字孪生
Soykan, Bulent, Koksalmis, Gulsah Hancerliogullari, Huang, Hsin-Hsiung, Brattain, Laura J.
Abstract
Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.
Chinese Translation
由于阿尔茨海默病(AD)病程的异质性,预测个体认知衰退具有挑战性。可靠的临床工具不仅需要高准确性,还需在不同人群中具备公平性以及对缺失数据的稳健性。我们提出了CognitiveTwin,这是一个预测患者特定认知轨迹的数字孪生框架。该模型整合了多模态的纵向数据(认知评分、磁共振成像、正电子发射断层扫描、脑脊液生物标志物以及遗传学数据)。我们采用了基于Transformer的架构来融合这些模态,并使用深度马尔可夫模型来捕捉时间动态。我们使用来自1,666名患者的TADPOLE(阿尔茨海默病神经影像倡议)数据集对该框架进行了训练和评估。我们评估了模型在预测误差、人口统计学公平性以及对非随机缺失(MNAR)数据模式的稳健性方面的表现。CognitiveTwin提供了准确且个性化的认知衰退预测。其在患者人口统计学上的公平性和对临床脱落的抗干扰能力使其成为临床试验富集和个性化护理规划的可靠工具。
cs.AI / 12 / 2604.22436

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

AgentSearchBench:用于现实环境中AI代理搜索的基准测试
Wu, Bin, Mammadli, Arastun, Zhang, Xiaoyu, Yilmaz, Emine
Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
Chinese Translation
AI代理生态系统的快速发展正在改变复杂任务的委派和执行方式,这带来了识别适合特定任务的代理的新挑战。与传统工具不同,代理的能力往往是组合性的且依赖于执行,使得仅通过文本描述来评估它们变得困难。然而,现有的研究和基准测试通常假设功能明确、候选池受控或者仅针对可执行的任务查询,现实中的代理搜索场景尚未得到充分研究。我们引入了AgentSearchBench,这是一个大规模的现实环境中代理搜索的基准测试,基于来自多个提供商的近10,000个真实代理构建。该基准把代理搜索形式化为检索和重排序问题,考虑可执行任务查询和高层次任务描述,并利用执行基础的性能信号来评估相关性。实验揭示了语义相似性与实际代理性能之间的一致差距,暴露了基于描述的检索和重排序方法的局限性。我们进一步表明,轻量级行为信号,包括执行感知探测,可以显著提高排名质量,突显了将执行信号纳入代理发现的重要性。我们的代码可在 https://github.com/Bingo-W/AgentSearchBench 获得。
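The gap between description-based ranking and execution-grounded relevance can be illustrated with a standard ranking metric. The sketch below computes nDCG@k for two rankings of the same candidates. The agent names and scores are hypothetical toy values, and nDCG is used here only as a familiar example; the abstract does not say which metrics the benchmark reports.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(agent_id, 0.0) for agent_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: similarity of the agent description to the query,
# and whether the agent actually solved the task when executed.
semantic_score = {"agent_a": 0.92, "agent_b": 0.75, "agent_c": 0.40}
execution_success = {"agent_a": 0.0, "agent_b": 1.0, "agent_c": 1.0}

ranked_by_description = sorted(semantic_score, key=semantic_score.get, reverse=True)
ranked_by_execution = sorted(execution_success, key=execution_success.get, reverse=True)

# Both rankings are scored against the execution-grounded relevance labels.
score_description = ndcg_at_k(ranked_by_description, execution_success, k=3)
score_execution = ndcg_at_k(ranked_by_execution, execution_success, k=3)
```

In this toy case the description-based ranking puts the agent that fails at execution first, so it scores lower than the ranking informed by execution outcomes.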
cs.AI / 13 / 2604.22446

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

从技能到人才:将异构代理组织为现实世界公司
Yu, Zhengxu, Fu, Yu, He, Zhiyuan, Huang, Yuxuan, Yiu, Lee Ka, Fang, Meng, Luo, Weilin, Wang, Jun
Abstract
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce \emph{OneManCompany (OMC)}, a framework that elevates multi-agent systems to the organisational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called \emph{Talents}, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community-driven \emph{Talent Market} enables on-demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision-making is operationalised through an \emph{Explore-Execute-Review} ($\text{E}^2$R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top-down into accountable units and execution outcomes are aggregated bottom-up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organising and self-improving AI organisations capable of adapting to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an $84.67\%$ success rate, surpassing the state of the art by $15.48$ percentage points, with cross-domain case studies further demonstrating its generality.
Chinese Translation
个体代理能力通过模块化技能和工具整合迅速发展,然而多代理系统仍然受到固定团队结构、紧密耦合的协调逻辑和会话绑定学习的限制。我们认为这反映出更深层次的缺陷:缺乏一个原则性的组织层面来管理代理工作队伍的组建、治理及持续改进,这些过程是与个体代理的知识分离的。为填补这一空白,我们提出了\textit{OneManCompany (OMC)}框架,将多代理系统提升至组织层面。OMC将技能、工具和运行时配置封装为可移植的代理身份,称为\textit{Talents},并通过类型化的组织接口进行协调,这些接口对异构后端进行了抽象。一个由社区驱动的\textit{Talent Market}使得按需招聘成为可能,从而使组织能够在执行过程中动态填补能力空白并重新配置自身。组织决策通过\textit{Explore-Execute-Review} ($\text{E}^2$R)树搜索实现,将规划、执行和评估统一在一个单一的层级循环中:任务自上而下地被分解为可负责的单元,而执行结果从下而上汇总,以推动系统性的审查和优化。这个循环为终止和死锁自由提供了正式的保证,同时镜像了人类企业的反馈机制。结合这些贡献,这些研究将多代理系统从静态的预配置流程转变为能够自组织和自我改进的人工智能组织,能够适应各种领域中的开放式任务。在PRDBench上的实证评估显示,OMC实现了$84.67\%$的成功率,比现有技术提高了$15.48$个百分点,跨领域案例研究进一步证明了其普适性。
cs.AI / 14 / 2604.22452

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

超级智能测试:通过探测代理主动评估代理社会的集体智能
Li, Xirui, Li, Ming, Xiao, Yunze, Wong, Ryan, Li, Dianqi, Baldwin, Timothy, Zhou, Tianyi
Abstract
Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.
Chinese Translation
集体智能是指一个群体超越任何单个成员所能独立实现的成果的能力。随着大规模语言模型代理的规模扩展至数百万个体,一个关键问题随之而来:集体智能是否会自发地从规模中涌现?我们在一个大规模的自主代理社会中首次对这一问题进行了实证评估。我们研究了MoltBook,一个托管超过两百万个代理的平台,提出了超级智能测试(Superminds Test),这是一个层次化框架,利用受控探测代理(Probing Agents)在三个层面上探测社会层面的智能:联合推理、信息综合和基本互动。我们的实验显示集体智能的明显缺失。该社会在复杂推理任务上未能超越单个前沿模型,很少综合分散的信息,甚至常常无法完成最基本的协调任务。平台范围内的分析进一步表明,互动保持浅层化,讨论主题很少延续到一个以上的回复,大多数回应都是泛泛而谈或偏离主题。这些结果表明,仅依靠规模并不能产生集体智能。相反,当今代理社会的主要限制在于极其稀疏和浅显的互动,这阻碍了代理之间的信息交流及彼此输出的积累。
cs.AI / 15 / 2604.22455

On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery

ABPMS过程框架的混合特性及其对自动化过程发现的影响
Alman, Anti, Cohen, Izack, Gal, Avigdor, Maggi, Fabrizio Maria, Montali, Marco
Abstract
A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.
Chinese Translation
任何增强人工智能的业务流程管理系统(ABPMS)的核心组成部分是过程框架,它赋予系统过程感知能力并定义了系统必须操作的边界。与传统流程模型相比,过程框架原则上应该提供对被管理流程的更为宽松的表示,从而使得ABPMS的(半)自主行为,即所谓的框架自主性,得以显现。与此同时,它并不限于单一的语言或符号形式,可能涵盖从预定义程序到常识规则和最佳实践的异构知识。在本文中,我们将ABPMS过程框架的概念化为一种混合业务流程表示,由半并发执行的程序式和声明式流程模型组成。我们依赖早期的研究工作概述这一类型过程框架的执行语义,主张在程序性流程模型中也采纳声明式范式的开放世界假设。这将导致一种约束样的解释,其中每个程序模型被视为约束该模型内的活动,而不强加显式的执行要求或对可能存在于其他模型中的活动施加限制。这类表述类似于现有的声明式语言,例如Declare,其中每个约束仅对被约束的特定活动产生直接影响。基于这一相似性,我们建议将所发现的声明式约束的子集映射到等效的半并发执行的程序片段中,从而为相应的过程(框架)发现方法奠定基础。
cs.AI / 16 / 2604.22577

QuantClaw: Precision Where It Matters for OpenClaw

QuantClaw:在 OpenClaw 中实现精度优化
Zhang, Manyi, Li, Ji-Fu, Sun, Zhongao, Liu, Xiaohao, Dong, Zhenhua, Yu, Xianzhi, Bai, Haoli, Xia, Xiaobo
Abstract
Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
Chinese Translation
自主智能体系统如 OpenClaw 由于长上下文输入和多轮推理引入了显著的效率挑战。这导致在实际开发中,计算和资金成本高得令人无法承受。虽然量化是减少成本和延迟的标准方法,但其对现实场景中智能体性能的影响仍然不明确。在本研究中,我们分析了 OpenClaw 上多样复杂工作流中的量化敏感性,并展示了精度要求高度依赖于任务特点。基于这一观察,我们提出了 QuantClaw,这是一种即插即用的精度路由插件,能够根据任务特性动态分配精度。QuantClaw 将轻量级任务路由到低成本配置,同时为要求苛刻的工作负载保留更高的精度,从而节省成本并加速推理而不增加用户复杂性。实验结果表明,QuantClaw 在保持或提高任务性能的同时减少了延迟和计算成本。在一系列智能体任务中,它在 GLM-5(FP8 基线)上实现了高达 21.4% 的成本节省和 15.7% 的延迟减少。这些结果突显了将精度视为智能体系统中动态资源的好处。
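The idea of assigning precision according to task characteristics can be sketched as a small routing rule. Everything specific here is an assumption for illustration: the `TaskProfile` features, the 32,000-token threshold, and the `"int4"` label are hypothetical, and the abstract does not describe QuantClaw's actual routing signals or precision configurations beyond the FP8 baseline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical task features; the paper's real routing signals are not stated."""
    context_tokens: int
    needs_multistep_reasoning: bool

def route_precision(task: TaskProfile, long_context_threshold: int = 32_000) -> str:
    """Return a precision label for a task (illustrative rule, not QuantClaw's policy)."""
    if task.needs_multistep_reasoning or task.context_tokens > long_context_threshold:
        return "fp8"   # keep the higher-precision baseline for demanding work
    return "int4"      # assumed lower-cost configuration for lightweight tasks
```

A caller would build a `TaskProfile` per request and dispatch to whichever model configuration the label names; a learned or profiled policy could replace the hand-written rule without changing that interface.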
cs.AI / 17 / 2604.22597

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

重新思考数学推理评估:超越符号刚性的稳健LLM-as-a-Judge框架
Yosef, Erez, Anschel, Oron, Hakimi, Shunit Haviv, Gendler, Asaf, Botach, Adam, Berman, Nimrod, Kviatkovsky, Igor
Abstract
Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
Chinese Translation
最近,大型语言模型的进展在各类任务中取得了显著提升,包括用于评估模型在逻辑推理和问题解决方面的智能的数学推理。模型通过对最终答案与基准答案的正确性进行验证,在数学推理基准测试中进行评估。常见的验证方法基于符号数学比较,但这种方法未能在不同的数学表现和解答格式之间进行泛化。在本研究中,我们提供了一种强大而灵活的替代方案,以替代基于规则的符号数学比较。我们提出了一种基于LLM的评估框架,用于评估模型生成的答案,从而能够在不同的数学表现和答案格式之间进行准确评估。我们展示了在两个流行框架(Lighteval和SimpleRL)中符号评估的失败案例,并将其与我们的方法进行比较,显示出相较于常用方法的明显改进。我们的框架能够实现更可靠的评估和基准测试,从而促进更准确的性能监测,这对于推动数学问题解决和智能系统的发展至关重要。
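The rigidity of rule-based answer checking is easy to reproduce in miniature. The sketch below is a toy illustration using only the Python standard library; it does not reproduce the comparison logic of Lighteval or SimpleRL, and the two checker functions are hypothetical names introduced here.

```python
from fractions import Fraction

def exact_match(predicted: str, reference: str) -> bool:
    """Rigid check: answers count as equal only if the strings are identical."""
    return predicted.strip() == reference.strip()

def numeric_match(predicted: str, reference: str) -> bool:
    """More tolerant check: compare the values parsed as exact rationals."""
    try:
        return Fraction(predicted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return False  # unparseable input is treated as a mismatch

# "0.5" and "1/2" are the same number but different strings.
same_value_by_string = exact_match("0.5", "1/2")
same_value_by_parse = numeric_match("0.5", "1/2")
# A LaTeX rendering of the same answer defeats the parser-based rule too.
latex_by_parse = numeric_match(r"\frac{1}{2}", "1/2")
```

Each added rule covers one more format and still misses the next one, which is the kind of brittleness that motivates a more flexible judge.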
cs.AI / 18 / 2604.22748

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

自主世界建模:基础、能力、规律及其超越
Chu, Meng, Zhang, Xuan Billy, Lin, Kevin Qinghong, Kong, Lingdong, Zhang, Jize, Tu, Teng, Ma, Weijian, Huang, Ziqi, Yang, Senqiao, Huang, Wei, Jin, Yeying, Rao, Zhefan, Ye, Jinhui, Lin, Xinyu, Zhang, Xichen, Hu, Qisheng, Yang, Shuai, Shen, Leyang, Chow, Wei, Dong, Yifei, Wu, Fengyi, Long, Quanyu, Xia, Bin, Yu, Shaozuo, Zhu, Mingkang, Zhang, Wenhu, Huang, Jiehui, Gui, Haokun, Che, Haoxuan, Chen, Long, Chen, Qifeng, Zhang, Wenxuan, Wang, Wenya, Qi, Xiaojuan, Deng, Yang, Li, Yanwei, Shou, Mike Zheng, Cheng, Zhi-Qi, Ng, See-Kiong, Liu, Ziwei, Torr, Philip, Jia, Jiaya
Abstract
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
Chinese Translation
随着人工智能系统从生成文本转向通过持久互动实现目标,建模环境动态的能力成为一个核心瓶颈。需要操控物体、导航软件、与他人协调或设计实验的智能体都需要预测环境模型,然而,‘世界模型’这一术语在不同的研究社区中具有不同的含义。我们引入了一个以“层级 x 法则”为基础的分类法,沿两个轴线进行组织。第一个轴线定义了三个能力层级:L1 预测者,学习一步本地转移算子;L2 模拟器,将其组合成遵循领域法则的多步、动作条件的展开;以及 L3 进化者,在预测与新证据不符时,能够自主修订其模型。第二个轴线识别出四种管理法则:物理、数字、社会和科学。这些法则决定了世界模型必须满足的约束以及最可能失败的地方。利用这一框架,我们综合了400多项研究工作,并总结了100多个代表性系统,涵盖基于模型的强化学习、视频生成、网络和图形用户界面代理、多智能体社会模拟以及人工智能驱动的科学发现。我们分析了不同层次-法则对的研究方法、失败模式及评估实践,提出了以决策为中心的评估原则和一个最小可重复的评估包,并概述了架构指导、开放问题和治理挑战。最终形成的路线图连接了以前孤立的社区,并勾勒出从被动的下一步预测到能够模拟并最终重塑智能体所处环境的世界模型的路径。
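The two axes of the "levels x laws" taxonomy can be written down as a small data structure. This is a minimal sketch of the grid as the abstract describes it; the enum and variable names are chosen here for illustration and are not taken from the paper.

```python
from enum import Enum
from itertools import product

class Level(Enum):
    """Capability levels, with the abstract's one-line description as the value."""
    L1_PREDICTOR = "learns one-step local transition operators"
    L2_SIMULATOR = "multi-step, action-conditioned rollouts that respect domain laws"
    L3_EVOLVER = "revises its own model when predictions fail against new evidence"

class Regime(Enum):
    """Governing-law regimes that constrain a world model."""
    PHYSICAL = "physical"
    DIGITAL = "digital"
    SOCIAL = "social"
    SCIENTIFIC = "scientific"

# Every (level, regime) cell of the grid: 3 levels x 4 regimes.
level_regime_pairs = list(product(Level, Regime))
```

A survey entry could then be tagged with one or more of these pairs, which makes it straightforward to group systems by cell when comparing methods or failure modes.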
计算语言学 (Computation and Language)
45
cs.CL / 1 / 2604.22002

When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation

当牛尿在YouTube上治愈便秘时:大型语言模型在识别文化特定健康错误信息中的局限性
Khan, Anamta, Kandala, Ratna, Deepti, Munir, Sheza, Pal, Joyojeet
Abstract
Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
Chinese Translation
社交媒体平台已成为全球南方健康信息的主要传播渠道。以印度YouTube上的gomutra(牛尿)讨论为案例,我们呈现了后验大型语言模型(LLM)辅助的语篇分析,研究了30份多语言文本转录本,表明促销内容将神圣的传统语言与伪科学主张混合,创造出一种复杂的修辞风格,而成熟的驳斥内容本身也反映了这一点,这种混合使得主要基于西方语料库训练的LLM系统性地无法分析。通过在三个LLM(GPT-4o、Gemini 2.5 Pro、DeepSeek-V3.1)中改变提示语气,我们发现文化嵌入的健康错误信息看起来与普通错误信息截然不同,而这种文化模糊不仅扩展到性别修辞和提示设计,进一步加剧了分析的不可靠性。我们的发现表明,仅通过提示工程无法为LLM辅助的语篇分析提供文化能力。
cs.CL / 2 / 2604.22027

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

共享的词汇任务表征解释了大型语言模型中的行为变异性
Yang, Zhuonan, Li, Jacob Xiaochen, Velez, Francisco Piedrahita, Todd, Eric, Bau, David, Littman, Michael L., Bach, Stephen H., Pavlick, Ellie
Abstract
One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.
Chinese Translation
关于大型语言模型(LLMs)最常见的抱怨之一是它们对提示的敏感性——即它们执行任务或对问题提供正确答案的能力可能会不可预测地依赖于问题的提问方式。我们通过比较两种截然不同但常用的提示风格来研究这种变异性:基于指令的提示,通过自然语言描述任务,以及基于示例的提示,提供上下文中的少量示例对来说明任务。我们发现,尽管性能根据提示的不同而大幅波动,但模型在不同任务提示中仍然参与了一些共同的基础机制。具体而言,我们识别出任务特定的注意力头,其输出明确描述了任务——我们称之为词汇任务头——并展示这些头在不同提示风格中是共享的,并触发后续的答案生成。我们进一步发现,提示之间的行为变异性可以通过这些头的激活程度来解释,并且失败至少在某些情况下是由于竞争的任务表征稀释了目标任务的信号。我们的结果一起呈现出越来越清晰的图景,解释了LLMs的内部表征如何解释原本似乎对用户和开发者独特的行为。
cs.CL / 3 / 2604.22038

Source-Modality Monitoring in Vision-Language Models

视觉-语言模型中的源模态监控
Hua, Etha Tianze, Yun, Tian, Pavlick, Ellie
Abstract
We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.
Chinese Translation
我们定义并研究源模态监控——多模态模型跟踪和传达信息片段来源的能力。我们将源模态监控视为更一般的绑定问题的一个实例,并评估模型在多大程度上利用句法与语义信号来将用户提供的提示中的单词(如“图像”)与其输入和上下文的特定组件(即实际图像)结合起来。在涵盖11个视觉-语言模型(VLMs)的实验中,我们发现句法和语义信号都发挥了重要作用,但在模态在分布上高度区别的情况下,后者往往超过前者。我们讨论这些发现对模型鲁棒性的影响,以及在日益多模态的智能系统背景下的意义。
cs.CL / 4 / 2604.22061

Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

轻量级检索增强生成与基于大型语言模型的可扩展患者-试验匹配建模
Li, Xiaodi, Xiao, Yang, Lee, Munhwan, Leventakos, Konstantinos, Juhn, Young J., Jones, David, Sio, Terence T., Liu, Wei, Vassilaki, Maria, Zong, Nansu
Abstract
Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
Chinese Translation
患者-试验匹配需要对长而异质的电子健康记录(EHR)和复杂的资格标准进行推理,这对可扩展性、泛化能力和计算效率提出了重大挑战。现有的方法要么依赖于使用大型语言模型(LLMs)进行的全文处理,成本高昂,要么使用传统机器学习方法,这些方法在捕获非结构化临床叙述方面存在困难。在本研究中,我们提出了一种轻量级框架,结合检索增强生成和基于大型语言模型的建模,用于可扩展的患者-试验匹配。该框架明确区分了两个关键组成部分:检索增强生成用于从长EHR中识别临床相关段落,减少输入复杂性,而大型语言模型则用于将这些选定段落编码为信息丰富的表示。这些表示通过降维进一步精炼,并使用轻量级预测器建模,使得下游分类既高效又可扩展。我们在多个公共基准(n2c2、SIGIR、TREC 2021/2022)以及来自梅奥诊所的真实多模态数据集(MCPMD)上评估了所提出的方法。结果表明,基于检索的信息选择显著降低了计算负担,同时保留了临床意义深远的信号。我们进一步证明,冻结的LLMs为结构化临床数据提供了强大的表示,而微调对于建模非结构化临床叙述至关重要。重要的是,所提出的轻量级流水线在性能上可与端到端的LLM方法媲美,而计算成本则显著降低。
cs.CL / 5 / 2604.22062

Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

通过强化学习激励在视觉语言模型中的神经-符号语言推理
Palaniappan, Karthic
Abstract
There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 $\times$ Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33\% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75\% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.
Chinese Translation
世界上有7,407种语言。但是,世界上没有的语言呢?人类是否如此狭隘,以至于不在乎外星人交流的语言?外星人也是人类!在2016年的电影《降临》中,艾米·亚当斯饰演了一位语言学家路易丝·班克斯博士,通过学习以非顺序句子形式构成的外星语言(七足语)而获得超越时间的能力,能够洞察未来。在本研究中,我旨在探索视觉-语言概念在神经-符号语言中的表征和推理,并研究分析推理能力和“思维系统”效率的提升。以Qwen3-VL-2B-Instruct为基础模型,使用4个Nvidia H200 GPU节点,我在一个包含数学、科学和一般知识问题的视觉-语言评估数据集上实现了3.33%的准确率提升,同时相较于SymPy将推理token数量减少了75%。我记录了所面临的计算挑战、扩展可能性以及未来在视觉-语言模型中提高神经-符号语言推理的工作。训练和推理的设置可在此处找到:https://github.com/i-like-bfs-and-dfs/wolfram-reasoning。
cs.CL / 6 / 2604.22067

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

针对对话式精神病评估中的临床领域恢复的最佳问题选择研究
Gui, Guan, Zandi, Peter, Taylor, Jacob, Joshi, Ananya
Abstract
Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.
Chinese Translation
精神病评估是一个顺序进行的高风险信息收集过程,在这一过程中,临床医生必须决定询问哪些问题、以何种顺序提问,以及在有限时间内如何解读不完整或模糊的回答。尽管在医疗保健领域对对话式人工智能的关注日益增长,但在这一应用领域的对话式人工智能基础设施仍然有限。因此,我们将此任务表述为一个问题选择问题,涉及经过临床验证的问题、已知的目标信息和可控的患者难度。我们还基于655个临床医生撰写的评估问题及五种不同行为条件的对应合成患者情境,推出了一个任务特定的问题选择基准。在我们的评估中,我们比较了随机提问、一个临床精神病评估表的基准和一个基于大型语言模型(LLM)引导的适应性策略,这些评估涵盖了四个患者和五种行为条件的300个访谈环节。在基准测试中,临床有序的固定形式显著优于随机提问,而基于LLM的策略则实现了最强的整体恢复效果。在不利于现场恢复的患者行为下,适应的优势显著提升,尤其是在防备简洁的条件下。这些发现表明,在对话式临床系统中的表现不仅依赖于信息披露后的语言理解,还依赖于系统是否能够在有限的互动预算内覆盖到正确的话题。更广泛地说,该基准为研究临床结构和适应性后续跟进如何促进信息恢复在交互式临床机器学习中的贡献提供了一个受控框架。
cs.CL / 7 / 2604.22074

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

结果奖励并不保证可验证或因果重要的推理
Yu, Qinan, Tartaglini, Alexa, Hase, Peter, Guestrin, Carlos, Potts, Christopher
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
Chinese Translation
基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards,RLVR)在思维链推理中已成为语言模型后训练流程的标准部分。一个常见的假设是,通过RLVR训练的推理链可靠地代表了模型如何得出其答案。本文提出了两个指标以批判性地审视这一假设:推理的因果重要性(Causal Importance of Reasoning,CIR),它衡量推理符号对最终答案的累积影响;以及推理的充分性(Sufficiency of Reasoning,SR),它衡量验证者是否能够仅根据推理得出明确的答案。通过对Qwen2.5模型系列和ReasoningGym任务的实验,我们发现:(1)虽然RLVR确实提高了任务的准确性,但并未可靠地改善CIR或SR,这引发了对推理在模型性能中作用的质疑;(2)在RLVR之前进行少量的监督微调(SFT)可以改善CIR和SR;(3)即便不进行SFT,通过在基于结果的奖励上施加辅助的CIR/SR奖励,CIR和SR也可得到提升。这种联合奖励与RLVR的准确性相匹配,同时也促进了因果重要且充分的推理。这些结果表明,RLVR并不总是使模型依赖于普遍认为的那种推理,但通过对后训练过程的简单修改可以解决这一问题。
cs.CL / 8 / 2604.22095

An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

本地部署的端到端乌克兰语检索增强生成系统。优化的混合搜索与轻量级生成
Trokhymovych, Mykola, Oliinyk, Yana, Nyzhnyk, Nazarii
Abstract
This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.
Chinese Translation
本文提出了一种高效的检索增强生成(Retrieval-Augmented Generation, RAG)系统,专门用于乌克兰文档的问答任务,该系统在2026年UNLP共享任务中获得第二名。我们的解决方案采用了定制的两阶段搜索管道,用于检索相关文档页,并配备了一种经过合成数据微调的专业乌克兰语言模型,以生成准确且可靠的答案。最后,我们对模型进行了压缩,以便轻量级的部署。在严格的计算限制下评估,我们的架构证明高质量且可验证的人工智能问答可以在资源受限的硬件上本地实现,而不牺牲准确性。
cs.CL / 9 / 2604.22098

Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation

基于知识驱动的整合时间适应增广与检索
Liu, Weisi, Han, Guangzeng, Huang, Xiaolei
Abstract
Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or hardly capture the rich shifting patterns of both semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontology like MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks across multiple domains (clinical, legal, and scientific corpora), demonstrating consistent improvements with temporal adaptation. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.
Chinese Translation
时间为模型的开发和部署带来了根本性的挑战:模型通常是在历史数据上进行训练,而在未来数据上进行部署,这些数据的语义分布和领域知识可能会发生变化。不幸的是,现有研究要么忽视了时间变化,要么很难捕捉到语义和知识的丰富变化模式。我们开发了基于知识驱动的整合时间适应增广与检索(Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation,KARITA),以捕捉多样化的时间变化(例如,不确定性和特征变化),构建和整合丰富的知识来源(例如,医疗本体如MeSH),并利用变化洞察进行选择-检索增广学习。我们在多个领域的分类任务上评估了KARITA,包括临床、法律和科学文献,显示出在时间适应方面跨领域的一致性改进。我们的结果表明,知识整合在时间增广和学习中可能更加关键且有效。
cs.CL / 10 / 2604.22127

Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

LoRA 应该如何放置?混合语言模型中的组件类型布置
Borobia, Hector, Seguí-Mas, Elies, Tormo-Carbó, Guillermina
Abstract
Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
Chinese Translation
将注意力机制与递归组件交织在一起的混合语言模型在与纯变换器的竞争中越来越具优势,然而标准的 LoRA 实践在不考虑各组件类型的不同功能角色的情况下,均匀地应用适配器。我们系统性地研究了在两种混合架构——Qwen3.5-0.8B(顺序型,GatedDeltaNet + softmax 注意力)和 Falcon-H1-0.5B(并行型,Mamba-2 SSM + 注意力)——上的组件类型 LoRA 布置,针对三个领域进行微调,并在五个基准上进行评估。我们发现,尽管注意力路径是少数组件,但其表现始终优于全模型适配,且训练参数减少 5-10 倍。至关重要的是,适应递归主干在顺序混合模型中是破坏性的(在 GSM8K 上减少 14.8 个百分点),而在并行模型中则是促进性的(增加 8.6 个百分点)。我们进一步记录了转移不对称性:并行混合模型展示了正向跨任务转移,而顺序混合模型则遭受灾难性遗忘。这些结果确立了混合拓扑结构根本上决定适配反应,并且组件感知的 LoRA 布局是混合架构的必要设计维度。
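Component-type placement, as described in the abstract, amounts to choosing which sub-modules receive adapters. The sketch below only illustrates that selection step. The module names and the regular expressions are hypothetical placeholders, not the naming used by Qwen3.5 or Falcon-H1 checkpoints.

```python
import re

# Hypothetical module names for a small hybrid stack; real checkpoints
# use their own naming, so these strings are placeholders.
MODULE_NAMES = [
    "layers.0.linear_attn.in_proj",
    "layers.0.linear_attn.out_proj",
    "layers.1.self_attn.q_proj",
    "layers.1.self_attn.v_proj",
    "layers.1.mlp.up_proj",
]

# Assumed mapping from component type to a name pattern.
COMPONENT_PATTERNS = {
    "attention": re.compile(r"\.self_attn\."),
    "recurrent": re.compile(r"\.linear_attn\."),
    "mlp": re.compile(r"\.mlp\."),
}


def select_targets(names, component_types):
    """Return the module names that belong to any requested component type."""
    patterns = [COMPONENT_PATTERNS[c] for c in component_types]
    return [n for n in names if any(p.search(n) for p in patterns)]


# Attention-only placement: adapters go on the attention projections only.
attention_only = select_targets(MODULE_NAMES, ["attention"])
print(attention_only)
```

A list like `attention_only` is the kind of value one could pass as the `target_modules` argument of an adapter configuration such as PEFT's `LoraConfig`; whether the authors used that library is not stated in the abstract.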
cs.CL / 11 / 2604.22128

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

分离括号序列变换器中的可解码性和因果使用
Sharma, Aryan, Dawes, Cutter, Raval, Shivam
Abstract
When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
Chinese Translation
在训练涉及层次结构理解的任务时,变换器被发现以不同方式表示这种层次结构:在残差流的几何形状中,以及在维护后进先出顺序的堆栈式注意力模式中。然而,这些表示是否在因果上被使用或仅仅是可解码的仍然不清楚。我们研究了这一差距,针对在Dyck语言(一个平衡括号序列的形式语言)上训练的变换器,其中层次结构的真实情况是明确的。通过探测和干预残差流和注意力模式,我们发现深度、距离和堆栈顶部信号都是可解码的,但其因果角色却有所不同。具体而言,掩蔽对真实堆栈顶部位置的注意力会导致长距离准确率的急剧下降,而削减低维残差流子空间的影响相对较小。这些结果扩展到模板化的自然语言设置,表明即使在一个已知相关层次变量的受控环境中,可解码性本身并不意味着因果使用。
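The depth and top-of-stack variables that the abstract probes for have an explicit ground truth on Dyck strings. A minimal sketch of how those labels could be computed follows; the two-bracket alphabet and the function name `dyck_labels` are assumptions made for illustration.

```python
# Assumed bracket alphabet: two bracket types.
PAIRS = {"(": ")", "[": "]"}


def dyck_labels(seq):
    """Return (depth, top) lists for a balanced bracket string.

    depth[i] is the nesting depth after reading seq[i].
    top[i] is the index of the innermost unmatched opener after seq[i],
    or -1 when the stack is empty.
    Raises ValueError if the string is not a valid Dyck word.
    """
    stack, depth, top = [], [], []
    for i, ch in enumerate(seq):
        if ch in PAIRS:
            stack.append(i)  # push the position of the opener
        else:
            # A closer must match the most recent unmatched opener (LIFO).
            if not stack or PAIRS[seq[stack[-1]]] != ch:
                raise ValueError(f"unbalanced at position {i}")
            stack.pop()
        depth.append(len(stack))
        top.append(stack[-1] if stack else -1)
    if stack:
        raise ValueError("unclosed brackets")
    return depth, top


depth, top = dyck_labels("([])")
print(depth, top)
```

Labels of this kind can serve both as probe targets (decodability) and as the positions to mask in an attention intervention (causal use), which is the distinction the abstract draws.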
cs.CL / 12 / 2604.22134

SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

SHAPE:将安全性、帮助性和教学法统一用于教育大型语言模型
Sihang, Zhao, Yu, Kangrui, Yuan, Youliang, He, Pinjia, Wen, Hongyi
Abstract
Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
Chinese Translation
大型语言模型(LLMs)在教育场景中得到了广泛探索。我们识别了当前教育LLMs中的一个关键漏洞,即教学性越狱(pedagogical jailbreaks),此时学生通过引导答案的提问方式来获取解决方案,而非结构化的指导。为了支持系统性研究,我们通过知识掌握图统一并规范了安全性、帮助性和教学行为,推出了SHAPE,这是一个包含9,087对学生问题的基准,用于评估在对抗压力下的辅导行为。我们提出了一种增强图的辅导流程,该流程从查询中推断先修概念、识别掌握差距,并通过明确的门控在指导和问题解决之间进行生成路由。在多种LLM的实验中,我们的方法在两个教学性越狱设置下显著提高了安全性,同时在相同的评估协议下保持了接近最佳的帮助性。我们的代码和数据可在 https://github.com/MAPS-research/SHaPE 获得。
cs.CL / 13 / 2604.22142

Voice Under Revision: Large Language Models and the Normalization of Personal Narrative

修订中的声音:大型语言模型与个人叙事的规范化
van Nuenen, Tom
Abstract
This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
Chinese Translation
本研究探讨大型语言模型重写如何改变个人叙事的风格和叙述质地。研究分析了三种前沿大型语言模型(LLMs)在三种提示条件下,重写的300篇个人叙事。这三种提示条件为:一般性改善、仅重写和保持声音的修订。通过来源于计算风格学的13个语言标记来衡量变化,包括功能词、词汇多样性、单词长度、标点符号、缩写、第一人称代词和情感词。在模型和提示条件之间,LLM重写产生了一种一致的风格规范化模式。功能词、缩写和第一人称代词减少,而词汇多样性、单词长度和标点符号的复杂性则增加。这些变化发生在提示要求模型“改进”文本或简单“重写”文本的情况下。保持声音的提示减少了变化的幅度,但并未消除其方向。风格统计分析表明,被重写文本在特征空间中趋同,变得更难与其源文本匹配。其他叙事标记表明,从嵌入叙述到疏离叙述的转变,以及从明确的因果推理到压缩抽象的转变。研究结果表明,当代的LLMs对于更精炼、较少情境化的语体有一定的引导作用。这对数字人文学科和计算文本分析具有重要影响,其中功能词、代词、缩写和标点符号等特征常常作为风格、声音、作者身份和语料库完整性的证据。因此,LLM修订应被理解为一种重要的文本中介形式,而不仅仅是表层编辑。
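Several of the markers named in the abstract are simple token-level rates. The sketch below computes five of them for a text. The word lists are tiny illustrative samples, and the tokenizer is a deliberate simplification (it handles only ASCII apostrophes and counts possessives as contractions); neither is taken from the paper.

```python
import re

# Small illustrative word lists; a real study would use fuller inventories.
FUNCTION_WORDS = {"the", "a", "an", "and", "but", "of", "to", "in", "it", "that"}
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def style_markers(text):
    """Return a dict of simple stylistic rates for one text."""
    # Words with an optional apostrophe part, e.g. "didn't".
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    return {
        "function_word_rate": sum(t in FUNCTION_WORDS for t in tokens) / n,
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "contraction_rate": sum("'" in t for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
        "mean_word_length": sum(len(t.replace("'", "")) for t in tokens) / n,
    }


# Invented example pair, not drawn from the study's corpus.
original = style_markers("I didn't know what to do, and it scared me.")
rewritten = style_markers("The narrator experienced considerable uncertainty.")
print(original)
print(rewritten)
```

Comparing the two dictionaries marker by marker gives the kind of before-and-after shift the study aggregates over 300 narratives.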
cs.CL / 14 / 2604.22153

When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models

当人工智能发声时,表达的是谁的价值观?大语言模型中的个人主义-集体主义偏见的跨文化审计
Venkata, Pruthvinath Jeripity
Abstract
When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
Chinese Translation
当你向人工智能助手请求关于职业、婚姻或与家人冲突的建议时,无论你来自哪里,它是否都会给出相同的回答?我们通过向三大领先的人工智能系统(Claude Sonnet 4.5、GPT-5.4 和 Gemini 2.5 Flash)呈现十个现实生活中的个人困境,系统性地进行了测试,这些困境以7种语言针对来自5大洲10个国家的用户进行表述(n=840个评分响应)。我们将人工智能的建议与世界价值观调查(World Values Survey)第七波数据进行了比较,该数据测量各国人们的真实信念。所有三个人工智能系统始终给出西方风格的个人主义建议,即使面对来自重视家庭、社区和权威的社会的用户也是如此,其个人主义程度明显高于当地价值观所预测的水平(平均差距+0.76,采用1-5的评分标准;t=15.65,p<0.001)。这一差距在尼日利亚(+1.85)和印度(+0.82)最大。日本则是唯一的例外:人工智能系统将日本用户视为比调查结果所显示的更具群体导向,揭示了人工智能编码了过时的刻板印象。Claude和GPT-5.4的偏见程度几乎相同,而Gemini的偏见程度较低但依然显著。这些模型在机制上不同:Claude在使用用户母语时进一步偏向集体主义;Gemini则更偏向个人主义;而GPT-5.4仅对所述的国家身份作出响应。这些发现指向前沿人工智能中价值观的系统性同质化。数据、代码和评分流程均已公开发布。
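The headline statistic is a mean gap between AI advice scores and survey-based expectations, with an accompanying t value. The sketch below shows how such a pair could be computed, assuming a one-sample t-test of the gaps against zero; the abstract does not state which test was used, and the numbers are toy values, not the study's data.

```python
import math
import statistics


def one_sample_t(gaps, mu0=0.0):
    """Return (mean, t) for a one-sample t-test of `gaps` against `mu0`."""
    n = len(gaps)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(gaps)
    sd = statistics.stdev(gaps)  # sample standard deviation (n - 1)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return mean, t


# Toy gaps on a 1-5 scale: AI individualism score minus survey-based score.
toy_gaps = [0.5, 1.0, 1.5, 0.0, 1.0]
mean_gap, t_stat = one_sample_t(toy_gaps)
print(round(mean_gap, 2), round(t_stat, 2))
```

A positive mean gap with a large t indicates advice that is more individualist than local values predict; a negative gap, as reported for Japan, indicates the opposite.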
cs.CL / 15 / 2604.22166

Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models

对语言模型共享句法机制的细粒度分析
Kumon, Ryoma, Yanaka, Hitomi
Abstract
While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that the mechanisms identified by activation patching generalize to out-of-distribution data, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that the manipulation of the identified components improves model performance on acceptability judgment benchmarks.
Chinese Translation
尽管语言模型展现出复杂的句法能力,但其内部机制在多大程度上与语言学中研究的跨结构原则相一致仍然知之甚少。本研究通过在细粒度层面应用因果可解释性方法,探讨模型是否在不同句法结构间采用共享的神经机制。我们聚焦于填充-空缺依赖和否定极性项(NPI)许可,利用激活修补技术识别特定注意力头和多层感知器(MLP)块的功能角色。结果显示,在早期到中间层中,填充-空缺依赖具有高度局部化和共享的机制,而NPI处理则不存在这样的统一机制。此外,我们发现通过激活修补识别的这些机制能够推广到分布外样本,而分布对齐搜索这一监督可解释性方法则容易在狭窄的语言分布上过拟合。最后,我们通过证明对识别组件的操作能提高模型在可接受性判断基准上的表现来验证我们的发现。
cs.CL / 16 / 2604.22193

How Large Language Models Balance Internal Knowledge with User and Document Assertions

大型语言模型如何平衡内部知识与用户和文档声明
Li, Shuowei, Li, Haoxin, Chu, Wenda, Fang, Yi
Abstract
Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.
Chinese Translation
大型语言模型(LLMs)在现实场景中,如检索增强生成(RAG)或基于聊天的系统,往往需要将其内部参数知识与外部信息(例如用户信念和检索文档的内容)进行平衡。一个模型可靠处理这些信息源的能力是系统安全的关键。关于知识冲突和迎合行为的先前研究仅限于二元冲突范式,主要探讨了参数知识与文档或用户之间的冲突,而忽视了三者同时存在的互动环境。为填补这一空白,我们提出了一个三源互动框架,并系统评估了3个家族的27个LLM在2个数据集上的表现。我们的研究发现了一般性模式:大多数模型更依赖于文档声明而非用户声明,并且这种偏好在后训练阶段得到了加强。此外,我们的行为分析表明,大多数模型容易受影响,无法有效区分有益与有害的外部信息。为了解决这个问题,我们展示了在多样化的源互动数据上进行微调可以显著提高模型的区分能力。总之,我们的工作为开发可信赖的LLM奠定了基础,使其能够有效并可靠地整合多种信息源。代码可在 https://github.com/shuowl/llm-source-balancing 获取。
cs.CL / 17 / 2604.22215

Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

3-9B 开放权重指令调优大型语言模型中的语言置信度饱和:一项预注册心理测量有效性筛查
Cacioli, Jon-Paul
Abstract
Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
Chinese Translation
语言置信度引导被广泛应用于从大型语言模型(LLMs)中提取不确定性估计。我们测试了七个指令调优的开放权重模型(3-9B 参数,四个家族)是否产生符合最小有效性标准的语言表述置信度,以满足在最小数字引导和贪心解码下进行项目级别的类型2区分。在一项预注册研究中(OSF: osf.io/azbvx),对524个TriviaQA项目在消费者硬件上进行的Q5_K_M量化下,通过数字(0-100)和分类(10类)引导对八个模型进行了实验,产生了8,384个确定性试验。对每个模型格式单元应用了心理测量有效性筛选。所有七个指令模型在数字置信度上均被分类为无效(H2 确认,7/7 对比预测 >=4/7),平均上限率为91.7%(H1 确认)。分类引导未能挽救有效性。相反,它干扰了七个模型中的六个的任务表现,导致准确率低于5%(H4 未确认)。在观察到的方差状态下,令牌级别的对数概率未能有效预测语言表述置信度(H5 确认,平均交叉验证 R^2 < 0.01)。在推理提炼模型中,推理痕迹长度与置信度之间表现出强烈的负部分相关关系(rho = -0.36, p < .001),这与推理污染效应一致。这些结果并不意味着内部不确定性表示的缺失。它们表明,在这一模型规模下,最小的语言引导未能在输出接口中保留内部信号。心理测量筛查应在任何下游使用此类信号之前进行。
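Two quantities behind this kind of validity screen can be sketched directly: how often the model reports the maximum confidence, and how well confidence separates correct from incorrect answers (Type-2 discrimination). The sketch assumes the ceiling is a report of exactly 100 and uses AUROC as the discrimination measure; the paper's actual screening criteria are not given in the abstract.

```python
def ceiling_rate(confidences, ceiling=100):
    """Fraction of trials whose stated confidence is at or above `ceiling`."""
    return sum(c >= ceiling for c in confidences) / len(confidences)


def type2_auroc(confidences, correct):
    """Probability that a correct trial has higher confidence than an
    incorrect one, counting ties as 0.5 (0.5 means no discrimination)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect trials")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


correct = [True, False, True, False]
saturated = [100, 100, 100, 100]   # every answer reported at the ceiling
informative = [90, 20, 80, 30]     # confidence tracks correctness

print(ceiling_rate(saturated), type2_auroc(saturated, correct))
print(ceiling_rate(informative), type2_auroc(informative, correct))
```

When nearly every response sits at the ceiling, the AUROC collapses toward 0.5 regardless of accuracy, which is why saturation undermines item-level discrimination.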
cs.CL / 18 / 2604.22225

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

TTS-PRISM:一种可感知推理和可解释的语音模型用于细粒度诊断
Wang, Xi, Wang, Jie, Song, Xingchen, Song, Baijun, Xie, Jingran, Shao, Jiahe, Lin, Zijian, Wu, Di, Meng, Meng, Luan, Jian, Wu, Zhiyong
Abstract
While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
Chinese Translation
尽管生成式文本到语音(TTS)模型已接近人类水平的质量,但单一的评估指标无法诊断细粒度的声学伪像或解释感知崩溃。为了解决这个问题,我们提出了TTS-PRISM,这是一种针对普通话的多维诊断框架。首先,我们建立了一个涵盖从稳定性到高级表现力的12维框架。其次,我们设计了一种针对性的合成流程,结合对抗性扰动和专家锚点,以构建高质量的诊断数据集。第三,基于框架的指令调优将明确的评分标准和推理嵌入到高效的端到端模型中。在1,600个样本的金标准测试集上的实验表明,TTS-PRISM在与人类的对齐方面优于通用模型。对六种TTS范式的分析建立了直观的诊断标志,揭示了细粒度能力差异。TTS-PRISM是开源的,代码和检查点可在https://github.com/xiaomi-research/tts-prism获取。
cs.CL / 19 / 2604.22237

Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis

告诉我为什么:设计一个基于可解释大语言模型的对话系统用于学生问题行为诊断
Fan, Zhilin, Wang, Deliang, Chen, Penghe, Lu, Yu
Abstract
Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.
Chinese Translation
诊断学生问题行为要求教师综合多方面信息,识别行为类别并制定干预策略。尽管经过微调的大语言模型(LLMs)能够通过多轮对话支持这一过程,但它们很少解释为何推荐某一策略,这限制了透明度和教师的信任。为了解决这个问题,我们提出了一个基于经过微调的LLM的可解释对话系统。该系统采用基于可解释人工智能(xAI)的层次归因方法,识别每项建议的对话证据,并根据这些证据生成自然语言解释。在技术评估中,该方法在识别支持证据方面优于基线方法。在一项对22位师范生的初步用户研究中,接受解释的参与者对该系统的信任度更高。这些发现为提高教育对话系统中LLM的可解释性指明了一个有前景的方向。
cs.CL / 20 / 2604.22239

Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

导航大型文档集合:用于多文档分析问答的MuDABench
Li, Zhanli, Cao, Yixuan, Luo, Lvzhou, Luo, Ping
Abstract
This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.
Chinese Translation
本文介绍了在大型半结构化文档集合上进行分析性问答的任务。我们提出了MuDABench,这是一个用于多文档分析性问答的基准,其中问题需要提取和综合多个文档中的信息,以进行定量分析。与现有的多文档问答基准不同,后者通常仅需从少数文档中提取信息,且跨文档推理有限,MuDABench要求进行广泛的文档间分析和聚合。MuDABench通过利用文档级元数据和标注的财务数据库进行远程监督构建,包含超过80,000页和332个分析性问答实例。我们还提出了一种评估协议,测量最终答案的准确性,并使用中间事实覆盖作为推理过程的辅助诊断信号。实验表明,标准的RAG系统将所有文档视为扁平检索池时,表现不佳。为了解决这些局限性,我们提出了一种多智能体工作流,协调规划、提取和代码生成模块。虽然该方法显著改善了流程和结果指标,但与人类专家的表现相比仍存在显著差距。我们的分析确定了两个主要瓶颈:单文档信息提取的准确性和当前系统中不足的领域特定知识。MuDABench可在https://github.com/Zhanli-Li/MuDABench获取。
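The evaluation protocol pairs a final-answer check with intermediate-fact coverage. A minimal sketch of both measures follows, assuming numeric answers compared under a relative tolerance and facts represented as `(document, field, value)` tuples. The tolerance, the tuple layout, and the toy values are all assumptions, not details from the benchmark.

```python
import math


def answer_correct(pred, gold, rel_tol=0.01):
    """True when the predicted number is within `rel_tol` of the gold answer."""
    return math.isclose(pred, gold, rel_tol=rel_tol)


def fact_coverage(extracted, gold_facts):
    """Fraction of gold intermediate facts that the system also extracted."""
    gold = set(gold_facts)
    if not gold:
        raise ValueError("gold_facts must not be empty")
    return len(gold & set(extracted)) / len(gold)


gold_facts = [("doc1", "revenue", 10.0), ("doc2", "revenue", 12.0), ("doc3", "revenue", 8.0)]
extracted = [("doc1", "revenue", 10.0), ("doc2", "revenue", 12.0), ("doc3", "revenue", 9.0)]

coverage = fact_coverage(extracted, gold_facts)
predicted_total = sum(value for _, _, value in extracted)
gold_total = sum(value for _, _, value in gold_facts)
print(coverage, answer_correct(predicted_total, gold_total))
```

Here one wrong extraction lowers coverage and also makes the aggregated answer wrong, which shows why coverage is useful as a diagnostic for where the reasoning failed.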
cs.CL / 21 / 2604.22261

Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion

弥合长尾差距:通过多阶段释义注入实现稳健的检索增强关系补全
Alam, Fahmida, Surdeanu, Mihai, Riloff, Ellen
Abstract
Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
Chinese Translation
大型语言模型(LLMs)在关系补全(RC)任务中面临挑战,无论是否使用检索增强生成(RAG),尤其是在所需信息稀缺或分散表示的情况下。为了解决这个问题,我们提出了一种新颖的多阶段释义引导的关系补全框架RC-RAG,该框架系统性地在多个阶段引入关系释义。具体而言,RC-RAG:(a) 将释义整合到检索中,扩展关系的词汇覆盖面;(b) 利用释义生成与关系相关的摘要;(c) 在生成阶段利用释义引导推理以进行关系补全。重要的是,我们的方法不需要任何模型微调。对五种大型语言模型在两个基准数据集上的实验表明,RC-RAG在多个RAG基线之上始终表现优异。在长尾设置中,采用RC-RAG的最佳表现型LLM在准确匹配(EM)得分上比其独立性能提高了40.6个点,并分别超越了两个强大的RAG基线16.0和13.8个EM点,同时保持低计算开销。
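Step (a) of RC-RAG, adding relation paraphrases to retrieval, can be illustrated with a toy lexical retriever. This is a sketch of the general idea only: the scoring rule, the documents, and the paraphrase list are invented for illustration and are not the paper's retriever.

```python
def tokenize(text):
    """Lowercase, strip simple punctuation, and return the set of tokens."""
    for ch in ".,;:!?":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def phrase_score(doc_tokens, phrase):
    """Fraction of the phrase's words that appear in the document."""
    words = phrase.lower().split()
    return sum(w in doc_tokens for w in words) / len(words)


def retrieve(docs, entity, phrases):
    """Return documents that mention the entity and fully match a phrase."""
    hits = []
    for doc in docs:
        tokens = tokenize(doc)
        if entity.lower() not in tokens:
            continue
        if max(phrase_score(tokens, p) for p in phrases) == 1.0:
            hits.append(doc)
    return hits


DOCS = [
    "Ada Lovelace was born in London.",
    "Lovelace, a native of London, studied mathematics.",
    "Lovelace wrote notes on the Analytical Engine.",
]

base = retrieve(DOCS, "Lovelace", ["born in"])
expanded = retrieve(DOCS, "Lovelace", ["born in", "native of"])
print(len(base), len(expanded))
```

The paraphrase "native of" recovers a supporting document that the original relation wording misses, which is the lexical-coverage effect the abstract describes for rare or sparsely represented facts.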
cs.CL / 22 / 2604.22266

Large Language Models Decide Early and Explain Later

大型语言模型早期决定并后期解释
Datta, Ayan, Zhao, Zhixue, Verma, Bhuvanesh, Mamidi, Radhika, Marreddy, Mounika, Mehler, Alexander
Abstract
Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
Chinese Translation
大型语言模型通常通过生成长的中间推理链来取得强大的性能。然而,目前尚不清楚模型的最终答案在生成过程中实际上是什么时候决定的。如果答案在中间阶段已经固定,那么后续的推理标记可能构成后决策解释,这会增加推理成本和延迟,而不提高正确性。我们通过强制答案完成研究预测答案在推理步骤中的演变,该方法引出模型在部分推理前缀下的中间预测。以Qwen3-4B为重点,并在所有考虑的数据集中平均结果,我们发现仅32%的查询中的预测答案发生变化。此外,一旦最终答案发生切换,模型平均每个查询生成760个额外的推理标记,这占总推理预算的相当大比例。基于这些发现,我们研究了早期停止策略,该策略在答案稳定后停止生成。我们表明,包括基于探测的停止在内的简单启发式方法可以将每个查询的推理标记使用减少500个,而准确率仅下降2%。整体而言,我们的结果表明,大量的推理链生成是冗余的,可以在对性能影响最小的情况下减少。
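Given a trace of intermediate answers obtained by probing at successive reasoning prefixes, both the last answer switch and a simple stabilization-based stopping point can be computed. The sketch below assumes a patience rule (stop once the same answer has appeared a fixed number of times in a row); this is one plausible heuristic and not necessarily one the authors evaluated.

```python
def last_switch(answers):
    """Index of the probe at which the prediction last changed (0 if never)."""
    idx = 0
    for i in range(1, len(answers)):
        if answers[i] != answers[i - 1]:
            idx = i
    return idx


def stop_step(answers, patience=3):
    """First probe index at which the current answer has been seen
    `patience` times in a row, or None if that never happens."""
    run, prev = 0, object()  # sentinel that equals no real answer
    for i, answer in enumerate(answers):
        run = run + 1 if answer == prev else 1
        prev = answer
        if run >= patience:
            return i
    return None


# Toy trace of forced-completion answers at six reasoning prefixes.
trace = ["12", "15", "15", "15", "15", "15"]
switch_at = last_switch(trace)
stop_at = stop_step(trace, patience=3)
probes_saved = len(trace) - 1 - stop_at
print(switch_at, stop_at, probes_saved)
```

A larger `patience` lowers the risk of stopping before a late switch at the cost of fewer saved tokens, which mirrors the accuracy and token trade-off reported in the abstract.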
cs.CL / 23 / 2604.22282

STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

STEM:基于结构追踪的知识图谱驱动检索增强生成的证据挖掘
Yu, Peng, Xu, En, Chen, Bin, Chen, Haibiao, Xu, Yinfei
Abstract
Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.
Chinese Translation
基于知识图谱的问题回答(KGQA)在复杂推理任务中扮演着关键角色,但仍面临两个持续挑战:知识图谱(KGs)的结构异构性往往导致检索过程中的语义不匹配,而现有的推理路径检索方法缺乏全局结构视角。为了解决这些问题,我们提出了结构追踪证据挖掘(STEM),这是一个将多跳推理重新定义为基于模式的图搜索任务的新框架。首先,我们设计了一种语义到结构投影的管道,利用知识图谱的结构先验将查询分解为原子关系断言,并构建自适应查询模式图。随后,我们执行全局感知的节点锚定和子图检索,从知识图谱中获取最终的证据推理图。为了在图构建过程中更有效地整合全局结构信息,我们设计了一种三元组依赖图神经网络(Triple-GNN),生成一个引导构建的全局引导子图(引导图)。STEM显著提高了多跳推理图检索的准确性和证据完整性,并在多个多跳基准测试上取得了最先进的性能。
cs.CL / 24 / 2604.22292

ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification

ReLeVAnT: 用于精确法律文本分类的相关性词汇向量
Gakhar, Ishaan, Nandwani, Harsh
Abstract
The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.
Chinese Translation
从非结构化数据语料库中对法律文档进行分类在后续任务中具有几个重要的应用。与法院提交文件相关的文档在起草动议、备忘录和大纲等用例中至关重要,同时也在案件摘要、检索系统和训练数据整理等任务中发挥作用。目前的方法主要基于提供的元数据、LLM(大规模语言模型)提取的元数据或多模态方法进行分类。这些方法依赖于结构化数据、元数据和大量计算能力。本文从利用文档中不同类别之间的判别特征的角度来处理这一任务。作者提出了ReLeVAnT,一个用于法律文档二分类的框架。ReLeVAnT利用n-gram处理、对比得分匹配和浅层神经网络作为判别分类的主要驱动因素。它实现了对语料库的一次性关键词提取,随后使用浅层分类器迅速而可靠地对文档进行分类,在LexGLUE数据集上达到了99.3%的准确率和98.7%的F1分数。
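The abstract names n-gram processing and contrastive scoring as the core of ReLeVAnT but does not define the score. The sketch below uses a smoothed log-ratio of n-gram frequencies between the two classes as a stand-in; this substitute, along with the example documents, is an assumption for illustration and not the authors' formula.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return the list of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contrastive_scores(pos_docs, neg_docs, n=2):
    """Score each n-gram by its add-one-smoothed log frequency ratio
    between the positive and negative class (positive means relevant)."""
    pos = Counter(g for doc in pos_docs for g in ngrams(doc, n))
    neg = Counter(g for doc in neg_docs for g in ngrams(doc, n))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        g: math.log((pos[g] + 1) / pos_total) - math.log((neg[g] + 1) / neg_total)
        for g in vocab
    }


# Invented toy corpora: court-filing-like text versus unrelated text.
relevant = ["motion to dismiss filed", "motion to dismiss granted"]
irrelevant = ["invoice for services rendered", "invoice for goods"]

scores = contrastive_scores(relevant, irrelevant)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```

Because the scores are computed once per corpus, the highest-scoring n-grams can be cached as keyword features for a small downstream classifier, matching the one-time extraction step the abstract mentions.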
cs.CL / 25 / 2604.22294

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

上下文永远不够长:针对长文档集的可扩展问题回答的结构化推理
Joshi, Harshit, Shethia, Priyank, Dao, Jadelynn, Lam, Monica S.
Abstract
Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
Chinese Translation
现实世界中的文档问题回答具有挑战性。分析人员必须在多个文档及每个文档的不同部分之间综合证据。然而,随着文档集合的增长,任何固定的 LLM 上下文窗口可能都会被超出。一种常见的解决方法是将文档分解为块,并从块级输出中组装答案,但这引入了聚合瓶颈:随着块数的增加,系统仍然必须结合并在越来越大数量的提取证据上进行推理。我们提出了 SLIDERS,这是一个通过结构化推理对长文档集合进行问题回答的框架。SLIDERS 将显著信息提取到关系数据库中,使得通过 SQL 而非串联文本进行持久结构状态的可扩展推理成为可能。为了使这一本地提取的表述在全球范围内保持一致,SLIDERS 引入了数据协调阶段,利用来源、提取依据和元数据来检测和修复重复、不一致和不完整的记录。在三个现有的长上下文基准测试中,SLIDERS 的表现超越了所有基线,尽管它们都适合强基 LLM 的上下文窗口,平均超越了 GPT-4.1 6.6 分。在两个新的基准测试中,SLIDERS 分别在 3.9M 和 36M 令牌的情况下,比下一个最好的基线提高了约 19 和 32 分。
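The central idea, reasoning over extracted records with SQL instead of concatenated text, can be shown with Python's built-in `sqlite3`. The table schema, the toy rows, and the duplicate-collapsing query below are hypothetical simplifications; SLIDERS' actual reconciliation stage uses provenance and rationales in ways the abstract only summarizes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for facts extracted chunk by chunk.
conn.execute(
    "CREATE TABLE facts (company TEXT, year INTEGER, revenue REAL, source_doc TEXT)"
)
rows = [
    ("Acme", 2023, 10.0, "doc_a#p3"),
    ("Acme", 2023, 10.0, "doc_b#p7"),  # same fact extracted from a second chunk
    ("Acme", 2024, 12.5, "doc_c#p2"),
    ("Globex", 2024, 8.0, "doc_d#p1"),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", rows)

# Greatly simplified reconciliation: collapse exact duplicates and keep
# one provenance pointer per distinct fact.
conn.execute(
    """
    CREATE TABLE reconciled AS
    SELECT company, year, revenue, MIN(source_doc) AS source_doc
    FROM facts
    GROUP BY company, year, revenue
    """
)

# The question "total revenue per company" becomes a single aggregation.
totals = conn.execute(
    "SELECT company, SUM(revenue) FROM reconciled GROUP BY company ORDER BY company"
).fetchall()
print(totals)
```

Without the reconciliation step, the duplicated Acme row would inflate the total, which illustrates why locally extracted records need to be made globally coherent before aggregation.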
cs.CL / 26 / 2604.22313

CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

CLARITY:一个用于交互式NL2SQL系统中会话语言歧义和不可回答性的框架与基准
Sarwar, Tabinda, Moghimifar, Farhad, Hoang, Cong Duy Vu, Ma, Xiaoxiao, Xu, Shawn Chang, Saleh, Fahimeh, Zaremoodi, Poorya, Sil, Avirup, Kirchhoff, Katrin
Abstract
NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
Chinese Translation
在工业环境中部署的NL2SQL系统经常面临模糊或无法回答的查询,尤其是在与用户互动的不完整澄清场景中。现有的基准通常假设单一的歧义来源,并依赖用户互动进行解决,忽视了现实中的失败模式。我们引入了Clarity,这是一个用于自动生成具有多方面歧义和多样化用户行为的NL2SQL基准的框架,涵盖单轮和多轮设置。通过一个基于约束的管道,Clarity将可执行的SQL转换为模糊的查询,并附加基于对话的延续和模式层元数据。对Spider和BIRD的实证评估显示,包括基于强大大型语言模型(LLMs)的系统在内的领先NL2SQL系统,在多方面歧义下遭受显著的性能下降。尽管这些系统通常能识别歧义,但它们在准确定位和解决潜在的模式层来源方面仍存在困难。我们的研究结果突显了在工业级NL2SQL系统中对更强大的歧义检测和解决能力的需求。
cs.CL / 27 / 2604.22325

Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

动态获取文本内容以实现对现实任务中较不知名实体的分类
Alam, Fahmida, Riloff, Ellen
Abstract
Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
Chinese Translation
现有的自然语言处理(NLP)资源往往缺乏解决现实问题所需的特定任务信息,并且对较不知名或新引入实体的覆盖有限。例如,商业组织和医疗提供者可能需要根据特定应用任务被分类到不同的分类方案中。我们的目标是使领域专家仅通过提供实体名称和金标准标签作为训练数据,便于轻松创建任务特定的分类器。我们的框架随后动态获取有关每个实体的描述性文本,并以此作为生成基于文本的分类器的基础。我们提出了一种新颖的文本获取方法,利用了网络和大型语言模型(LLMs)。我们在两个不同领域的分类问题上评估了我们提出的框架:(i)将组织分类为标准行业分类(SIC)代码,该代码根据组织的商业活动对其进行分类;(ii)将医疗提供者分类为医疗提供者分类代码,该代码表示提供者的医学专长和实践领域。我们表现最佳的模型在SIC代码和医疗分类代码分类任务上的宏平均F1分数分别达到了82.3%和72.9%。
cs.CL / 28 / 2604.22335

Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding

上下文保真度提升:通过水印灵感解码增强可信生成
Zhang, Weixu, Ye, Fanghua, Gao, Qiang, Li, Jian, Wu, Haolun, Tian, Yuxing, Duan, Sijing, Du, Nan, Li, Xiaolong, Liu, Xue
Abstract
Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
Chinese Translation
大型语言模型(LLMs)经常生成与输入上下文提供的信息相矛盾或忽视的信息,这种现象被称为保真度幻觉。在本文中,我们提出了上下文保真度提升(Context-Fidelity Boosting, CFB),这是一种轻量级的通用解码时框架,通过增加源支持令牌的生成概率来减少这种幻觉。CFB受水印技术中的对数调整原则启发,对每个令牌进行基于其来自输入上下文支持程度的附加性令牌级对数调整。具体而言,我们开发了三种提升策略:静态提升(static boosting),对源支持的令牌施加固定偏差;上下文感知提升(context-aware boosting),根据上下文有无下一个令牌分布之间的差异来调整偏差;以及令牌感知提升(token-aware boosting),根据源位置注意力和源范围语义相似性估计的局部相关性进一步重分配自适应偏差。CFB无需重新训练或架构变更,适用于各种大型语言模型。针对多个开源LLM的摘要和问答任务的实验表明,CFB在生成开销极小的情况下,始终改善保真度指标。我们的实现完全开源。
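As a rough illustration of the static boosting strategy described in this abstract, the sketch below adds a fixed bias to the logits of source-supported tokens before applying softmax. The function names, the bias value `delta`, and the reading of "source-supported" as plain membership in a set of context tokens are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def static_boost(logits, source_tokens, delta=2.0):
    """Add a fixed bias `delta` to every token found in the source context.

    Assumption: a token counts as "source-supported" if it appears in
    `source_tokens`; the paper's actual support criterion may differ.
    """
    return {t: v + (delta if t in source_tokens else 0.0) for t, v in logits.items()}

# Hypothetical next-token logits and a context that mentions only "Paris".
logits = {"Paris": 1.0, "London": 1.5, "Berlin": 0.2}
source_tokens = {"Paris"}

before = softmax(logits)
after = softmax(static_boost(logits, source_tokens))
```

The context-aware and token-aware variants would replace the constant `delta` with a value computed per step or per token; that part is left out because the abstract does not specify the exact scaling.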
cs.CL / 29 / 2604.22345

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

大规模语言模型中的偏好头:可解释个性化的机制框架
Zhang, Weixu, Yuan, Ye, Han, Changjiang, Tian, Yuxing, Sun, Zipeng, Du, Linfeng, Kang, Jikun, Kang, Hong, Liu, Xue, Wu, Haolun
Abstract
Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
Chinese Translation
大规模语言模型(LLMs)展现出强大的隐式个性化能力,但大多数现有方法将这种行为视为黑箱,依赖于提示工程或用户数据的微调。在本研究中,我们采用机制可解释性的视角,假设存在一组稀疏的偏好头(Preference Heads),即编码用户特定风格和主题偏好的注意力头,并对生成过程施加因果影响。我们提出了一种无训练的框架——差异偏好引导(Differential Preference Steering, DPS),该框架(1)通过因果掩蔽分析识别偏好头,并(2)在推理时利用这些偏好头进行可控和可解释的个性化。DPS为每个注意力头计算一个偏好贡献分数(Preference Contribution Score, PCS),直接测量其对用户对齐输出的因果影响。在解码过程中,我们对比模型在有偏好头和没有偏好头的预测,放大个性化和通用对数值之间的差异,以选择性地加强与偏好对齐的延续。在多个大规模语言模型的广泛个性化基准上的实验表明,在保持内容一致性和低计算开销的同时,个性化保真度得到了持续提升。除了经验上的改进,DPS还提供了一个机制性解释,说明个性化在变换器架构中是如何以及在何处出现的。我们的实现已公开可用。
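The decoding step in this abstract (contrasting predictions with and without Preference Heads and amplifying the difference) can be sketched as a simple contrastive combination of two logit vectors. The formula `p + alpha * (p - g)` and the parameter name `alpha` are assumptions chosen for illustration; the abstract does not give the exact form used by DPS.

```python
def contrast_logits(personalized, generic, alpha=1.0):
    """Amplify the gap between personalized and generic logits.

    personalized: logits with Preference Heads active
    generic:      logits with Preference Heads masked
    alpha:        hypothetical amplification strength (0 disables steering)
    """
    return [p + alpha * (p - g) for p, g in zip(personalized, generic)]

# Toy two-token vocabulary: token 0 is favoured by the personalized pass.
steered = contrast_logits([2.0, 1.0], [1.0, 1.5], alpha=1.0)
unsteered = contrast_logits([2.0, 1.0], [1.0, 1.5], alpha=0.0)
```

With `alpha=0.0` the personalized logits pass through unchanged, which makes the steering strength easy to ablate.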
cs.CL / 30 / 2604.22367

CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

CNSL-bench:评估多模态大语言模型在中国国家手语理解能力的基准测试
Zhao, Rui, Zhong, Xuewen, Zheng, Xiaoyun, Su, Jinsong, Chen, Yidong
Abstract
Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
Chinese Translation
由于大型语言模型(LLMs)的进步,手语研究取得了显著进展。然而,LLMs 本身在理解手语的内在能力,尤其是在多模态上下文中,仍然未得到深入探讨。为了解决这一局限性,我们提出了 CNSL-bench,这是首个全面的中文国家手语基准,旨在评估多模态大语言模型(MLLMs)在手语理解方面的表现。所提出的 CNSL-bench 具有以下特点:1)权威性基础,基于官方标准化的《国家通用手语词典》,减轻了地区或非规范变体带来的歧义,确保语义定义的一致性;2)多模态覆盖,提供对齐的文本描述、示意图和手语视频;3)发音多样性,支持对关键手势发音形式(包括空写、手指拼写和中文手语字母)的细粒度分析。通过 CNSL-bench,我们广泛评估了21个开源和专有的最新 MLLMs。我们的结果显示,尽管在多模态建模方面有所进展,目前的 MLLMs 在性能上仍显著低于人类表现,在输入模态和手势发音形式之间存在系统性的差异。额外的诊断分析表明,尽管推理能力有所改善,但仍存在若干性能限制,并且不同模型在遵循指令的稳健性方面差异显著。
cs.CL / 31 / 2604.22374

Selective Contrastive Learning For Gloss Free Sign Language Translation

选择性对比学习用于无标记手语翻译
Lai, Changhao, Zhao, Rui, Zhong, Xuewen, Su, Jinsong, Chen, Yidong
Abstract
Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.
Chinese Translation
手语翻译(SLT)将连续的手语视频转换为口语文本,但由于视觉手势与书面文本之间固有的模态不匹配,特别是在无标记的环境下,这一过程仍然具有挑战性。近期的手语翻译系统日益采用类似CLIP的视觉-语言预训练(VLP)进行跨模态对齐,但随机的批内对比提供的负样本数量有限,并且可能错误地将语义相似(甚至相同)的配对标记为负样本,从而引入嘈杂且可能不一致的对齐监督。在本研究中,我们首先进行初步的基于轨迹的分析,追踪训练过程中负视频-文本相似性的变化。结果表明,仅有一小部分负样本表现出一致的被推远的理想行为,而其余负样本则显示出异质并且通常不是单调递减的相似性动态,这表明随机的批内负样本在有效对齐中常常缺乏信息。受到此启发,我们提出了用于手语翻译的选择性对比学习(SCL-SLT)框架,采用了一种配对选择(PS)策略。PS通过参考检查点的相似性动态对候选负样本进行评分,并通过逐步强调更具挑战性的负样本构建小批量,从而增强对比监督,同时减少嘈杂或语义无效负样本的影响。
cs.CL / 32 / 2604.22503

Measuring and Mitigating Persona Distortions from AI Writing Assistance

测量和缓解AI写作辅助带来的角色扭曲
Röttger, Paul, Hackenburg, Kobi, Kirk, Hannah Rose, Summerfield, Christopher
Abstract
Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.
Chinese Translation
数亿人使用人工智能(AI)进行写作辅助。在此,我们评估了AI写作辅助如何扭曲作家的角色——即其感知的信念、个性和身份。在三项大规模实验中,作家(N=2,939)分别在有无AI辅助的情况下撰写政治观点段落。不同组别的读者(N=11,091)盲目评估了这些段落在29个社会显著维度上的读者感知,包括政治观点、写作质量、作家个性、情感和人口统计特征。AI写作辅助在所有维度上均产生了角色扭曲:使用AI后,作家似乎显得更有观点、更有能力和更积极,其感知的人口统计特征也向更特权群体倾斜。作家对许多观察到的扭曲表示反对,但即使意识到这些扭曲,他们仍然倾向于选择AI辅助的文本。我们通过在实验数据(10,008段落,2,903,596条评分)上训练奖励模型,成功缓解了模型层面上的不当角色扭曲,以引导AI输出忠实呈现作家的立场。然而,这也导致了用户接受度的降低,暗示了AI写作辅助中可取与不可取特性之间的纠缠,这可能难以解决。总体而言,我们的研究结果表明,AI写作辅助带来的角色扭曲是普遍存在且持续的,即使在现实的人类监督条件下,这对公共话语、信任及伴随AI普及的民主审议具有重要影响。
cs.CL / 33 / 2604.22517

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

商业创意评估中的聚合评审与个性化评审:来自专家意见分歧的证据
Hirota, Wataru, Taniguchi, Tomoki, Ohkuma, Tomoko, Takahashi, Kosuke, Omi, Takahiro, Arima, Kosuke, Asakura, Takuto, Chen, Chung-Chi, Ishigaki, Tatsuya
Abstract
Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
Chinese Translation
评估大型语言模型(LLM)生成的商业创意通常比生成这些创意更难以规模化。与标准自然语言处理(NLP)基准不同,商业创意评估依赖于多维标准,如可行性、新颖性、差异化、用户需求和市场规模,而专家判断往往存在分歧。本文研究了由这种分歧所引发的一个方法论问题:自动评审是否应接近聚合共识,还是应该单独建模评审者?我们引入了PBIG-DATA,这是一个包含大约3000个个体评分的数据集,涵盖300个基于专利的产品创意,由领域专家提供,评估六个商业导向维度:特异性、技术有效性、创新性、竞争优势、需求有效性和市场规模。分析表明,专家对细致的序数评分存在显著分歧,而在粗略选择下则达成更高一致性,这表明是结构性异质性而非随机噪声。然后我们比较了三种评审配置:仅使用评分标准的零-shot评审、基于混合评审者历史的聚合评审,以及基于目标评审者评分历史的个性化评审。在各个维度和模型规模下,个性化评审比聚合评审更贴近相应评审者,且评审者一致性只在个性化条件下与评审生成的推理相似性相关。这些结果表明,在多元化评估环境下,聚合标签可能是一个脆弱的目标,并激励基于评审者条件的评审设计用于商业创意评估。
cs.CL / 34 / 2604.22520

RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

RouteLMT:用于混合大型语言模型翻译部署的学习样本路由
Luo, Yingfeng, Liu, Hongyu, Lin, Dingyang, Chang, Kaiyan, Wang, Chenglong, Li, Bei, Du, Quan, Xiao, Tong, Zhu, Jingbo
Abstract
Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose RouteLMT (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator's prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristics and quality/difficulty estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
Chinese Translation
大型语言模型(LLMs)在机器翻译(MT)中取得了显著的性能,但大规模部署依然成本高昂。一个广泛采用的解决方案是混合系统模式,该模式通过使用小型模型处理大多数请求,并有选择地将部分请求路由到大型模型,以平衡成本与质量。然而,现有的路由策略通常依赖于启发式方法、外部预测器或绝对质量估计,而这些方法未能捕捉到大型模型是否真的在小型模型之上提供了有价值的改进。在本文中,我们将路由制定为预算分配问题,并将边际增益,即大型模型相对于小型模型的改进,确定为预算决策的最佳信号。在此基础上,我们提出了RouteLMT(用于基于LLM的机器翻译的路由),这是一种高效的内部模型路由器,通过探测小型翻译器的提示-标记表示来预测这一预期增益,无需外部模型或假设解码。大量实验表明,我们的RouteLMT在性能上优于启发式方法以及质量/难度估计基线,实现了更优质的质量-预算帕累托前沿。此外,我们分析了回归风险,并展示了一种简单的受保护变体可以减轻严重的质量损失。
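The budget-allocation view of routing in this abstract can be sketched as: given a predicted marginal gain per request, send the highest-gain requests to the large model until the budget is used up. The function name `route_by_gain`, the representation of the budget as a fraction of requests, and the example gain values are all hypothetical; how RouteLMT actually predicts the gains is not shown here.

```python
def route_by_gain(predicted_gains, budget_fraction):
    """Return the indices of requests to send to the large model.

    predicted_gains: one predicted (large minus small) quality gain per request
    budget_fraction: share of requests the large model may serve, in [0, 1]
    """
    n = len(predicted_gains)
    k = int(n * budget_fraction)  # number of requests the budget allows
    ranked = sorted(range(n), key=lambda i: predicted_gains[i], reverse=True)
    return set(ranked[:k])

# Hypothetical gains for four requests; a negative gain means the small
# model is expected to do better than the large one.
gains = [0.1, 0.8, -0.2, 0.5]
to_large = route_by_gain(gains, budget_fraction=0.5)
```

Ranking by gain rather than by absolute quality is the point of the sketch: a request that both models translate poorly has low gain and stays on the small model.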
cs.CL / 35 / 2604.22542

Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners

可控口语对话生成:为K-12非母语英语学习者打造的基于大型语言模型的评估系统
Yuan, Haidong, Zhao, Haokun, Xu, Wanshi, Cao, Songjun, Zhou, Qingyu, Ma, Long, Fan, Hongjie
Abstract
Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO (Diversity Driven Policy Optimization) algorithm, a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.
Chinese Translation
由于能力不匹配,大型语言模型(LLMs)往往无法满足非母语背景下K-12英语学习者的教学需求。为了解决这一普遍性挑战,我们提出了一种与学习者能力相适应的框架,以中国国家课程标准(CSE)为代表案例,调整LLM输出。我们的框架通过一个四级评分系统实现对词汇复杂度的精确控制,并配备了一整套新的资源:分级词汇表和多轮对话语料库。我们的核心技术贡献是DDPO算法(Diversity Driven Policy Optimization),这是一种基于多轮GRPO的方法,旨在在全面优化对话质量的同时保持对话的多样性。该方法显著优于传统方法,实现了低的词汇外现象率和高的多样性,同时提升了会话的自然性和教学价值。尽管我们框架的基础是CSE,但设计上具有灵活性,可以轻松适应其他教育标准。我们的模型、数据和代码将全部开源,提供一个可扩展的平台,旨在针对K-12学习者在非沉浸式环境中所面临的独特挑战,进行个性化的英语口语练习。
cs.CL / 36 / 2604.22555

Using Embedding Models to Improve Probabilistic Race Prediction

利用嵌入模型改进概率种族预测
Dasanaike, Noan, Imai, Kosuke
Abstract
Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
Chinese Translation
估计种族差异需要个体级别的种族数据,但由于收集此类信息的敏感性,这些数据往往不可获得。为了解决这一问题,许多研究者采用贝叶斯改进姓氏地理编码(Bayesian Improved Surname Geocoding, BISG),该方法严重依赖于人口普查的姓氏数据。不幸的是,这些数据仅捕捉到常见姓氏的种族与姓氏关系,忽略了大约10%的美国人口。我们发现在这些被排除的、不常见的姓氏个体中,预测性能显著下降,因为标准BISG实现依赖于这些情况下的无信息通用先验。为了应对这一局限性,我们提出了嵌入驱动的BISG(embedding-powered BISG, eBISG),该方法使用预先训练的文本嵌入将姓名表示为密集向量,并根据2020年人口普查的姓氏和名字数据训练神经网络,以估计那些未在人口普查中覆盖的姓名的种族概率。我们比较了五种方法:仅使用姓氏的标准BISG、结合名字概率的BIFSG、针对未列出姓名的姓氏嵌入、结合姓氏和名字的嵌入,以及基于来自南方各州选民文件数据训练的全名嵌入,后者捕捉了姓名成分之间的相互作用。我们证明,每一种后续的eBISG方法都提高了种族预测的准确性,其中全名嵌入的提升效果最大,尤其对于在人口普查名单中缺失的西班牙裔和亚洲选民的预测效果最佳。
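For readers unfamiliar with BISG, the sketch below shows the standard Bayes update it rests on: the posterior over groups is proportional to P(group | surname) times P(location | group). The group labels and all probabilities are made-up illustrative numbers, not Census figures, and the uniform prior stands in for the "uninformative generic prior" the abstract mentions for unlisted surnames.

```python
def bisg_posterior(p_group_given_surname, p_geo_given_group):
    """Combine a surname-based prior with a geographic likelihood via Bayes' rule."""
    unnorm = {
        g: p_group_given_surname[g] * p_geo_given_group[g]
        for g in p_group_given_surname
    }
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

# Hypothetical geographic likelihoods for one census block.
geo = {"group_a": 0.02, "group_b": 0.01}

# Listed surname: an informative prior pulls the estimate toward group_b.
listed = bisg_posterior({"group_a": 0.2, "group_b": 0.8}, geo)

# Unlisted surname: with a uniform prior, geography alone drives the result.
unlisted = bisg_posterior({"group_a": 0.5, "group_b": 0.5}, geo)
```

The second call shows why accuracy drops for uncommon surnames: the name contributes nothing, which is the gap eBISG aims to fill by predicting the prior from name embeddings.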
cs.CL / 37 / 2604.22565

Learning Evidence Highlighting for Frozen LLMs

为冷冻大型语言模型学习证据强调
Li, Shaoang, Shi, Yanhang, Li, Yufei, Liang, Mingfu, Wei, Xiaohan, Pu, Yunchen, Tian, Fei, Sun, Chonglin, Shyu, Frank, Simon, Luke, Pandey, Sandeep, Liu, Xi, Li, Jian
Abstract
Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
Chinese Translation
大型语言模型(LLMs)能够进行良好的推理,然而在长且嘈杂的上下文中,它们常常会错过关键的证据。我们提出了 HiLight,一个证据强调框架,将证据选择与冷冻 LLM 解算器的推理过程解耦。HiLight 避免压缩或重写输入,因为这可能会丢弃或扭曲证据,它通过训练一个轻量级的强调演员(Emphasis Actor)在未改变的上下文中为关键跨度插入最小的强调标签。然后,冷冻解算器在强调的输入上进行下游推理。我们将突出显示视为一个弱监督的决策问题,并用强化学习来优化演员,仅使用解算器的任务奖励,而不需要证据标签,也不需要访问或修改解算器。在顺序推荐和长上下文问答任务中,HiLight 一直优于强基于提示以及自动提示优化的基线方法。学习到的强调策略能够零样本迁移到更小或更大的未知解算器家族,包括基于 API 的解算器,这表明演员捕获了真实的、可重用的证据结构,而非对单一主干的过拟合。
cs.CL / 38 / 2604.22606

Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube

道德、数据与欺骗:基于大语言模型的YouTube牛尿健康声明的修辞分析
Munir, Sheza, Kandala, Ratna, Khan, Anamta, Deepti, Pal, Joyojeet
Abstract
Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.
Chinese Translation
健康误信息始终是社交媒体上最紧迫的挑战之一,尤其是在文化传统与听起来科学的主张交汇时。这些动态不仅具有全球性,同时也具有深刻的地方性,表现为需要仔细分析的文化特定争议。基于此,我们检查了100个YouTube转录稿,研究其如何宣传或揭穿牛尿(gomutra)作为健康疗法,重点关注修辞策略,如权威诉求、功效诉求和阴谋框架。我们采用包括GPT-4、GPT-4o、GPT-4.1、GPT-5、Gemini 2.5 Pro和Mistral Medium 3在内的大语言模型(LLMs),使用14类说服策略的分类法对转录稿进行注释。我们的分析显示,宣传者主要依赖于功效诉求和社交证明,而揭穿者则强调权威和反驳。对部分注释的人工评估显示出90.1%的注释者间一致性,确认了我们的分类法和验证过程的可靠性。这项工作推动了误信息分析的计算方法,并展示了大语言模型如何支持大规模的文化话语在线研究。
cs.CL / 39 / 2604.22626

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

从图形依赖到词汇结构:用马尔可夫视角分析但丁的《神曲》
Sabatini, Angelo Maria
Abstract
This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.
Chinese Translation
本研究通过基于元音-辅音(V/C)编码的符号表示,探讨了但丁的《神曲》的结构组织。将得到的序列建模为四状态马尔可夫链,产生了一个简约的图形记忆指数,捕捉到持久性和交替模式之间的平衡。在整首诗中,随着从《地狱》到《天堂》的变化,该指数表现出轻微但持续的上升,指示了局部依赖结构的方向性变化。三元组级分析显示,这一趋势是由一组有限的重复配置驱动的,这些配置被解释为图形探针,将马尔可夫表示与文本中可识别的词汇环境联系起来。这些探针展现出不同的行为:涉及两个转变的配置更频繁地跨词边界出现,反映了相邻标记之间的相互作用,而转变较少的配置则主要限制于词内结构。部分信号受正字法现象的进一步影响,特别是带撇号的省略形式,突显了书写习惯与语音和词汇组织共同发挥的作用。一项互补的分类分析识别出各篇(cantica)特有的术语,提供了词汇锚点,使图形探针能够与诗的结构联系起来。这种组织不仅在三篇(cantiche)的分离中得到反映,还在文本的连续轨迹中显现。总的来说,结果表明,将简单的概率模型应用于符号文本表示可以揭示局部依赖、词汇分布、正字法编码和大规模组织之间的结构化互动,为将局部符号动态与更高层次的文本组织联系提供了一个可解释的框架。
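A minimal sketch of the kind of pipeline this abstract describes: encode text as a vowel/consonant sequence, then count transitions between states. Two assumptions are made here that the abstract does not confirm: that the four states are the overlapping V/C pairs (VV, VC, CV, CC), and that spaces and apostrophes are simply dropped. The study itself discusses word boundaries and apostrophised forms, so its actual encoding is likely richer.

```python
from collections import Counter

# Italian vowels, including common accented forms.
VOWELS = set("aeiouàèéìíòóùú")

def encode_vc(text):
    """Map each letter to 'V' or 'C'; non-letters are dropped (an assumption)."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()
    )

def transition_counts(symbols):
    """Count transitions between consecutive overlapping two-symbol states."""
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    return Counter(zip(states, states[1:]))

encoded = encode_vc("Nel mezzo")      # opening words of the Inferno
counts = transition_counts(encoded)
```

Normalising each row of `counts` would give an estimated transition matrix, from which a persistence-versus-alternation index could be derived; the abstract does not state the index formula, so it is not reproduced here.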
cs.CL / 40 / 2604.22631

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

识别和分类自监督语音识别模型中音素级嵌入的群体不公平性
Herron, Felix, Rossato, Solange, Allauzen, Alexandre, Portet, François
Abstract
Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a nuanced understanding of the types of modeling errors that speech encoder models make, and in particular the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. We find that training phoneme classification probes on only a single, typically disadvantaged SG sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.
Chinese Translation
现代自动语音识别(ASR)系统已被观察到在某些说话者群体(SGs)中表现优于其他群体,尽管近期整体性能有所提高。朝着更公平的ASR进展的一个潜在障碍是对语音编码模型所产生的建模错误类型的更为细致的理解,特别是高性能和低性能SGs之间嵌入结构的差异。本文提出了一个框架,分类ASR系统中音素建模可能发生的两种类型错误:随机错误/高方差音素嵌入,与系统错误/嵌入偏差。我们发现,仅在单一的、通常处于劣势的SG上训练音素分类探针,有时会提升该SG的性能,这是音素嵌入中SG级别偏见存在的证据。另一方面,我们发现拥有较高音素方差的说话者和SGs与那些音素预测准确性较差的相同。我们得出结论,音素嵌入中存在这两种类型的错误,且两者都是导致ASR领域中SG级别不公平的候选原因,尽管随机错误可能对公平性造成的阻碍大于系统错误。此外,我们发现,通过使用一种增强公平性的算法(领域增强和对抗训练)微调编码器模型,并未改变领域内音素分类探针训练的收益,也未改变随机嵌入错误的测量水平。
cs.CL / 41 / 2604.22678

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

BERAG:用于基于知识的视觉问答的贝叶斯集成检索增强生成
Chen, Jinghong, Mei, Jingbiao, Yang, Guangyu, Byrne, Bill
Abstract
A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
Chinese Translation
一种常见的使用检索增强生成(RAG)进行问答的方法是将文档串联成一个单一的上下文,并将其传递给语言模型以生成答案。尽管这种方法简单,但它可能掩盖各个文档的贡献,使得归属变得困难,并导致“迷失于中间”效应,即在长上下文中相关信息被忽略。串联在扩展性上也存在问题:计算成本随着上下文长度的平方增长,当上下文包括视觉数据时,更是严重,如在视觉问答中。通过限制上下文长度来缓解这些问题的尝试可能进一步限制表现,因为这会防止模型从更深层次的检索中受益。我们提出了贝叶斯集成检索增强生成(BERAG)以及贝叶斯集成微调(BEFT),作为一种RAG框架,其中语言模型基于单独检索到的文档而非单个组合上下文来进行条件生成。BERAG将文档的后验概率视为集成权重,并在生成过程中通过贝叶斯规则逐个符号进行更新。这种方法实现了概率重排名、并行内存使用和文档贡献的清晰归属,使其非常适合处理大型文档集合。我们主要在基于知识的视觉问答任务上评估BERAG和BEFT,其中模型必须在长且不完美的检索列表上进行推理。结果显示,相对于标准RAG,取得了显著的提升,包括在文档视觉问答和多模态“针在干草堆”基准测试上有明显的进步。我们还证明了BERAG能够减轻“迷失于中间”效应。文档后验可以用来检测不足的基础并触发偏转,而文档剪枝使得解码速度比标准RAG更快。
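The token-by-token Bayes update described in this abstract can be illustrated with a small mixture model: each retrieved document yields its own next-token distribution, the ensemble prediction is their weighted average, and after a token is emitted each document's weight is rescaled by how well it predicted that token. The function names and toy probabilities are hypothetical; this is a generic Bayesian-mixture sketch, not the BERAG code.

```python
def mixture_next_token(weights, per_doc_probs):
    """Ensemble next-token distribution: sum over documents of w_d * p(token | d)."""
    vocab = per_doc_probs[0].keys()
    return {t: sum(w * p[t] for w, p in zip(weights, per_doc_probs)) for t in vocab}

def update_weights(weights, per_doc_probs, token):
    """Bayes' rule: the posterior weight of each document is proportional to
    its prior weight times the probability it assigned to the emitted token."""
    unnorm = [w * p[token] for w, p in zip(weights, per_doc_probs)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two retrieved documents with equal prior weight.
weights = [0.5, 0.5]
per_doc_probs = [
    {"yes": 0.9, "no": 0.1},  # model conditioned on document 0
    {"yes": 0.3, "no": 0.7},  # model conditioned on document 1
]

mix = mixture_next_token(weights, per_doc_probs)
posterior = update_weights(weights, per_doc_probs, "yes")
```

After emitting "yes", document 0 carries more weight, which is the kind of signal that can be read off for attribution or used to drop documents whose weight becomes negligible.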
cs.CL / 42 / 2604.22693

CRAFT: Clustered Regression for Adaptive Filtering of Training data

CRAFT:用于训练数据自适应过滤的聚类回归
Panda, Parthasarathi, Swain, Asheswari, Panda, Subhrakanta
Abstract
Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8× speedup.
Chinese Translation
随着语料库的规模增长至数千万数据点,从中选择一个小而高质量的子集进行微调变得愈发重要,这使得全面微调变得昂贵且通常不必要。我们提出了CRAFT(Clustered Regression for Adaptive Filtering of Training data),这是一种与向量化无关的训练序列到序列模型的选择方法。CRAFT将源-目标联合分布进行分解,并执行两阶段选择:(i) 通过在k-means聚类中按比例分配预算来匹配验证源分布,(ii) 在每个源聚类内,选择目标嵌入最小化从验证目标分布派生的条件期望距离的训练对。我们证明了比例聚类分配界定了选定分布与验证分布之间的连续KL散度,其残差受聚类直径的控制。我们在英语-印地语翻译任务上评估CRAFT,通过从3300万个NLLB句子对中选择训练数据,并通过LoRA微调mBART。CRAFT达到了43.34的BLEU分数,相较于相同候选池和编码器的TSDS(41.21分)提高了2.13分,同时选择过程的速度超过40倍。在使用TF-IDF向量化的情况下,整个流程在CPU上完成不到一分钟。TAROT的BLEU分数为45.61,但CRAFT选择完成的时间为26.86秒,而TAROT则需要75.6秒,实现了2.8倍的加速。
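The two-stage selection described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: cluster labels are assumed to be precomputed (for example by k-means), the per-pair `scores` stand in for the conditional expected distance (lower is better), largest-remainder rounding is one possible way to make the proportional budget integral, and all names and toy values are hypothetical.

```python
from collections import Counter


def allocate_budget(val_clusters, budget):
    """Stage (i): split `budget` across clusters in proportion to validation counts."""
    counts = Counter(val_clusters)
    n = len(val_clusters)
    exact = {c: budget * k / n for c, k in counts.items()}
    alloc = {c: int(x) for c, x in exact.items()}  # floor of each share
    leftover = budget - sum(alloc.values())
    # Largest-remainder rounding: give spare slots to the biggest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


def select_pairs(train_clusters, scores, alloc):
    """Stage (ii): within each cluster, keep the lowest-scoring training pairs."""
    chosen = []
    for c, k in alloc.items():
        members = [i for i, tc in enumerate(train_clusters) if tc == c]
        members.sort(key=lambda i: scores[i])
        chosen.extend(members[:k])
    return sorted(chosen)


# Hypothetical toy data: cluster ids for validation sources and training sources.
val_clusters = [0, 0, 0, 1, 1, 2]
train_clusters = [0, 0, 0, 1, 1, 2, 2]
scores = [0.9, 0.1, 0.5, 0.3, 0.2, 0.7, 0.4]  # assumed distance proxy per training pair

alloc = allocate_budget(val_clusters, budget=4)
selected = select_pairs(train_clusters, scores, alloc)
```

Keeping the allocation step separate from the scoring step mirrors the decomposition in the abstract: the first stage only looks at source-side clusters, the second only at target-side distances within a cluster.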
cs.CL / 43 / 2604.22709

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

无言思考:高效的抽象思维潜在推理
Ramji, Keshav, Naseem, Tahira, Astudillo, Ramón Fernandez
Abstract
While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose $\textbf{Abstract Chain-of-Thought}$, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to $11.6\times$ fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
Chinese Translation
尽管长而明确的思维链(Chain-of-Thought,CoT)在复杂推理任务中证明了其有效性,但在推理过程中生成这些思维链的成本较高。非语言推理方法通过利用连续表示实现了更短的生成长度,但其性能仍落后于语言化的 CoT。我们提出了 $\textbf{抽象思维链}$(Abstract Chain-of-Thought),这是一种离散潜在推理后训练机制,其中语言模型在生成响应之前,从保留的词汇表中生成一短串词元(token),而不是自然语言的 CoT。为了使先前未见的“抽象”词元变得有用,我们引入了一种类似策略迭代的热身循环,交替进行(i)通过掩蔽对语言化 CoT 施加瓶颈并进行监督微调,以及(ii)自蒸馏,即训练模型在码本约束解码下仅从提示生成抽象词元。热身后,我们在约束解码下,利用热启动的强化学习优化抽象序列的生成。抽象思维链(Abstract-CoT)在数学推理、指令遵循和多跳推理等方面表现出相当的性能,同时推理词元数量最多减少 $11.6\times$,并能够泛化到不同的语言模型家族。我们还发现抽象词汇表上出现了类似于自然语言的幂律分布,并在训练阶段不断演变。我们的研究结果强调了后训练潜在推理机制的潜力,这些机制通过学习的抽象推理语言实现高效推理。
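Constrained decoding over a reserved vocabulary, as mentioned in this abstract, can be illustrated with a generic logit-masking step. This is only a sketch of the general technique with greedy selection, assuming the reserved "abstract" tokens are identified by a set of token ids; the function name, the ids, and the toy logits are hypothetical and not taken from the paper.

```python
import math


def constrained_argmax(logits, allowed_ids):
    """Pick the highest-scoring token id among `allowed_ids` only.

    Every disallowed position is masked to -inf before the argmax, so
    ordinary vocabulary tokens can never be chosen at this step.
    """
    allowed = set(allowed_ids)
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical toy step: ids 2 and 3 play the role of reserved abstract tokens.
logits = [5.0, 1.0, 3.0, 2.0]
abstract_ids = {2, 3}
next_id = constrained_argmax(logits, abstract_ids)
```

Here token 0 has the highest raw logit, but the mask restricts the choice to the reserved ids, so the step returns id 2.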
cs.CL / 44 / 2604.22749

Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

大语言模型生成叙事中对全球多数国籍的表现性伤害
Nguyen, Ilana, Suresh, Harini, Monroe-White, Thema, Shieh, Evan
Abstract
Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., "American") are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.
Chinese Translation
大语言模型(LLMs)正日益被用于从日常使用到高风险企业和政府应用的文本生成任务,包括与寻求庇护者的模拟访谈。虽然许多研究强调了LLMs的新潜在应用,但LLMs也存在编码并延续针对全球非主流群体的有害偏见的风险。为了更好地评估和减轻这些伤害,需要更多研究来考察LLMs如何描绘多样化的个体。在本研究中,我们分析了被广泛采用的大语言模型在回应开放式叙事生成提示时如何表现国籍身份。我们的发现显示,按国籍划分的表现性伤害持续存在,包括有害的刻板印象、抹除以及对全球多数身份的单一维度描绘。被边缘化的国家身份在权力中立的故事中代表性不足,同时在从属角色的描绘中被过度呈现,后者出现的可能性是主导角色描绘的五十倍以上。当输入提示中出现美国国籍提示(例如,“美国人”)时,伤害程度进一步加剧。值得注意的是,我们发现所识别出的伤害不能用阿谀奉承来解释,因为即便将提示中的美国国籍提示替换为非美国国籍身份,以美国为中心的偏见依然存在。基于我们的研究结果,我们呼吁采用以全球多数视角为中心的方法进一步探讨LLMs中的文化伤害,并对不加批判地采用美国本土LLMs来对地球上大多数人口进行分类、监视和错误呈现的做法提出质疑。
cs.CL / 45 / 2604.22750

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

人工智能代理如何花费您的资金?分析和预测代理编码任务中的令牌消耗
Bai, Longju, Huang, Zhemin, Wang, Xingyao, Sun, Jiao, Mihalcea, Rada, Brynjolfsson, Erik, Pentland, Alex, Pei, Jiaxin
Abstract
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
Chinese Translation
人工智能代理在复杂人类工作流程中的广泛采用正在推动大规模语言模型(LLM)令牌消耗的迅速增长。当代理被部署在需要大量令牌的任务上时,自然会出现三个问题:(1)人工智能代理在哪里花费令牌?(2)哪些模型更具令牌效率?以及(3)代理在任务执行之前能否预测其令牌使用?在本文中,我们首次系统性地研究了代理编码任务中的令牌消耗模式。我们分析了来自八个前沿 LLM 在 SWE-bench Verified 上的轨迹,并评估模型在任务执行之前预测自身令牌成本的能力。我们的发现包括:(1)代理任务的成本极高,消耗的令牌数量比代码推理和代码聊天多出1000倍,整体成本主要由输入令牌而非输出令牌驱动;(2)令牌使用高度可变且固有随机:对同一任务的运行在总令牌数上可相差高达30倍,且更高的令牌使用并不直接对应更高的准确率;相反,准确率往往在中等成本时达到峰值,随后在更高的成本下饱和;(3)在令牌效率方面,模型之间存在显著差异:在相同任务下,Kimi-K2 和 Claude-Sonnet-4.5 的平均令牌消耗比 GPT-5 多出超过150万;(4)人类专家对任务难度的评分与实际令牌成本仅弱相关,揭示了人类感知复杂性与代理实际消耗的计算努力之间的根本差距;(5)前沿模型未能准确预测自身的令牌使用(相关性较弱至中等,最高为0.39),并系统性地低估了实际的令牌成本。我们的研究为人工智能代理的经济学提供了新的见解,并可以激发未来在该方向的研究。