Daily Research Digest

arXiv Papers

2026-02-12
186 papers / 4 categories / 186 translated
Robotics (32 papers)
cs.RO / 1 / 2602.10285

Adaptive Time Step Flow Matching for Autonomous Driving Motion Planning

Trivedi, Ananya, Li, Anjian, Elnoor, Mohamed, Ciftci, Yusuf Umut, Singh, Avinash, D'sa, Jovin, Bae, Sangjae, Isele, David, Padir, Taskin, Tariq, Faizan M.
Abstract
Autonomous driving requires reasoning about interactions with surrounding traffic. A prevailing approach is large-scale imitation learning on expert driving datasets, aimed at generalizing across diverse real-world scenarios. For online trajectory generation, such methods must operate at real-time rates. Diffusion models require hundreds of denoising steps at inference, resulting in high latency. Consistency models mitigate this issue but rely on carefully tuned noise schedules to capture the multimodal action distributions common in autonomous driving. Adapting the schedule typically requires expensive retraining. To address these limitations, we propose a framework based on conditional flow matching that jointly predicts future motions of surrounding agents and plans the ego trajectory in real time. We train a lightweight variance estimator that selects the number of inference steps online, removing the need for retraining to balance runtime and imitation learning performance. To further enhance ride quality, we introduce a trajectory post-processing step cast as a convex quadratic program, with negligible computational overhead. Trained on the Waymo Open Motion Dataset, the framework performs maneuvers such as lane changes, cruise control, and navigating unprotected left turns without requiring scenario-specific tuning. Our method maintains a 20 Hz update rate on an NVIDIA RTX 3070 GPU, making it suitable for online deployment. Compared to transformer, diffusion, and consistency model baselines, we achieve improved trajectory smoothness and better adherence to dynamic constraints. Experiment videos and code implementations can be found at https://flow-matching-self-driving.github.io/.
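As a toy illustration of the adaptive-step idea (not the authors' implementation: the linear `velocity_field`, the `variance_to_steps` mapping, and all names are hypothetical), the sketch below integrates a flow-matching ODE with an Euler loop whose step budget is chosen online from a variance estimate:

```python
import numpy as np

def velocity_field(x, t):
    # Stand-in for the learned conditional flow-matching network: a toy
    # field whose integral curve reaches `target` at t = 1.
    target = np.array([1.0, 0.5])
    return (target - x) / max(1.0 - t, 1e-3)

def variance_to_steps(sigma, min_steps=2, max_steps=20):
    # Hypothetical mapping from a predicted variance to an inference
    # budget: higher estimated multimodality -> more integration steps.
    frac = min(max(sigma, 0.0), 1.0)
    return int(round(min_steps + frac * (max_steps - min_steps)))

def sample_trajectory(sigma_hat, x0=None):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with an
    # online-chosen number of Euler steps.
    n_steps = variance_to_steps(sigma_hat)
    x = np.zeros(2) if x0 is None else np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_field(x, i * dt)
    return x, n_steps
```

With this toy field any budget reaches the target; in the real system the budget trades latency against how faithfully a multimodal action distribution is sampled.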
cs.RO / 2 / 2602.10289

A Human-in-the-Loop Confidence-Aware Failure Recovery Framework for Modular Robot Policies

Banerjee, Rohan, Palempalli, Krishna, Yang, Bohan, Fang, Jiaying, Abdullah, Alif, Silver, Tom, Dean, Sarah, Bhattacharjee, Tapomayukh
Abstract
Robots operating in unstructured human environments inevitably encounter failures, especially in robot caregiving scenarios. While humans can often help robots recover, excessive or poorly targeted queries impose unnecessary cognitive and physical workload on the human partner. We present a human-in-the-loop failure-recovery framework for modular robotic policies, where a policy is composed of distinct modules such as perception, planning, and control, any of which may fail and often require different forms of human feedback. Our framework integrates calibrated estimates of module-level uncertainty with models of human intervention cost to decide which module to query and when to query the human. It separates these two decisions: a module selector identifies the module most likely responsible for failure, and a querying algorithm determines whether to solicit human input or act autonomously. We evaluate several module-selection strategies and querying algorithms in controlled synthetic experiments, revealing trade-offs between recovery efficiency, robustness to system and user variables, and user workload. Finally, we deploy the framework on a robot-assisted bite acquisition system and demonstrate, in studies involving individuals with both emulated and real mobility limitations, that it improves recovery success while reducing the workload imposed on users. Our results highlight how explicitly reasoning about both robot uncertainty and human effort can enable more efficient and user-centered failure recovery in collaborative robots. Supplementary materials and videos can be found at: http://emprise.cs.cornell.edu/modularhil
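A minimal sketch of the two separated decisions, under assumed expected-cost reasoning (the thresholds, cost model, and module names are illustrative, not the paper's):

```python
def select_module(confidences):
    # Module selector: attribute the failure to the module with the
    # lowest calibrated confidence.
    return min(confidences, key=confidences.get)

def should_query(confidence, intervention_cost, failure_cost=1.0):
    # Querying rule: ask the human only if the expected cost of acting
    # autonomously (failure probability times failure cost) exceeds the
    # modeled cost of the intervention.
    return (1.0 - confidence) * failure_cost > intervention_cost
```

A usage example: with `{"perception": 0.4, "planning": 0.9, "control": 0.85}`, the selector blames perception, and a query is issued only while the expected failure cost outweighs the human's workload cost.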
cs.RO / 3 / 2602.10365

Solving Geodesic Equations with Composite Bernstein Polynomials for Trajectory Planning

Gorman, Nick, MacLin, Gage, Hammond, Maxwell, Cichella, Venanzio
Abstract
This work presents a trajectory planning method based on composite Bernstein polynomials for autonomous systems navigating complex environments. The method is implemented in a symbolic optimization framework that enables continuous paths and precise control over trajectory shape. Trajectories are planned over a cost surface that encodes obstacles as continuous fields rather than discrete boundaries. Regions near obstacles are assigned higher costs, naturally encouraging the trajectory to maintain a safe distance while still allowing efficient routing through constrained spaces. The use of composite Bernstein polynomials preserves continuity while enabling fine control over local curvature to satisfy geodesic constraints. The symbolic representation supports exact derivatives, improving optimization efficiency. The method applies to both two- and three-dimensional environments and is suitable for ground, aerial, underwater, and space systems. In spacecraft trajectory planning, for example, it enables the generation of continuous, dynamically feasible trajectories with high numerical efficiency, making it well suited for orbital maneuvers, rendezvous and proximity operations, cluttered gravitational environments, and planetary exploration missions with limited onboard computational resources. Demonstrations show that the approach efficiently generates smooth, collision-free paths in scenarios with multiple obstacles, maintaining clearance without extensive sampling or post-processing. The optimization incorporates three constraint types: (1) a Gaussian surface inequality enforcing minimum obstacle clearance; (2) geodesic equations guiding the path along locally efficient directions on the cost surface; and (3) boundary constraints enforcing fixed start and end conditions. The method can serve as a standalone planner or as an initializer for more complex motion planning problems.
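For intuition, a minimal sketch of evaluating a composite Bernstein curve (scalar control points, unit-length segments; the paper's symbolic optimization and geodesic constraints are not reproduced here):

```python
from math import comb

def bernstein(control_points, t):
    # Evaluate a Bernstein polynomial of degree n at t in [0, 1].
    n = len(control_points) - 1
    return sum(comb(n, i) * (1 - t) ** (n - i) * t ** i * p
               for i, p in enumerate(control_points))

def composite_bernstein(segments, s):
    # Evaluate a composite curve: the integer part of s selects the
    # segment, the fractional part is the local parameter. C0 continuity
    # holds when each segment's last control point equals the next
    # segment's first.
    k = min(int(s), len(segments) - 1)
    return bernstein(segments[k], s - k)
```

Matching end/start control points across segments is what preserves continuity while leaving each segment's interior points free to shape local curvature.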
cs.RO / 4 / 2602.10399

LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies

Nahrendra, I Made Aswin, Lee, Seunghyun, Lee, Dongkyu, Myung, Hyun
Abstract
Recent advances in legged locomotion learning are still dominated by the utilization of geometric representations of the environment, limiting the robot's capability to respond to higher-level semantics such as human instructions. To address this limitation, we propose a novel approach that integrates high-level commonsense reasoning from foundation models into the process of legged locomotion adaptation. Specifically, our method utilizes a pre-trained large language model to synthesize an instruction-grounded skill database tailored for legged robots. A pre-trained vision-language model is employed to extract high-level environmental semantics and ground them within the skill database, enabling real-time skill advisories for the robot. To facilitate versatile skill control, we train a style-conditioned policy capable of generating diverse and robust locomotion skills with high fidelity to specified styles. To the best of our knowledge, this is the first work to demonstrate real-time adaptation of legged locomotion using high-level reasoning from environmental semantics and instructions, achieving instruction-following accuracy of up to 87% without online queries to cloud-hosted foundation models.
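A toy sketch of grounding semantics and an instruction in a skill database (the tag-overlap scoring and the miniature database are invented for illustration; the paper's database is LLM-synthesized and its grounding uses a VLM):

```python
def ground_instruction(semantics, instruction, skill_db):
    # Score each skill by how many of its tags overlap the observed
    # scene semantics plus the instruction words; return the best match.
    context = set(semantics) | set(instruction.lower().split())
    return max(skill_db, key=lambda skill: len(context & set(skill_db[skill])))

# Hypothetical miniature skill database; real entries would map
# instructions and semantics to style-conditioned policy parameters.
skill_db = {
    "crawl": ["low", "ceiling", "under", "duck"],
    "trot": ["flat", "ground", "walk", "forward"],
    "climb": ["stairs", "step", "up"],
}
```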
cs.RO / 5 / 2602.10503

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

Liu, Yuan, Li, Haoran, Tian, Shuai, Qin, Yuxing, Chen, Yuhui, Zheng, Yupeng, Huang, Yongzhen, Zhao, Dongbin
Abstract
Pretrained on large-scale and diverse datasets, VLA models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), which serves as the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models independent of online environmental feedback and pre-trained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed Multi-Dimensional Process Reward (MDPR) mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs.
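A hedged sketch of the three-dimensional reward structure (the stand-in scoring functions and weights are assumptions; the paper's QACR, CTAR, and FCR operate on VLA token outputs and decoded action chunks):

```python
import math

def quantized_action_reward(pred_tokens, ref_tokens):
    # QACR stand-in: fraction of discrete action tokens that match the
    # reference within an action chunk.
    return sum(p == r for p, r in zip(pred_tokens, ref_tokens)) / len(ref_tokens)

def trajectory_alignment_reward(pred_traj, ref_traj, scale=1.0):
    # CTAR stand-in: exp(-mean squared error) between decoded continuous
    # action values and the reference trajectory.
    mse = sum((p - r) ** 2 for p, r in zip(pred_traj, ref_traj)) / len(ref_traj)
    return math.exp(-scale * mse)

def format_reward(output):
    # FCR stand-in: reward only structurally valid outputs.
    return 1.0 if output.startswith("<action>") and output.endswith("</action>") else 0.0

def mdpr(pred_tokens, ref_tokens, pred_traj, ref_traj, output, w=(0.4, 0.4, 0.2)):
    # Weighted combination of the three process-reward dimensions.
    return (w[0] * quantized_action_reward(pred_tokens, ref_tokens)
            + w[1] * trajectory_alignment_reward(pred_traj, ref_traj)
            + w[2] * format_reward(output))
```

A perfect chunk scores the full weighted sum; each dimension degrades independently, which is what lets the reward attribute credit to intermediate chunks.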
cs.RO / 6 / 2602.10514

Co-jump: Cooperative Jumping with Quadrupedal Robots via Multi-Agent Reinforcement Learning

Dong, Shihao, Chen, Yeke, Luo, Zeren, Zhang, Jiahui, Xu, Bowen, Lin, Jinghan, Han, Yimin, Ma, Ji, Yu, Zhiyou, Zhao, Yudong, Lu, Peng
Abstract
While single-agent legged locomotion has witnessed remarkable progress, individual robots remain fundamentally constrained by physical actuation limits. To transcend these boundaries, we introduce Co-jump, a cooperative task where two quadrupedal robots synchronize to execute jumps far beyond their solo capabilities. We tackle the high-impulse contact dynamics of this task under a decentralized setting, achieving synchronization without explicit communication or pre-specified motion primitives. Our framework leverages Multi-Agent Proximal Policy Optimization (MAPPO) enhanced by a progressive curriculum strategy, which effectively overcomes the sparse-reward exploration challenges inherent in mechanically coupled systems. We demonstrate robust performance in simulation and successful transfer to physical hardware, executing multi-directional jumps onto platforms up to 1.5 m in height. Specifically, one of the robots achieves a foot-end elevation of 1.1 m, which represents a 144% improvement over the 0.45 m jump height of a standalone quadrupedal robot, demonstrating superior vertical performance. Notably, this precise coordination is achieved solely through proprioceptive feedback, establishing a foundation for communication-free collaborative locomotion in constrained environments.
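A minimal sketch of a progressive curriculum of the kind described, with made-up thresholds and step size (the paper's schedule may differ): the target platform height rises as the joint success rate improves, keeping the sparse jump reward reachable.

```python
def curriculum_height(success_rate, current, step=0.1, max_h=1.5,
                      promote=0.8, demote=0.3):
    # Raise the target height when the two agents' joint success rate is
    # high, lower it when low, and hold it otherwise.
    if success_rate >= promote:
        return min(current + step, max_h)
    if success_rate <= demote:
        return max(current - step, step)
    return current
```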
cs.RO / 7 / 2602.10547

ReSPEC: A Framework for Online Multispectral Sensor Reconfiguration in Dynamic Environments

Liu, Yanchen, Fan, Yuang, Zhao, Minghui, Jiang, Xiaofan
Abstract
Multi-sensor fusion is central to robust robotic perception, yet most existing systems operate under static sensor configurations, collecting all modalities at fixed rates and fidelity regardless of their situational utility. This rigidity wastes bandwidth, computation, and energy, and prevents systems from prioritizing sensors under challenging conditions such as poor lighting or occlusion. Recent advances in reinforcement learning (RL) and modality-aware fusion suggest the potential for adaptive perception, but prior efforts have largely focused on re-weighting features at inference time, ignoring the physical cost of sensor data collection. We introduce a framework that unifies sensing, learning, and actuation into a closed reconfiguration loop. A task-specific detection backbone extracts multispectral features (e.g. RGB, IR, mmWave, depth) and produces quantitative contribution scores for each modality. These scores are passed to an RL agent, which dynamically adjusts sensor configurations, including sampling frequency, resolution, and sensing range, in real time. Less informative sensors are down-sampled or deactivated, while critical sensors are sampled at higher fidelity as environmental conditions evolve. We implement and evaluate this framework on a mobile rover, showing that adaptive control reduces GPU load by 29.3% with only a 5.3% accuracy drop compared to a heuristic baseline. These results highlight the potential of resource-aware adaptive sensing for embedded robotic platforms.
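A rough sketch of score-driven reconfiguration (the proportional allocation rule, threshold, and rates are assumptions standing in for the paper's RL agent):

```python
def reconfigure(scores, total_rate=60, max_rate=30, min_score=0.1):
    # Deactivate modalities whose contribution score falls below a
    # threshold, then split a sampling-rate budget (Hz) among the rest
    # in proportion to their scores, capped per sensor. Rounding means
    # the allocation is approximate, not an exact partition of the budget.
    active = {m: s for m, s in scores.items() if s >= min_score}
    total = sum(active.values())
    rates = {m: 0 for m in scores}
    for m, s in active.items():
        rates[m] = min(max_rate, round(total_rate * s / total))
    return rates
```

For example, with scores `{"rgb": 0.5, "ir": 0.3, "mmwave": 0.15, "depth": 0.05}` the low-utility depth sensor is shut off and the remaining budget concentrates on RGB and IR.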
cs.RO / 8 / 2602.10556

LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer

Zha, Lihan, Hancock, Asher J., Zhang, Mingtong, Yin, Tenny, Huang, Yixuan, Shah, Dhruv, Ren, Allen Z., Majumdar, Anirudha
Abstract
A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
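To make the core idea concrete, a toy round-trip between a numeric action and a natural-language rendering (the text schema below is invented for illustration; LAP's actual action format is defined in the paper):

```python
def action_to_text(action):
    # Express a low-level end-effector action directly in language so
    # that action supervision matches the VLM's text interface.
    dx, dy, dz, grip = action
    return (f"move x {dx:+.2f} y {dy:+.2f} z {dz:+.2f}, "
            f"gripper {'close' if grip else 'open'}")

def text_to_action(text):
    # Parse the language-formatted action back into numbers.
    parts = text.replace(",", "").split()
    return (float(parts[2]), float(parts[4]), float(parts[6]),
            parts[-1] == "close")
```

Because the supervision stays in plain text, no learned action tokenizer is needed and VQA data can be co-trained in the same format.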
cs.RO / 9 / 2602.10561

Morphogenetic Assembly and Adaptive Control for Heterogeneous Modular Robots

Meng, Chongxi, Zhao, Da, Zhao, Yifei, Zeng, Minghao, Zhou, Yanmin, Wang, Zhipeng, He, Bin
Abstract
This paper presents a closed-loop automation framework for heterogeneous modular robots, covering the full pipeline from morphological construction to adaptive control. In this framework, a mobile manipulator handles heterogeneous functional modules including structural, joint, and wheeled modules to dynamically assemble diverse robot configurations and provide them with immediate locomotion capability. To address the state-space explosion in large-scale heterogeneous reconfiguration, we propose a hierarchical planner: the high-level planner uses a bidirectional heuristic search with type-penalty terms to generate module-handling sequences, while the low-level planner employs A* search to compute optimal execution trajectories. This design effectively decouples discrete configuration planning from continuous motion execution. For adaptive motion generation of unknown assembled configurations, we introduce a GPU-accelerated Annealing-Variance Model Predictive Path Integral (MPPI) controller. By incorporating a multi-stage variance annealing strategy to balance global exploration and local convergence, the controller enables configuration-agnostic, real-time motion control. Large-scale simulations show that the type-penalty term is critical for planning robustness in heterogeneous scenarios. Moreover, the greedy heuristic produces plans with lower physical execution costs than the Hungarian heuristic. The proposed annealing-variance MPPI significantly outperforms standard MPPI in both velocity tracking accuracy and control frequency, achieving real-time control at 50 Hz. The framework validates the full-cycle process, including module assembly, robot merging and splitting, and dynamic motion generation.
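A minimal sketch of annealing-variance MPPI on a static cost (the stage schedule, temperature, and sample count are illustrative assumptions; the real controller rolls out dynamics over a horizon rather than scoring a fixed control vector):

```python
import numpy as np

def mppi_annealed(cost_fn, u_init, stages=(1.0, 0.5, 0.1),
                  n_samples=256, lam=1.0, seed=0):
    # At each stage, sample control perturbations with a shrinking
    # standard deviation and take the softmin-weighted average. Early
    # (large-variance) stages explore globally; late stages refine.
    rng = np.random.default_rng(seed)
    u = np.asarray(u_init, dtype=float)
    for sigma in stages:
        noise = rng.normal(0.0, sigma, size=(n_samples,) + u.shape)
        candidates = u + noise
        costs = np.array([cost_fn(c) for c in candidates])
        weights = np.exp(-(costs - costs.min()) / lam)
        weights /= weights.sum()
        u = (weights[:, None] * candidates).sum(axis=0)
    return u
```

The batched candidate evaluation is what the GPU accelerates in the paper's setting, letting the annealed variant keep a 50 Hz loop.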
cs.RO / 10 / 2602.10594

Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning

Tang, Runze, Sweetser, Penny
Abstract
Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large amounts of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation to enable the use of human videos as a substitute, thereby reducing the amount of required robot demonstrations. However, most prior work has focused on the flow, either on the object or on specific points of the robot/hand, which cannot describe the motion of interaction. Meanwhile, relying on flow to achieve generalization to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observation to produce precise actions may cause the flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts any point trajectories. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.
cs.RO / 11 / 2602.10610

Pitch Angle Control of a Magnetically Actuated Capsule Robot with Nonlinear FEA-based MPC and EKF Multisensory Fusion

Wang, Chongxun, Shen, Zikang, Rathore, Apoorav, Udombeh, Akanimoh, Teng, Harrison, Xia, Fangzhou
Abstract
Magnetically actuated capsule robots promise minimally invasive diagnosis and therapy in the gastrointestinal (GI) tract, but existing systems largely neglect control of capsule pitch, a degree of freedom critical for contact-rich interaction with inclined gastric walls. This paper presents a nonlinear, model-based framework for magnetic pitch control of an ingestible capsule robot actuated by a four-coil electromagnetic array. Angle-dependent magnetic forces and torques acting on embedded permanent magnets are characterized using three-dimensional finite-element simulations and embedded as lookup tables in a control-oriented rigid-body pitching model with rolling contact and actuator dynamics. A constrained model predictive controller (MPC) is designed to regulate pitch while respecting hardware-imposed current and slew-rate limits. Experiments on a compliant stomach-inspired surface demonstrate robust pitch reorientation from both horizontal and upright configurations, achieving about three to five times faster settling and reduced oscillatory motion than on-off control. Furthermore, an extended Kalman filter (EKF) fusing inertial sensing with intermittent visual measurements enables stable closed-loop control when the camera update rate is reduced from 30 Hz to 1 Hz, emulating clinically realistic imaging constraints. These results establish finite-element-informed MPC with sensor fusion as a scalable strategy for pitch regulation, controlled docking, and future multi-degree-of-freedom capsule locomotion.
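The intermittent-vision fusion can be illustrated with a one-dimensional Kalman filter on the pitch angle (a scalar stand-in for the paper's EKF; noise values and the measurement schedule are made up): the gyro rate is integrated every step, and a camera angle is fused only when one arrives.

```python
def ekf_pitch(gyro_rates, cam_meas, dt=0.01, q=1e-4, r=1e-2):
    # Minimal 1-D filter: predict pitch from the gyro rate each step;
    # correct with a camera measurement on the steps where cam_meas[k]
    # is not None (emulating a reduced camera update rate).
    theta, p = 0.0, 1.0
    estimates = []
    for rate, z in zip(gyro_rates, cam_meas):
        theta += rate * dt          # predict with inertial sensing
        p += q
        if z is not None:           # intermittent visual correction
            k = p / (p + r)
            theta += k * (z - theta)
            p *= (1.0 - k)
        estimates.append(theta)
    return estimates
```

Between camera frames the covariance `p` grows, so each sparse visual fix is weighted more heavily; this is what keeps the loop stable when updates drop from 30 Hz to 1 Hz.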
cs.RO / 12 / 2602.10686

Free-Flying Crew Cooperative Robots on the ISS: A Joint Review of Astrobee, CIMON, and Int-Ball Operations

Yamaguchi, Seiko Piotr, Vargas, Andres Mora, Eisenberg, Till, Rogon, Christian, Yamamoto, Tatsuya, Inoue, Shona, Kössl, Christoph, Coltin, Brian, Smith, Trey, Benavides, Jose V.
Abstract
Intra-vehicular free-flying robots are anticipated to support various work in human spaceflight while working side-by-side with astronauts. Examples of such robots include NASA's Astrobee, DLR's CIMON, and JAXA's Int-Ball, which are deployed on the International Space Station. This paper presents the first joint analysis of these robots' shared experiences, co-authored by members of their development and operation teams. Despite their different origins and design philosophies, the development and operations of these platforms converged in many respects. This paper therefore presents a detailed overview of these robots, covering their objectives, design, and onboard operations, followed by joint lessons learned across the lifecycle, from design to on-orbit operations. These lessons are anticipated to serve future development and research as design recommendations.
cs.RO / 13 / 2602.10688

3D-Printed Anisotropic Soft Magnetic Coating for Directional Rolling of a Magnetically Actuated Capsule Robot

Zhou, Jin, Wang, Chongxun, Shen, Zikang, Xia, Fangzhou
Abstract
Capsule robots are promising tools for minimally invasive diagnostics and therapy, with applications from gastrointestinal endoscopy to targeted drug delivery and biopsy sampling. Conventional magnetic capsule robots embed bulky permanent magnets at both ends, reducing the usable cavity by about 10-20 mm and limiting integration of functional modules. We propose a compact, 3D-printed soft capsule robot with a magnetic coating that replaces internal magnets, enabling locomotion via a thin, functional shell while preserving the entire interior cavity as a continuous volume for medical payloads. The compliant silicone-magnetic composite also improves swallowability, even with a slightly larger capsule size. Magnetostatic simulations and experiments confirm that programmed NSSN/SNNS pole distributions provide strong anisotropy and reliable torque generation, enabling stable bidirectional rolling, omnidirectional steering, climbing on 7.5 degree inclines, and traversal of 5 mm protrusions. Rolling motion is sustained when the magnetic field at the capsule reaches at least 0.3 mT, corresponding to an effective actuation depth of 30 mm in our setup. Future work will optimize material composition, coating thickness, and magnetic layouts to enhance force output and durability, while next-generation robotic-arm-based field generators with closed-loop feedback will address nonlinearities and expand maneuverability. Together, these advances aim to transition coating-based capsule robots toward reliable clinical deployment and broaden their applications in minimally invasive diagnostics and therapy.
cs.RO / 14 / 2602.10702

A Unified Experimental Architecture for Informative Path Planning: from Simulation to Deployment with GuadalPlanner

Barrionuevo, Alejandro Mendoza, Diop, Dame Seck, Pérez, Alejandro Casado, Reina, Daniel Gutiérrez, Marín, Sergio L. Toral, Luis, Samuel Yanes
Abstract
The evaluation of informative path planning algorithms for autonomous vehicles is often hindered by fragmented execution pipelines and limited transferability between simulation and real-world deployment. This paper introduces a unified architecture that decouples high-level decision-making from vehicle-specific control, enabling algorithms to be evaluated consistently across different abstraction levels without modification. The proposed architecture is realized through GuadalPlanner, which defines standardized interfaces between planning, sensing, and vehicle execution. It is an open and extensible research tool that supports discrete graph-based environments and interchangeable planning strategies, and is built upon widely adopted robotics technologies, including ROS2, MAVLink, and MQTT. Its design allows the same algorithmic logic to be deployed in fully simulated environments, software-in-the-loop configurations, and physical autonomous vehicles using an identical execution pipeline. The approach is validated through a set of experiments, including real-world deployment on an autonomous surface vehicle performing water quality monitoring with real-time sensor feedback.
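A toy sketch of the decoupling: a planner interface whose implementations know nothing about the execution backend (class and method names are invented; GuadalPlanner's actual interfaces span ROS2/MAVLink/MQTT):

```python
from abc import ABC, abstractmethod

class Planner(ABC):
    # Standardized decision-making interface: the same planner object
    # can be driven by a simulator, a software-in-the-loop backend, or
    # a physical vehicle, because only the caller changes.
    @abstractmethod
    def next_waypoint(self, node, measurements):
        ...

class GreedyInformativePlanner(Planner):
    # Toy strategy for a discrete graph environment: move to the
    # neighbor with the highest expected information gain.
    def __init__(self, graph):
        self.graph = graph  # node -> list of (neighbor, expected_gain)

    def next_waypoint(self, node, measurements):
        return max(self.graph[node], key=lambda edge: edge[1])[0]
```

Swapping planning strategies then means swapping `Planner` subclasses, while the vehicle-side execution pipeline stays identical across abstraction levels.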
cs.RO / 15 / 2602.10703

Omnidirectional Dual-Arm Aerial Manipulator with Proprioceptive Contact Localization for Landing on Slanted Roofs

Brummelhuis, Martijn B. J., Lepora, Nathan F., Hamaza, Salua
Abstract
Operating drones in urban environments often means they need to land on rooftops, which can have different geometries and surface irregularities. Accurately detecting roof inclination using conventional sensing methods, such as vision-based or acoustic techniques, can be unreliable, as measurement quality is strongly influenced by external factors including weather conditions and surface materials. To overcome these challenges, we propose a novel unmanned aerial manipulator morphology featuring a dual-arm aerial manipulator with an omnidirectional 3D workspace and extended reach. Building on this design, we develop a proprioceptive contact detection and contact localization strategy based on a momentum-based torque observer. This enables the UAM to infer the inclination of slanted surfaces blindly - through physical interaction - prior to touchdown. We validate the approach in flight experiments, demonstrating robust landings on surfaces with inclinations of up to 30.5 degrees and achieving an average surface inclination estimation error of 2.87 degrees over 9 experiments at different incline angles.
Chinese Translation
在城市环境中操作无人机通常意味着它们需要在屋顶上着陆,而屋顶的几何形状和表面不规则性可能各不相同。使用传统传感方法(如基于视觉或声学的技术)准确检测屋顶倾斜度可能不可靠,因为测量质量受到天气条件和表面材料等外部因素的强烈影响。为了克服这些挑战,我们提出了一种新型无人机操控器形态,具备全向三维工作空间和扩展的操作范围的双臂空中操控器。在此设计基础上,我们开发了一种基于动量扭矩观察器的自感知接触检测和接触定位策略。这使得无人机能够在着陆前通过物理交互盲目推断倾斜表面的倾斜度。我们在飞行实验中验证了该方法,展示了在倾斜度高达30.5度的表面上稳健着陆,并在不同倾斜角度的9次实验中实现了平均表面倾斜度估计误差为2.87度。
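The momentum-based torque observer behind this proprioceptive contact scheme can be sketched for a single joint (a 1-DOF simplification; `momentum_observer`, its arguments, and the gain value are illustrative, not from the paper's code). The residual converges to the external torque without any dedicated force/torque sensor:

```python
import numpy as np

def momentum_observer(inertia, tau_cmd, omega, dt, gain=50.0):
    """Momentum-based residual for a single joint (1-DOF sketch).

    p_hat integrates the commanded torque plus the current residual;
    the residual r[k] is the gain times the momentum estimation error
    and converges to the external torque acting on the joint."""
    n = len(omega)
    r = np.zeros(n)
    p_hat = inertia * omega[0]
    for k in range(1, n):
        p_hat += (tau_cmd[k - 1] + r[k - 1]) * dt
        r[k] = gain * (inertia * omega[k] - p_hat)
    return r
```

With the residuals of several joints, the contact location and hence the surface inclination can then be inferred geometrically, which is the step the paper validates in flight.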
cs.RO / 16 / 2602.10717

Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

说、梦与行动:用于指令驱动机器人操控的视频世界模型学习
Gu, Songen, Cai, Yunuo, Wang, Tianyu, Wu, Simo, Fu, Yanwei
Abstract
Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Codes & Models will be released.
Chinese Translation
机器人操控需要预测环境如何根据动作演变,但现有大多数系统缺乏这种预测能力,常常导致错误和低效。尽管视觉-语言模型(VLMs)提供了高层次的指导,但它们无法明确预测未来状态,而现有的世界模型要么仅预测短期范围,要么生成空间不一致的帧。为了解决这些挑战,我们提出了一种快速且具有预测能力的视频条件动作框架。我们的方法首先选择并调整一个稳健的视频生成模型,以确保可靠的未来预测,然后应用对抗蒸馏技术实现快速的少步视频生成,最后训练一个动作模型,利用生成的视频和真实观察来纠正空间错误。大量实验表明,我们的方法生成了时间上连贯、空间上准确的视频预测,直接支持精确操控,在体现一致性、空间指称能力和任务完成度方面显著优于现有基线。代码和模型将会发布。
cs.RO / 17 / 2602.10719

From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

从表征互补性到双系统:协同视觉语言模型(VLM)与仅视觉骨干网络的端到端驾驶
Ang, Sining, Yang, Yuguang, Dang, Chenxu, Chen, Canyu, Chi, Cheng, Liu, Haiyan, Mao, Xuanyao, Bao, Jason, Xuliang, Sun, Bingchuan, Wang, Yan
Abstract
Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with a three-research-question (3-RQ) analysis in RecogDrive by instantiating the system with a full VLM and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces beyond those of the vision-only backbones. RQ2: This unique subspace leads to different behavior in some long-tail scenarios: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2--3% of test scenarios; with an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
Chinese Translation
视觉-语言-行动(VLA)驾驶通过语言支持的骨干网络增强了端到端(E2E)规划,但在通常的准确性与成本权衡之外,仍不清楚会发生什么变化。我们通过在相同的扩散 Transformer 规划器下,使用完整的 VLM 和仅视觉骨干网络重新审视这个问题,进行 3-RQ 分析。RQ1:在骨干网络层面,VLM 可以在仅视觉骨干网络的基础上引入额外的子空间。RQ2:这个独特的子空间在一些长尾场景中导致了不同的行为:VLM 更倾向于激进,而 ViT 更为保守,并且在大约 2-3% 的测试场景中各自取得决定性的胜利;通过一个在每个场景中选择 VLM 和 ViT 分支中更好轨迹的神谕,我们获得了 93.58 的 PDMS 上限。RQ3:为了充分利用这一观察结果,我们提出了 HybridDriveVLA,它同时运行 ViT 和 VLM 分支,并使用学习的评分器在它们的终点轨迹之间进行选择,将 PDMS 提升至 92.10。最后,DualDriveVLA 实现了一种实用的快-慢策略:默认运行 ViT,仅在评分器的置信度低于阈值时调用 VLM;在 15% 的场景中调用 VLM 达到 91.00 的 PDMS,同时提高了 3.2 倍的吞吐量。代码将会发布。
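The fast-slow dispatch that DualDriveVLA describes reduces to a confidence-gated selector. A minimal sketch with placeholder callables (`fast`, `slow`, and `scorer` are illustrative names, and the 0.7 threshold is an assumption, not the paper's value):

```python
def dual_policy(obs, fast, slow, scorer, threshold=0.7):
    """Fast-slow dispatch sketch: run the cheap vision-only branch
    first and fall back to the expensive VLM branch only when the
    learned scorer's confidence is below the threshold."""
    traj = fast(obs)
    conf = scorer(obs, traj)
    if conf < threshold:
        return slow(obs), conf  # invoke the VLM branch only when unsure
    return traj, conf
```

Tuning the threshold trades PDMS against throughput; the abstract's operating point invokes the slow branch on roughly 15% of scenarios.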
cs.RO / 18 / 2602.10904

Biomimetic Manta Ray Robot toward Underwater Autonomy -- Experimental Verification of Swimming and Diving by Flapping Motion

仿生魔鬼鱼机器人用于水下自主探索——拍打运动的游泳与潜水实验验证
Tabata, Kenta, Oku, Ryosuke, Ito, Jun, Miyagusuku, Renato, Ozaki, Koichi
Abstract
This study presents the development and experimental verification of a biomimetic manta ray robot for underwater autonomous exploration. Inspired by manta rays, the robot uses flapping motion for propulsion to minimize seabed disturbance and enhance efficiency compared to traditional screw propulsion. The robot features pectoral fins driven by servo motors and a streamlined control box to reduce fluid resistance. The control system, powered by a Raspberry Pi 3B, includes an IMU and pressure sensor for real-time monitoring and control. Experiments in a pool assessed the robot's swimming and diving capabilities. Results show stable swimming and diving motions with PD control. The robot is suitable for applications in environments like aquariums and fish nurseries, requiring minimal disturbance and efficient maneuverability. Our findings demonstrate the potential of bio-inspired robotic designs to improve ecological monitoring and underwater exploration.
Chinese Translation
本研究展示了一种仿生魔鬼鱼机器人的开发及其实验验证,旨在进行水下自主探索。该机器人受到魔鬼鱼的启发,采用拍打运动作为推进方式,以减少对海床的干扰并提高效率,相较于传统的螺旋推进方式更具优势。机器人配备由伺服电机驱动的胸鳍和流线型控制箱,以降低流体阻力。控制系统基于Raspberry Pi 3B构建,包含IMU和压力传感器,用于实时监测和控制。在水池中的实验评估了机器人的游泳和潜水能力。结果表明,机器人在PD控制下实现了稳定的游泳和潜水动作。该机器人适用于水族馆和鱼苗场等环境,能够实现最小干扰和高效机动。我们的研究结果展示了仿生机器人设计在改善生态监测和水下探索方面的潜力。
cs.RO / 19 / 2602.10910

Safe mobility support system using crowd mapping and avoidance route planning using VLM

基于VLM的人群映射与避障路线规划的安全移动支持系统
Saito, Sena, Tabata, Kenta, Miyagusuku, Renato, Ozaki, Koichi
Abstract
Autonomous mobile robots offer promising solutions for labor shortages and increased operational efficiency. However, navigating safely and effectively in dynamic environments, particularly crowded areas, remains challenging. This paper proposes a novel framework that integrates Vision-Language Models (VLM) and Gaussian Process Regression (GPR) to generate dynamic crowd-density maps (``Abstraction Maps'') for autonomous robot navigation. Our approach utilizes VLM's capability to recognize abstract environmental concepts, such as crowd densities, and represents them probabilistically via GPR. Experimental results from real-world trials on a university campus demonstrated that robots successfully generated routes avoiding both static obstacles and dynamic crowds, enhancing navigation safety and adaptability.
Chinese Translation
自主移动机器人为解决劳动力短缺和提高运营效率提供了有前景的解决方案。然而,在动态环境中,尤其是拥挤区域,安全有效地导航仍然具有挑战性。本文提出了一种新颖的框架,集成了视觉-语言模型(VLM)和高斯过程回归(GPR),以生成用于自主机器人导航的动态人群密度图(“抽象图”)。我们的方法利用VLM识别抽象环境概念(如人群密度)的能力,并通过GPR以概率方式表示这些概念。来自大学校园的实际试验结果表明,机器人成功生成了避开静态障碍物和动态人群的路线,从而增强了导航的安全性和适应性。
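Representing crowd density probabilistically with GPR, as the abstract describes, amounts to closed-form Gaussian-process regression over 2D map coordinates. A minimal numpy sketch (the kernel length-scale and noise level are assumed values, not the paper's):

```python
import numpy as np

def rbf(a, b, length=2.0, var=1.0):
    """Squared-exponential kernel between two point sets of shape (n, 2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length ** 2)

def gpr_predict(X, y, Xq, noise=1e-2):
    """Closed-form GP regression: posterior mean and variance at Xq,
    given observed crowd densities y at map locations X."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = rbf(Xq, Xq).diagonal() - (Ks * np.linalg.solve(K, Ks.T).T).sum(-1)
    return mean, var
```

The posterior mean gives the "Abstraction Map" value at each query cell, while the posterior variance marks unexplored regions where the VLM has provided no density estimates.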
cs.RO / 20 / 2602.10942

Design, Development, and Use of Maya Robot as an Assistant for the Therapy/Education of Children with Cancer: a Pilot Study

Maya机器人作为癌症儿童治疗/教育助手的设计、开发与应用:一项初步研究
Taheri, Alireza, Alemi, Minoo, Ranjkar, Elham, Rafatnejad, Raman, Meghdari, Ali F.
Abstract
This study centers on the design and implementation of the Maya Robot, a portable elephant-shaped social robot intended to engage with children undergoing cancer treatment. Initial efforts were devoted to enhancing the robot's facial expression recognition, achieving 98% accuracy through deep neural networks. Two subsequent preliminary exploratory experiments were designed to advance the study's objectives. The first experiment compared pain levels experienced by children during the injection process, with and without the presence of the Maya robot. Twenty-five children, aged 4 to 9, undergoing cancer treatment participated in this counterbalanced study. The paired t-test results revealed a significant reduction in perceived pain when the robot was actively present in the injection room. The second experiment assessed the perspectives of hospitalized children and their mothers during engagement with Maya through a game. Forty participants, including 20 children aged 4 to 9 and their mothers, were involved. Following the human-Maya interactions, UTAUT questionnaire results indicated that children experienced significantly less anxiety than their mothers during the interaction and game play. Notably, children exhibited higher trust levels in both the robot and the games, presenting a statistically significant difference in trust levels compared to their mothers (p-value < 0.05). This preliminary exploratory study highlights the positive impact of utilizing Maya as an assistant for therapy/education in a clinical setting, particularly benefiting children undergoing cancer treatment. The findings underscore the potential of social robots in pediatric healthcare contexts, emphasizing improved pain management and emotional well-being among young patients.
Chinese Translation
本研究围绕Maya机器人(一个便携式的大象形状社交机器人)的设计与实施展开,旨在与正在接受癌症治疗的儿童进行互动。最初的工作集中在提高机器人面部表情识别的准确性,通过深度神经网络实现了98%的准确率。随后设计了两个初步探索性实验,以推进研究目标。第一个实验旨在比较儿童在注射过程中在有无Maya机器人存在下所体验的疼痛水平。25名年龄在4至9岁之间的癌症治疗儿童参与了这项对照研究。配对T检验结果显示,当机器人在注射室内积极存在时,感知的疼痛显著减少。第二个实验旨在评估住院儿童及其母亲在通过游戏与Maya互动时的看法。共有40名参与者,包括20名4至9岁的儿童及其母亲参与其中。在人机互动后,UTAUT问卷结果表明,儿童在互动和游戏过程中体验到的焦虑显著低于其母亲。值得注意的是,儿童对机器人和游戏的信任水平较高,与其母亲相比,信任水平存在统计学显著差异(P值 < 0.05)。这项初步探索性研究突显了在临床环境中利用Maya作为治疗/教育助手的积极影响,尤其是对正在接受癌症治疗的儿童的益处。研究结果强调了社交机器人在儿科医疗环境中的潜力,突出了改善疼痛管理和年轻患者情感健康的重要性。
cs.RO / 21 / 2602.10946

Developing Neural Network-Based Gaze Control Systems for Social Robots

基于神经网络的社交机器人注视控制系统的开发
Tabatabaei, Ramtin, Taheri, Alireza
Abstract
During multi-party interactions, gaze direction is a key indicator of interest and intent, making it essential for social robots to direct their attention appropriately. Understanding the social context is crucial for robots to engage effectively, predict human intentions, and navigate interactions smoothly. This study aims to develop an empirical motion-time pattern for human gaze behavior in various social situations (e.g., entering, leaving, waving, talking, and pointing) using deep neural networks trained on participants' data. We created two video clips, one for a computer screen and another for a virtual reality headset, depicting different social scenarios. Data were collected from 30 participants: 15 using an eye-tracker and 15 using an Oculus Quest 1 headset. Deep learning models, specifically Long Short-Term Memory (LSTM) networks and Transformers, were used to analyze and predict gaze patterns. Our models achieved 60% accuracy in predicting gaze direction in a 2D animation and 65% accuracy in a 3D animation. The best model was then implemented on the Nao robot, and 36 new participants evaluated its performance. The feedback indicated overall satisfaction, with those experienced in robotics rating the models more favorably.
Chinese Translation
在多方互动中,注视方向是兴趣和意图的重要指示,因此社交机器人需要适当地引导其注意力。理解社交背景对于机器人有效参与、预测人类意图和顺利进行互动至关重要。本研究旨在利用基于参与者数据的深度神经网络,开发出适用于各种社交情境(如进入、离开、挥手、交谈和指向)的人类注视行为的经验运动时间模式。我们制作了两个视频片段——一个用于计算机屏幕,另一个用于虚拟现实头戴设备,展示不同的社交场景。数据收集来自30名参与者:15名使用眼动仪,15名使用Oculus Quest 1头戴设备。我们使用深度学习模型,特别是长短期记忆网络(LSTM)和变换器(Transformers),分析和预测注视模式。我们的模型在2D动画中预测注视方向的准确率达到了60%,在3D动画中达到了65%。随后,最佳模型被应用于Nao机器人,并由36名新参与者评估其性能。反馈表明总体满意度较高,具有机器人经验的参与者对模型的评价更为积极。
cs.RO / 22 / 2602.10961

Stability Analysis of Geometric Control for a Canonical Class of Underactuated Aerial Vehicles with Spurious Forces

带有虚假力的典型欠驱动空中飞行器几何控制的稳定性分析
Orelli, Simone, Mizzoni, Mirko, Franchi, Antonio
Abstract
Standard geometric control relies on force-moment decoupling, an assumption that breaks down in many aerial platforms due to spurious forces naturally induced by control moments. While strategies for such coupled systems have been validated experimentally, a rigorous theoretical certification of their stability is currently missing. This work fills this gap by providing the first formal stability analysis for a generic class of floating rigid bodies subject to spurious forces. We introduce a canonical model and construct a Lyapunov-based proof establishing local exponential stability of the hovering equilibrium. Crucially, the analysis explicitly addresses the structural challenges - specifically the induced non-minimum-phase behavior - that prevent the application of standard cascade arguments.
Chinese Translation
标准几何控制依赖于力-力矩解耦,这一假设在许多空中平台中由于控制力矩自然引发的虚假力而失效。尽管针对这种耦合系统的策略已在实验中得到了验证,但目前缺乏对其稳定性的严格理论证明。本工作填补了这一空白,首次对受虚假力影响的一类通用浮动刚体进行了正式的稳定性分析。我们引入了一个典型模型,并构建了基于Lyapunov的方法,证明了悬停平衡点的局部指数稳定性。关键在于,分析明确解决了结构性挑战——特别是引发的非最小相位行为——这阻碍了标准级联论证的应用。
cs.RO / 23 / 2602.10980

RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

RADAR:通过真实世界动态、空间-物理智能和自主评估对视觉-语言-行动泛化进行基准测试
Chen, Yuhao, Zhan, Zhihao, Lin, Xiaoxin, Song, Zijian, Liu, Hao, Lyu, Qinhan, Zu, Yubo, Chen, Xiao, Liu, Zhiyuan, Pu, Tao, Chen, Tianshui, Wang, Keze, Lin, Liang, Wang, Guangrun
Abstract
VLA models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial--physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, instead relying on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with the expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary benchmark toward reliable and generalizable real-world evaluation of VLA models.
Chinese Translation
视觉-语言-行动(VLA)模型在具身智能方面取得了显著进展;然而,它们的评估仍然主要局限于模拟或高度受限的真实世界环境。这种不匹配造成了实质性的现实差距,强大的基准性能往往掩盖了在多样化物理环境中的糟糕泛化。我们识别出当前基准测试实践中的三个系统性缺陷,这些缺陷妨碍了公平和可靠的模型比较。(1)现有基准未能模拟真实世界动态,忽视了动态物体配置、机器人初始状态、光照变化和传感器噪声等关键因素。(2)当前协议忽视空间-物理智能,将评估简化为不涉及几何推理的机械操作任务。(3)该领域缺乏可扩展的完全自主评估,而是依赖于简单的二维指标,这些指标无法捕捉三维空间结构,或依赖于成本高昂、存在偏见且不可扩展的人机协作系统。为了解决这些限制,我们引入了RADAR(真实世界自主动态与推理),这是一个旨在系统性评估VLA泛化能力的基准,适用于现实条件。RADAR整合了三个核心组件:(1)一套有原则的物理动态;(2)专门的任务,明确测试空间推理和物理理解;(3)基于三维指标的完全自主评估管道,消除了对人工监督的需求。我们应用RADAR对多个最先进的VLA模型进行审计,揭示了它们表面能力下的严重脆弱性。在适度的物理动态下,性能急剧下降,三维交并比(3D IoU)的期望值从0.261降至0.068,受到传感器噪声的影响。此外,模型表现出有限的空间推理能力。这些发现使RADAR成为对VLA模型进行可靠和可泛化的真实世界评估的必要基准。
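The 3D IoU metric RADAR reports can be illustrated for the axis-aligned case (the benchmark's actual metric may use oriented boxes; this is a simplified stand-in):

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU for boxes given as (min_xyz, max_xyz) pairs."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # overlap volume, 0 if disjoint
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)
```

Unlike 2D pixel metrics, this volumetric overlap penalizes placement errors along all three axes, which is what makes the reported drop from 0.261 to 0.068 under sensor noise meaningful.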
cs.RO / 24 / 2602.10983

Scaling World Model for Hierarchical Manipulation Policies

用于层次化操控策略的世界模型扩展
Long, Qian, Wang, Yueze, Song, Jiaxi, Zhang, Junbo, Li, Peiyan, Wang, Wenxuan, Wang, Yuqi, Li, Haoyang, Xie, Shaoxuan, Yao, Guocai, Zhang, Hanbo, Wang, Xinlong, Wang, Zhongyuan, Lan, Xuguang, Liu, Huaping, Li, Xinghang
Abstract
Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve the generalization bottleneck, we introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. Our hierarchical framework consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in massive out-of-distribution scenarios, and the performance of the same-structured VLA in novel scenarios can be boosted from 14% to 69% with the guidance generated by the world model. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/
Chinese Translation
视觉-语言-动作(VLA)模型在通用机器人操控中展现出良好的前景,但在分布外(OOD)环境中依然脆弱,尤其是在真实机器人数据有限的情况下。为了解决泛化瓶颈,我们提出了一种层次化的视觉-语言-动作框架 VISTA,该框架利用大规模预训练世界模型的泛化能力,实现稳健且可泛化的视觉子目标任务分解。我们的层次化框架包含一个作为高层规划者的世界模型和一个作为低层执行者的VLA。高层世界模型首先将操控任务分解为带有目标图像的子任务序列,而低层策略则根据文本和视觉指导生成动作序列。与原始文本目标规范相比,这些合成的目标图像为低层策略提供了视觉和物理上的具体细节,使其能够在未见对象和新场景中实现泛化。我们在大量分布外场景中验证了视觉目标合成和我们的层次化VLA策略,结果表明,在新场景中,相同结构的VLA的性能可以在世界模型生成的指导下从14%提升至69%。结果表明,我们的方法在分布外场景中明显优于之前的基线。项目页面:https://vista-wm.github.io/
cs.RO / 25 / 2602.10997

Multi-Task Reinforcement Learning of Drone Aerobatics by Exploiting Geometric Symmetries

通过利用几何对称性实现无人机特技飞行的多任务强化学习
Guo, Zhanyu, Yin, Zikang, Zhu, Guobin, Guo, Shiliang, Zhao, Shiyu
Abstract
Flight control for autonomous micro aerial vehicles (MAVs) is evolving from steady flight near equilibrium points toward more aggressive aerobatic maneuvers, such as flips, rolls, and power loops. Although reinforcement learning (RL) has shown great potential in these tasks, conventional RL methods often suffer from low data efficiency and limited generalization. This challenge becomes more pronounced in multi-task scenarios where a single policy is required to master multiple maneuvers. In this paper, we propose a novel end-to-end multi-task reinforcement learning framework, called GEAR (Geometric Equivariant Aerobatics Reinforcement), which fully exploits the inherent SO(2) rotational symmetry in MAV dynamics and explicitly incorporates this property into the policy network architecture. By integrating an equivariant actor network, FiLM-based task modulation, and a multi-head critic, GEAR achieves both efficiency and flexibility in learning diverse aerobatic maneuvers, enabling a data-efficient, robust, and unified framework for aerobatic control. GEAR attains a 98.85% success rate across various aerobatic tasks, significantly outperforming baseline methods. In real-world experiments, GEAR demonstrates stable execution of multiple maneuvers and the capability to combine basic motion primitives to complete complex aerobatics.
Chinese Translation
自主微型飞行器(MAV)的飞行控制正从在平衡点附近的稳定飞行向更具攻击性的特技动作演变,例如翻转、滚动和动力环。尽管强化学习(RL)在这些任务中展现了巨大的潜力,但传统的RL方法往往面临数据效率低和泛化能力有限的问题。在需要单一策略掌握多种动作的多任务场景中,这一挑战尤为明显。本文提出了一种新颖的端到端多任务强化学习框架,称为GEAR(几何不变特技强化学习),该框架充分利用了MAV动力学中的SO(2)旋转对称性,并将这一特性明确地融入到策略网络架构中。通过集成不变性演员网络、基于FiLM的任务调制和多头评论员,GEAR在学习多样特技动作方面实现了效率和灵活性的结合,从而构建了一个数据高效、鲁棒且统一的特技控制框架。GEAR在各种特技任务中达到了98.85%的成功率,显著优于基线方法。在真实世界实验中,GEAR展示了多种动作的稳定执行能力,并能够将基本运动原语组合以完成复杂特技。
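The SO(2) equivariance GEAR builds into its actor can be illustrated with a toy planar policy: constructing the output from rotation-invariant features guarantees policy(R v) = R policy(v), so nothing learned for one heading must be relearned for another. A minimal check (this toy map is illustrative, not the paper's network):

```python
import numpy as np

def rot(theta):
    """2D rotation matrix for a yaw angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariant_policy(v):
    """Toy SO(2)-equivariant map: scale a planar vector by a function
    of its rotation-invariant norm, so policy(R v) == R policy(v)."""
    return np.tanh(np.linalg.norm(v)) * v
```

Baking the symmetry into the architecture, rather than hoping the policy discovers it from data, is what yields the data-efficiency gains the abstract reports.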
cs.RO / 26 / 2602.11021

ContactGaussian-WM: Learning Physics-Grounded World Model from Videos

ContactGaussian-WM:从视频中学习基于物理的世界模型
Wang, Meizhong, Jin, Wanxin, Cao, Kun, Xie, Lihua, Hong, Yiguang
Abstract
Developing world models that understand complex physical interactions is essential for advancing robotic planning and simulation. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamic motion. To address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video sequences. Our framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations. Extensive simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization capabilities. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.
Chinese Translation
开发能够理解复杂物理交互的世界模型对于推进机器人规划和仿真至关重要。然而,现有方法在数据稀缺和复杂接触丰富的动态运动条件下,往往难以准确建模环境。为了解决这些挑战,我们提出了ContactGaussian-WM,这是一种可微分的基于物理的刚体世界模型,能够直接从稀疏且接触丰富的视频序列中学习复杂的物理法则。我们的框架由两个核心组件组成:(1)用于视觉外观和碰撞几何的统一高斯表示,以及(2)一个端到端的可微分学习框架,通过封闭形式的物理引擎进行微分,从稀疏视觉观测中推断物理属性。大量的仿真和现实世界评估表明,ContactGaussian-WM在学习复杂场景方面优于最先进的方法,展现出强大的泛化能力。此外,我们展示了该框架在下游应用中的实际效用,包括数据合成和实时模型预测控制(MPC)。
cs.RO / 27 / 2602.11075

RISE: Self-Improving Robot Policy with Compositional World Model

RISE:具有组合世界模型的自我改进机器人策略
Yang, Jiazhi, Lin, Kunyang, Li, Jinwei, Zhang, Wencong, Lin, Tianwei, Wu, Longyan, Su, Zhizhong, Zhao, Hao, Zhang, Ya-Qin, Chen, Li, Luo, Ping, Yue, Xiangyu, Li, Hongyang
Abstract
Despite the sustained scaling of model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risk, hardware cost, and environment resets. To bridge this gap, we present RISE, a scalable framework for robotic reinforcement learning via imagination. At its core is a Compositional World Model that (i) predicts multi-view futures via a controllable dynamics model, and (ii) evaluates imagined outcomes with a progress value model, producing informative advantages for policy improvement. This compositional design allows state and value to be handled by best-suited yet distinct architectures and objectives. These components are integrated into a closed-loop self-improving pipeline that continuously generates imaginary rollouts, estimates advantages, and updates the policy in imagination space without costly physical interaction. Across three challenging real-world tasks, RISE yields significant improvement over prior art, with absolute performance increases of more than +35% in dynamic brick sorting, +45% in backpack packing, and +35% in box closing.
Chinese Translation
尽管模型容量和数据获取持续扩大,视觉-语言-动作(VLA)模型在接触丰富和动态操作任务中仍然脆弱,微小的执行偏差可能会导致失败。虽然强化学习(RL)提供了一条实现鲁棒性的原则性路径,但在物理世界中的在线强化学习受到安全风险、硬件成本和环境重置的限制。为了解决这一问题,我们提出了RISE,一个通过想象实现可扩展的机器人强化学习框架。其核心是一个组合世界模型,该模型(i)通过可控的动态模型预测多视角未来,并且(ii)通过进展值模型评估想象的结果,为策略改进提供有用的优势。这种组合设计允许状态和值通过最合适但又不同的架构和目标进行定制。这些组件集成到一个闭环自我改进的流程中,持续生成想象的回放,估计优势,并在想象空间中更新策略,而无需昂贵的物理交互。在三个具有挑战性的现实任务中,RISE相较于现有技术取得了显著的改进,动态砖块排序的绝对性能提高超过35%,背包打包提高45%,箱子关闭提高35%。
cs.RO / 28 / 2602.11082

Digging for Data: Experiments in Rock Pile Characterization Using Only Proprioceptive Sensing in Excavation

挖掘数据:在挖掘作业中仅利用本体感觉进行岩石堆特征化的实验
Artan, Unal, Magnusson, Martin, Marshall, Joshua A.
Abstract
Characterization of fragmented rock piles is a fundamental task in the mining and quarrying industries, where rock is fragmented by blasting, transported using wheel loaders, and then sent for further processing. This field report studies a novel method for estimating the relative particle size of fragmented rock piles from only proprioceptive data collected while digging with a wheel loader. Rather than employ exteroceptive sensors (e.g., cameras or LiDAR sensors) to estimate rock particle sizes, the studied method infers rock fragmentation from an excavator's inertial response during excavation. This paper expands on research that postulated the use of wavelet analysis to construct a unique feature that is proportional to the level of rock fragmentation. We demonstrate through extensive field experiments that the ratio of wavelet features, constructed from data obtained by excavating in different rock piles with different size distributions, approximates the ratio of the mean particle size of the two rock piles. Full-scale excavation experiments were performed with a battery electric, 18-tonne capacity, load-haul-dump (LHD) machine in representative conditions in an operating quarry. The relative particle size estimates generated with the proposed sensing methodology are compared with those obtained from both a vision-based fragmentation analysis tool and from sieving of sampled materials.
Chinese Translation
对碎石堆的特征化是采矿和采石行业中的一项基本任务,其中岩石通过爆破被碎裂,使用装载机运输,然后送往进一步加工。本报告研究了一种新颖的方法,通过仅使用在装载机挖掘过程中收集的本体感觉数据来估计碎石堆的相对颗粒大小。该方法并未采用外部传感器(例如,摄像头或激光雷达传感器)来估计岩石颗粒大小,而是通过挖掘过程中挖掘机的惯性响应推断岩石的碎裂程度。本文扩展了先前研究的成果,该研究假设使用小波分析构建一个与岩石碎裂程度成比例的独特特征。我们通过广泛的现场实验展示,从不同大小分布的岩石堆挖掘获得的数据构建的小波特征比率,近似于两个岩石堆的平均颗粒大小比率。我们在一个正在运营的采石场的代表性条件下,使用一台电池电动、18吨容量的装载-运输-卸载(LHD)机器进行了全尺度的挖掘实验。利用所提出的传感方法生成的相对颗粒大小估计与通过视觉碎裂分析工具和对采样材料进行筛分获得的结果进行了比较。
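The wavelet feature idea, that dig-cycle vibrations from coarser rock carry more detail-band energy, can be sketched with a one-level Haar transform (a rough proxy; the paper's exact feature construction differs):

```python
import numpy as np

def haar_detail_energy(x):
    """Energy of the first-level Haar wavelet detail coefficients of a
    proprioceptive vibration signal recorded while digging."""
    x = np.asarray(x, dtype=float)
    x = x[: len(x) // 2 * 2]                  # even length for pairing
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return float((detail ** 2).sum())
```

Per the abstract's hypothesis, the ratio of this feature computed from digs in two different piles approximates the ratio of the piles' mean particle sizes, with no camera or LiDAR required.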
cs.RO / 29 / 2602.11113

A receding-horizon multi-contact motion planner for legged robots in challenging environments

一种面向复杂环境中足式机器人的滚动时域多接触运动规划器
Derwent, Daniel S. J., Watson, Simon, Adorno, Bruno V.
Abstract
We present a novel receding-horizon multi-contact motion planner for legged robots in challenging scenarios, able to plan motions such as chimney climbing, navigating very narrow passages or crossing large gaps. Our approach adds new capabilities to the state of the art, including the ability to reactively re-plan in response to new information, and planning contact locations and whole-body trajectories simultaneously, simplifying the implementation and removing the need for post-processing or complex multi-stage approaches. Our method is more resistant to local minima problems than other potential field based approaches, and our quadratic-program-based posture generator returns nodes more quickly than those of existing algorithms. Rigorous statistical analysis shows that, with short planning horizons (e.g., one step ahead), our planner is faster than the state-of-the-art across all scenarios tested (between 45% and 98% faster on average, depending on the scenario), while planning less efficient motions (requiring 5% fewer to 700% more stance changes on average). In all but one scenario (Chimney Walking), longer planning horizons (e.g., four steps ahead) extended the average planning times (between 73% faster and 400% slower than the state-of-the-art) but resulted in higher quality motion plans (between 8% more and 47% fewer stance changes than the state-of-the-art).
Chinese Translation
我们提出了一种新颖的滚动时域多接触运动规划器,适用于在复杂场景中运行的足式机器人,能够规划如爬烟囱、穿越狭窄通道或跨越大间隙等动作。我们的方法为现有技术增加了新功能,包括能够根据新信息进行反应性重新规划,以及同时规划接触位置和全身轨迹,从而简化了实现过程,消除了后处理或复杂多阶段方法的需求。与其他基于势场的方法相比,我们的方法对局部极小值问题的抵抗力更强,而我们的基于二次规划的姿态生成器比现有算法更快地返回节点。严格的统计分析表明,在短规划时域下(例如,前瞻一步),我们的规划器在所有测试场景中都比现有技术更快(平均速度提升在45%到98%之间,具体取决于场景),同时规划的动作效率较低(平均所需支撑变化从少5%到多700%不等)。在除一个场景(烟囱行走)外,较长的规划时域(例如,前瞻四步)延长了平均规划时间(比现有技术快73%到慢400%),但产生了更高质量的运动规划(支撑变化比现有技术多8%到少47%)。
cs.RO / 30 / 2602.11142

Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

基于归一化流的数据高效层次目标条件强化学习
Garg, Shaswat, Moezzi, Matin, Da Silva, Brandon
Abstract
Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces Normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume-preserving (RealNVP) policies and PAC-style sample-efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
Chinese Translation
层次目标条件强化学习(H-GCRL)为解决复杂的长时间任务提供了一个强大的框架,通过将任务分解为结构化的子目标。然而,其实际应用受到数据效率低下和策略表达能力有限的制约,尤其是在离线或数据稀缺的情况下。本文提出了一种基于归一化流的层次隐式 Q 学习(NF-HIQL)新框架,该框架在层次的高层和低层均用表达能力强的归一化流策略替代单模高斯策略。这种设计使得可处理的对数似然计算、有效采样以及建模丰富的多模态行为成为可能。我们推导了新的理论保证,包括对实值非体积保持(RealNVP)策略的显式 KL 散度界限和 PAC 风格的样本效率结果,表明 NF-HIQL 在提高泛化能力的同时保持了稳定性。在实证方面,NF-HIQL 在 OGBench 中的多种长时间任务(如运动、运球和多步操作)上进行了评估。NF-HIQL 始终优于先前的目标条件和层次基线,展示了在数据有限的情况下的卓越鲁棒性,并突显了基于流的架构在可扩展、数据高效的层次强化学习中的潜力。
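A RealNVP affine coupling layer, the flow family named in the theoretical analysis, can be sketched in a few lines; the scale/shift "networks" in the test are stand-in lambdas for the small MLPs a real policy would learn:

```python
import numpy as np

def coupling_forward(x, s_net, t_net):
    """One RealNVP affine coupling layer (sketch). The first half of x
    passes through unchanged; the second half is scaled and shifted by
    functions of the first half. Returns (y, log|det J|), where
    log|det J| is simply the sum of the scale outputs."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s, t = s_net(x1), t_net(x1)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2], axis=-1), s.sum(-1)

def coupling_inverse(y, s_net, t_net):
    """Exact inverse of the coupling layer above."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    s, t = s_net(y1), t_net(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=-1)
```

The closed-form log-determinant is what makes the log-likelihood tractable, and the exact inverse is what makes sampling cheap, the two properties the abstract credits for replacing unimodal Gaussian policies.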
cs.RO / 31 / 2602.11143

APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots

APEX:面向类人机器人的自适应高平台穿越学习
Wang, Yikai, Leng, Tingxuan, Lin, Changyi, Liu, Shiqi, Simon, Shir, Chen, Bingqing, Francis, Jonathan, Zhao, Ding
Abstract
Humanoid locomotion has advanced rapidly with deep reinforcement learning (DRL), enabling robust feet-based traversal over uneven terrain. Yet platforms beyond leg length remain largely out of reach because current RL training paradigms often converge to jumping-like solutions that are high-impact, torque-limited, and unsafe for real-world deployment. To address this gap, we propose APEX, a system for perceptive, climbing-based high-platform traversal that composes terrain-conditioned behaviors: climb-up and climb-down at vertical edges, walking or crawling on the platform, and stand-up and lie-down for posture reconfiguration. Central to our approach is a generalized ratchet progress reward for learning contact-rich, goal-reaching maneuvers. It tracks the best-so-far task progress and penalizes non-improving steps, providing dense yet velocity-free supervision that enables efficient exploration under strong safety regularization. Based on this formulation, we train LiDAR-based full-body maneuver policies and reduce the sim-to-real perception gap through a dual strategy: modeling mapping artifacts during training and applying filtering and inpainting to elevation maps during deployment. Finally, we distill all six skills into a single policy that autonomously selects behaviors and transitions based on local geometry and commands. Experiments on a 29-DoF Unitree G1 humanoid demonstrate zero-shot sim-to-real traversal of 0.8 meter platforms (approximately 114% of leg length), with robust adaptation to platform height and initial pose, as well as smooth and stable multi-skill transitions.
Chinese Translation
类人运动在深度强化学习(DRL)的推动下迅速发展,使得类人机器人能够在不平坦的地形上实现稳健的足部穿越。然而,超出腿长的高平台仍然在很大程度上无法实现,因为当前的强化学习训练范式往往收敛于类似跳跃的解决方案,这些方案具有高冲击、扭矩限制,并且在实际应用中不安全。为了解决这一问题,我们提出了APEX,一个用于感知性、基于攀爬的高平台穿越系统,该系统组合了基于地形的行为:在垂直边缘的向上攀爬和向下攀爬、在平台上行走或爬行,以及为姿态重配置而进行的站立和躺下。我们方法的核心是一个广义的棘轮进展奖励,用于学习接触丰富的、目标导向的动作。该奖励跟踪迄今为止的最佳任务进展,并对未改进的步骤进行惩罚,提供密集但无速度的监督,从而在强安全正则化下实现高效探索。基于这一框架,我们训练了基于激光雷达的全身操控策略,并通过双重策略减少了模拟到现实的感知差距:在训练期间建模映射伪影,并在部署期间对高程图进行滤波和修复。最后,我们将所有六项技能提炼为单一策略,该策略能够根据局部几何形状和指令自主选择行为和过渡。在29自由度的Unitree G1类人机器人上进行的实验表明,能够实现0.8米高平台(约114%腿长)的零样本模拟到现实穿越,且对平台高度和初始姿态具有稳健的适应能力,以及平滑稳定的多技能过渡。
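The generalized ratchet progress reward can be sketched directly from the abstract's description: reward each step by the increase of best-so-far task progress and penalize non-improving steps (the penalty magnitude here is an assumption):

```python
def ratchet_reward(progress, step_penalty=0.01):
    """Ratchet progress reward (sketch): per-step reward equals the
    increase of the best-so-far task progress; non-improving steps
    receive a small penalty. `progress` is a per-step scalar in [0, 1]."""
    best = progress[0]
    rewards = []
    for p in progress[1:]:
        if p > best:
            rewards.append(p - best)   # pay only for new progress
            best = p
        else:
            rewards.append(-step_penalty)
    return rewards
```

Because the total reward telescopes to the final best progress regardless of how fast it was reached, the signal is dense yet velocity-free, which is what allows exploration under strong safety regularization.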
cs.RO / 32 / 2602.11150

YOR: Your Own Mobile Manipulator for Generalizable Robotics

YOR:您自己的通用移动操控机器人
Anjaria, Manan H, Erciyes, Mehmet Enes, Ghatnekar, Vedant, Navarkar, Neha, Etukuru, Haritheja, Jiang, Xiaole, Patel, Kanad, Kabra, Dhawal, Wojno, Nicholas, Prayage, Radhika Ajay, Chintala, Soumith, Pinto, Lerrel, Shafiullah, Nur Muhammad Mahi, Cui, Zichen Jeff
Abstract
Recent advances in robot learning have generated significant interest in capable platforms that may eventually approach human-level competence. This interest, combined with the commoditization of actuators, has propelled growth in low-cost robotic platforms. However, the optimal form factor for mobile manipulation, especially on a budget, remains an open question. We introduce YOR, an open-source, low-cost mobile manipulator that integrates an omnidirectional base, a telescopic vertical lift, and two arms with grippers to achieve whole-body mobility and manipulation. Our design emphasizes modularity, ease of assembly using off-the-shelf components, and affordability, with a bill-of-materials cost under 10,000 USD. We demonstrate YOR's capability by completing tasks that require coordinated whole-body control, bimanual manipulation, and autonomous navigation. Overall, YOR offers competitive functionality for mobile manipulation research at a fraction of the cost of existing platforms. Project website: https://www.yourownrobot.ai/
Chinese Translation
近年来,机器人学习的进展引发了对能够最终接近人类水平能力的平台的显著兴趣。这种兴趣,加上执行器的商品化,推动了低成本机器人平台的发展。然而,移动操控的最佳形态,尤其是在预算有限的情况下,仍然是一个未解的问题。我们介绍了YOR,一个开源、低成本的移动操控机器人,它集成了全向底盘、伸缩垂直升降机和两个带夹持器的手臂,以实现全身的移动和操控。我们的设计强调模块化、使用现成组件的组装简便性和经济性,材料成本低于10,000美元。我们通过完成需要协调全身控制、双手操控和自主导航的任务来展示YOR的能力。总体而言,YOR以现有平台成本的一小部分提供了竞争性的移动操控功能。项目网站:https://www.yourownrobot.ai/
计算机视觉 (Computer Vision)
83
cs.CV / 1 / 2602.10137

Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation

具有平滑注意特征融合的多编码器ConvNeXt网络用于多光谱语义分割
Ramos, Leo Thomas, Sappa, Angel D.
Abstract
This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
Chinese Translation
本研究提出了MeCSAFNet,一种用于多光谱图像土地覆盖分割的多分支编码器-解码器架构。该模型通过双ConvNeXt编码器分别处理可见和非可见通道,随后由各自的解码器重建空间信息。一个专用的融合解码器在多个尺度上整合中间特征,将细致的空间线索与高级光谱表示相结合。特征融合进一步通过CBAM注意机制增强,而ASAU激活函数则有助于稳定和高效的优化。该模型旨在处理不同的光谱配置,包括结合RGB和NIR波段的4通道(4c)输入,以及包含NDVI和NDWI指数的6通道(6c)输入。在Five-Billion-Pixels(FBP)和Potsdam数据集上的实验表明,该模型显著提升了性能。在FBP上,MeCSAFNet-base(6c)在mIoU上超越了U-Net(4c)+19.21%、U-Net(6c)+14.72%、SegFormer(4c)+19.62%和SegFormer(6c)+14.74%。在Potsdam上,MeCSAFNet-large(4c)在mIoU上较DeepLabV3+(4c)提高了+6.48%、DeepLabV3+(6c)+5.85%、SegFormer(4c)+9.11%和SegFormer(6c)+4.80%。该模型在多个最新的先进方法上也实现了一致的性能提升。此外,MeCSAFNet的紧凑变体在较低的训练时间和减少的推理成本下也表现出显著的性能,支持其在资源受限环境中的部署。
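The 6-channel input described above augments RGB+NIR with NDVI and NDWI index channels. A sketch of that input construction using the standard index definitions, NDVI = (NIR - R)/(NIR + R) and NDWI = (G - NIR)/(G + NIR); the epsilon guard against empty denominators is an implementation assumption:

```python
import numpy as np

def six_channel_input(rgb: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Build the (H, W, 6) input: R, G, B, NIR, NDVI, NDWI.

    rgb: (H, W, 3) reflectance in [0, 1]; nir: (H, W) reflectance.
    """
    r, g = rgb[..., 0], rgb[..., 1]
    ndvi = (nir - r) / (nir + r + eps)  # vegetation index
    ndwi = (g - nir) / (g + nir + eps)  # water index (McFeeters)
    return np.dstack([rgb, nir, ndvi, ndwi])
```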
cs.CV / 2 / 2602.10138

Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

图表理解的多模态信息融合:多模态大型语言模型的演变、局限性与认知增强的综述
Yi, Zhihang, Zhao, Jian, Lv, Jiancheng, Wang, Tao
Abstract
Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.
Chinese Translation
图表理解是一项典型的信息融合任务,需要无缝整合图形和文本数据以提取意义。多模态大型语言模型(MLLMs)的出现彻底改变了这一领域,但基于MLLM的图表分析仍然存在碎片化的现象,缺乏系统性的组织。本文综述提供了这一新兴前沿领域的全面路线图,通过构建领域的核心组件来进行系统化分析。我们首先分析了在图表中融合视觉和语言信息的基本挑战。接着,我们对下游任务和数据集进行了分类,提出了一种新的典型和非典型基准的分类法,以突出该领域不断扩展的范围。随后,我们全面回顾了方法论的演变,追溯了从经典深度学习技术到利用复杂融合策略的最先进MLLM范式的发展。通过批判性地审视当前模型的局限性,特别是它们在感知和推理方面的不足,我们识别出一些有前景的未来方向,包括先进的对齐技术和用于认知增强的强化学习。本文旨在为研究人员和从业者提供一个结构化的理解,阐明MLLM如何变革图表信息融合,并促进向更强大和可靠系统的进展。
cs.CV / 3 / 2602.10143

MPA: Multimodal Prototype Augmentation for Few-Shot Learning

MPA:用于少样本学习的多模态原型增强
Wu, Liwen, Wang, Wei, Zhao, Lei, Gao, Zhan, Lin, Qika, Yao, Shaowen, Liu, Zuozhu, Pu, Bin
Abstract
Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples and has been widely applied in fields such as natural science, remote sensing, and medical images. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain setting, respectively, in the 5-way 1-shot setting.
Chinese Translation
近年来,少样本学习(FSL)已成为一种热门任务,旨在仅通过少量标记示例识别新类别,并广泛应用于自然科学、遥感和医学图像等领域。然而,现有大多数方法仅关注视觉模态,并直接从原始支持图像计算原型,缺乏全面和丰富的多模态信息。为了解决这些局限性,我们提出了一种新颖的多模态原型增强FSL框架,称为MPA,其中包括基于大型语言模型的多变语义增强(LMSE)、分层多视图增强(HMA)和自适应不确定类吸收器(AUCA)。LMSE利用大型语言模型生成多样的释义类别描述,丰富支持集,提供额外的语义线索。HMA利用自然和多视图增强技术来提高特征多样性(例如,视距、相机角度和光照条件的变化)。AUCA通过插值和高斯采样引入不确定类来建模不确定性,有效吸收不确定样本。在四个单域和六个跨域FSL基准上的广泛实验表明,MPA在大多数设置下的性能优于现有的最先进方法。值得注意的是,在5-way 1-shot设置中,MPA在单域和跨域设置中分别超过第二好的方法12.29%和24.56%。
cs.CV / 4 / 2602.10146

VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

VERA:识别和利用长文本理解中的视觉证据检索头
Pei, Rongcan, Li, Huan, Guo, Fang, Zhu, Qi
Abstract
While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.
Chinese Translation
尽管视觉语言模型(VLMs)在文本理解方面展现出良好的前景,但在处理长文本和复杂推理任务时面临重大挑战。本文分析了VLMs中长文本处理的内部机制,以理解其性能瓶颈。通过注意力分析的视角,我们识别出特定的视觉证据检索(VER)头——一组稀疏且动态的注意力头,关键在于推理过程中定位视觉线索,这与静态光学字符识别(OCR)头不同。我们证明这些头对模型性能具有因果关系;对其进行屏蔽会导致显著的性能下降。基于这一发现,我们提出了VERA(视觉证据检索增强),这是一个无训练的框架,通过检测模型的不确定性(即熵)来触发VER头所关注的视觉证据的明确表述。全面的实验表明,VERA显著提升了开源VLMs的长文本理解能力:在五个基准测试中,Qwen3-VL-8B-Instruct的平均相对提升为21.3%,GLM-4.1V-Thinking的提升为20.1%。
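VERA triggers explicit verbalization of visual evidence when model uncertainty, measured by entropy, is high. A minimal sketch of such a trigger over a next-token distribution; the normalization and threshold are assumptions, and the paper's exact criterion may differ:

```python
import numpy as np

def should_verbalize(probs: np.ndarray, tau: float = 0.5) -> bool:
    """Fire the evidence-verbalization step when the normalized Shannon
    entropy of the next-token distribution exceeds tau."""
    p = probs[probs > 0]
    entropy = -np.sum(p * np.log(p))
    max_entropy = np.log(len(probs))  # entropy of the uniform distribution
    return bool(entropy / max_entropy > tau)
```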
cs.CV / 5 / 2602.10159

Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

超越封闭池视频检索:一个用于现实世界视频搜索和时刻定位的基准与智能体框架
Yu, Tao, Yang, Yujia, Jin, Haopeng, Gong, Junhao, Chen, Xinlong, Zhou, Yuxuan, Zhang, Shanbin, Yang, Jiabing, Wang, Xinming, Yi, Hongzhu, Nie, Ping, Zou, Kai, Zhang, Zhang, Huang, Yan, Wang, Liang, Yeshani, Tao, Ruiwen, Ma, Jin, Liang, Haijin, Luo, Jinwen
Abstract
Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present \textbf{RVMS-Bench}, a comprehensive system for evaluating real-world video memory search. It consists of \textbf{1,440 samples} spanning \textbf{20 diverse categories} and \textbf{four duration groups}, sourced from \textbf{real-world open-web videos}. RVMS-Bench utilizes a hierarchical description framework encompassing \textbf{Global Impression, Key Moment, Temporal Context, and Auditory Memory} to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose \textbf{RACLO}, an agentic framework that employs abductive reasoning to simulate the human ``Recall-Search-Verify'' cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
Chinese Translation
传统的视频检索基准侧重于将精确描述与封闭视频池进行匹配,未能反映现实世界中模糊、多维记忆的搜索特征。我们提出了\textbf{RVMS-Bench},这是一个用于评估现实世界视频记忆搜索的综合系统。它包含\textbf{1,440 个样本},涵盖\textbf{20 个多样化类别} 和\textbf{四个时长组},样本来源于\textbf{现实世界的开放网络视频}。RVMS-Bench 利用一个层次描述框架,包括\textbf{全局印象、关键时刻、时间上下文和听觉记忆},以模拟现实的多维搜索线索,所有样本均通过人机协作协议严格验证。我们进一步提出了\textbf{RACLO},一个智能体框架,采用溯因推理来模拟人类的“回忆-搜索-验证”认知过程,有效应对通过模糊记忆在现实世界中搜索视频的挑战。实验表明,现有的多模态大语言模型(MLLMs)在基于模糊记忆的现实世界视频检索和时刻定位方面仍表现出不足的能力。我们相信这项工作将促进视频检索在现实世界非结构化场景中的鲁棒性发展。
cs.CV / 6 / 2602.10160

AD$^2$: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems

AD$^2$: 端到端自主驾驶系统视觉感知中的对抗威胁分析与检测
Sahu, Ishan, Hazra, Somnath, Aditya, Somak, Dey, Soumyajit
Abstract
End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD$^2$) based on attention mechanisms that capture spatial-temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.
Chinese Translation
端到端自主驾驶系统已取得显著进展,但其对抗鲁棒性仍然在很大程度上未被深入探讨。在本研究中,我们在CARLA环境下对最先进的自主驾驶代理进行闭环评估,针对黑箱对抗威胁模型进行分析。具体而言,我们考虑了视觉感知管道上的三种典型攻击向量:(i) 由声波引起的基于物理的模糊攻击,(ii) 扭曲捕获图像的电磁干扰攻击,以及 (iii) 通过在图像上添加精心设计的有界扰动来注入幽灵物体的数字攻击。我们对两个先进代理Transfuser和Interfuser的实验显示,它们对这些攻击存在严重的脆弱性,在最坏情况下,驾驶评分下降高达99%,引发了切实的安全担忧。为了帮助缓解这些威胁,我们进一步提出了一种基于注意力机制的轻量级自主驾驶系统攻击检测模型(AD$^2$),该模型能够捕捉时空一致性。在CARLA的多摄像头输入下进行的全面实验表明,我们的检测器在检测能力和计算效率上优于现有方法。
cs.CV / 7 / 2602.10173

ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop

ArtisanGS:基于AI与人机协作的高斯点选择交互工具
Tsang, Clement Fuji, Hu, Anita, Perel, Or, Kolve, Carsten, Shugrina, Maria
Abstract
Representations in the family of 3D Gaussian Splats (3DGS) are growing into a viable alternative to traditional graphics for an expanding number of applications, including recent techniques that facilitate physics simulation and animation. However, extracting usable objects from in-the-wild captures remains challenging, and controllable editing techniques for this representation are limited. Unlike the bulk of emerging techniques, which focus on automatic solutions or high-level editing, we introduce an interactive suite of tools centered on versatile Gaussian Splat selection and segmentation. We propose a fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections. This technique allows for user intervention in the case of errors and is further coupled with flexible manual selection and segmentation tools. These allow a user to achieve virtually any binary segmentation of an unstructured 3DGS scene. We evaluate our toolset against the state-of-the-art for Gaussian Splat selection and demonstrate its utility for downstream applications by developing a user-guided local editing approach, leveraging a custom Video Diffusion Model. With flexible selection tools, users have direct control over the areas that the AI can modify. Our selection and editing tools can be used for any in-the-wild capture without additional optimization.
Chinese Translation
3D高斯点(3D Gaussian Splats, 3DGS)表示法正在成为传统图形的一种可行替代方案,适用于越来越多的应用,包括最近促进物理仿真和动画的技术。然而,从实际捕获中提取可用对象仍然具有挑战性,并且对这种表示法的可控编辑技术有限。与大多数新兴技术专注于自动解决方案或高级编辑不同,我们提出了一套围绕多功能高斯点选择和分割的交互工具。我们提出了一种快速的AI驱动方法,将用户引导的2D选择掩模传播到3DGS选择中。这种技术允许用户在出现错误的情况下进行干预,并进一步结合灵活的手动选择和分割工具。这些工具使用户能够实现几乎任何非结构化3DGS场景的二元分割。我们将我们的工具集与高斯点选择的最新技术进行了评估,并通过开发一种用户引导的局部编辑方法,利用定制的视频扩散模型,展示了它们在下游应用中的实用性。通过灵活的选择工具,用户可以直接控制AI可以修改的区域。我们的选择和编辑工具可以用于任何实际捕获,而无需额外的优化。
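The core step above propagates a user-drawn 2D mask to a 3DGS selection. A naive single-view baseline for that step is sketched below: project each Gaussian center through a pinhole camera and test mask membership. ArtisanGS's AI-driven method goes beyond this (handling errors and manual refinement), but the geometry is the same; the camera convention here is an assumption:

```python
import numpy as np

def select_splats_by_mask(centers: np.ndarray, K: np.ndarray,
                          mask: np.ndarray) -> np.ndarray:
    """Select Gaussians whose centers project inside a 2D boolean mask.

    centers: (N, 3) points in camera coordinates with z > 0.
    K: 3x3 pinhole intrinsics. mask: (H, W) boolean selection mask.
    Returns an (N,) boolean selection over the splats.
    """
    uvw = centers @ K.T                               # pinhole projection
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    H, W = mask.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    selected = np.zeros(len(centers), dtype=bool)
    selected[inside] = mask[uv[inside, 1], uv[inside, 0]]
    return selected
```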
cs.CV / 8 / 2602.10179

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

当提示变为视觉:面向视觉的大型图像编辑模型的越狱攻击
Hou, Jiacheng, Sun, Yining, Jin, Ruochong, Han, Haochen, Liu, Fangming, Chan, Wai Kin Victor, Wang, Alex Jinpeng
Abstract
Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
Chinese Translation
近年来,大型图像编辑模型的进展使得从文本驱动指令向视觉提示编辑的范式转变,用户意图直接通过视觉输入(如标记、箭头和视觉文本提示)进行推断。虽然这一范式大大扩展了可用性,但也引入了一个关键且未被充分探索的安全风险:攻击面本身变得可视化。在本研究中,我们提出了视觉中心越狱攻击(Vision-Centric Jailbreak Attack, VJA),这是首个通过视觉输入纯粹传达恶意指令的视觉到视觉的越狱攻击。为了系统性地研究这一新兴威胁,我们引入了IESBench,这是一个面向安全的图像编辑模型基准。对IESBench的广泛实验表明,VJA有效地攻破了最先进的商业模型,在Nano Banana Pro上实现了高达80.9%的攻击成功率,在GPT-Image-1.5上实现了70.1%的成功率。为了减轻这一脆弱性,我们提出了一种基于内省多模态推理的无训练防御方法,该方法将对齐较差模型的安全性显著提升至与商业系统相当的水平,而无需辅助防护模型且计算开销微乎其微。我们的研究揭示了新的脆弱性,提供了基准和实用防御,以推动现代图像编辑系统的安全性和可信度。警告:本文包含由大型图像编辑模型生成的攻击性图像。
cs.CV / 9 / 2602.10221

DEGMC: Denoising Diffusion Models Based on Riemannian Equivariant Group Morphological Convolutions

DEGMC:基于黎曼等变群形态卷积的去噪扩散模型
Diop, El Hadji S., Fall, Thierno, Daoudi, Mohamed
Abstract
In this work, we address two major issues in recent Denoising Diffusion Probabilistic Models (DDPM): {\bf 1)} geometric key feature extraction and {\bf 2)} network equivariance. Since the DDPM prediction network relies on the U-net architecture, which is theoretically only translation equivariant, we introduce a geometric approach combined with an equivariance property of the more general Euclidean group, which includes rotations, reflections, and permutations. We introduce the notion of group morphological convolutions in Riemannian manifolds, which are derived from the viscosity solutions of first-order Hamilton-Jacobi-type partial differential equations (PDEs) that act as morphological multiscale dilations and erosions. We add a convection term to the model and solve it using the method of characteristics. This helps us better capture nonlinearities, represent thin geometric structures, and incorporate symmetries into the learning process. Experimental results on the MNIST, RotoMNIST, and CIFAR-10 datasets show noticeable improvements compared to the baseline DDPM model.
Chinese Translation
在本研究中,我们解决了近期去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPM)中的两个主要问题:{\bf 1)} 几何关键特征提取和 {\bf 2)} 网络等变性。由于DDPM预测网络依赖于U-net架构,而该架构在理论上仅具有平移等变性,我们引入了一种几何方法,结合了更一般的欧几里得群的等变性特性,该群包括旋转、反射和置换。我们引入了在黎曼流形上的群形态卷积的概念,这些卷积源自一阶哈密顿-雅可比型偏微分方程(PDEs)的粘性解,这些方程起到形态学多尺度膨胀和侵蚀的作用。我们在模型中添加了对流项,并使用特征方法求解。这有助于我们更好地捕捉非线性,表示细微的几何结构,并将对称性纳入学习过程。对MNIST、RotoMNIST和CIFAR-10数据集的实验结果显示,与基线DDPM模型相比,显著改善了性能。
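The PDE machinery above generalizes classical multiscale morphology: for a flat structuring element, the viscosity solution of the Hamilton-Jacobi equation u_t = |∇u| at time t is exactly dilation by a ball of radius t (erosion uses the min). A 1-D sketch of that flat dilation as a sliding maximum:

```python
import numpy as np

def flat_dilation_1d(f: np.ndarray, radius: int) -> np.ndarray:
    """Flat morphological dilation: (f ⊕ B)(x) = max over |y| <= radius
    of f(x - y), i.e. a sliding maximum over a window of that half-width."""
    n = len(f)
    out = np.empty(n, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out[i] = f[lo:hi].max()
    return out
```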
cs.CV / 10 / 2602.10239

XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability

XSPLAIN:基于Splat的原型学习的可解释性框架,支持可解释人工智能
Galus, Dominik, Farganus, Julia, Zapala, Tymoteusz, Czachorowski, Mikołaj, Borycki, Piotr, Spurek, Przemysław, Syga, Piotr
Abstract
3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of both the generative models and the classification of the splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive ``this looks like that'' reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations as the best 48.4\% of the time, significantly outperforming baselines $(p<0.001)$, showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: https://github.com/Solvro/ml-splat-xai
Chinese Translation
3D高斯Splatting(3DGS)迅速成为高保真3D重建的标准,但其在多个关键领域的应用受到生成模型和Splat分类缺乏可解释性的限制。虽然针对其他3D表示(如点云)存在可解释性方法,但它们通常依赖于模糊的显著性图,无法捕捉高斯原语的体积一致性。我们提出了XSPLAIN,这是第一个专门为3DGS分类设计的先验原型基础可解释性框架。我们的方法利用了体素聚合的PointNet骨干网络和一种新颖的可逆正交变换,能够解耦特征通道以实现可解释性,同时严格保持原始决策边界。解释基于具有代表性的训练示例,使得直观的“这个看起来像那个”的推理成为可能,而不会降低分类性能。一项严格的用户研究(N=51)表明,参与者选择XSPLAIN解释作为最佳解释的比例为48.4%,显著优于基线(p<0.001),显示出XSPLAIN提供了透明性和用户信任。该工作的源代码可在以下链接获取:https://github.com/Solvro/ml-splat-xai
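The key invariance claimed above, that an invertible orthogonal transformation can re-express feature channels without moving the decision boundaries, can be checked in a few lines: rotating the features by Q while rotating the classifier by Q-transpose leaves the logits unchanged. The random Q below merely stands in for XSPLAIN's learned basis:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 8, 4
feats = rng.normal(size=(10, D))           # penultimate-layer features
W = rng.normal(size=(D, C))                # linear classifier weights

# Any orthogonal basis change (Q @ Q.T == I); XSPLAIN learns one that
# disentangles the channels. A random Q suffices for the identity check.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))

logits_orig = feats @ W
logits_rot = (feats @ Q) @ (Q.T @ W)       # interpret in the rotated basis

assert np.allclose(logits_orig, logits_rot)  # decisions are preserved
```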
cs.CV / 11 / 2602.10259

PMMA: The Polytechnique Montreal Mobility Aids Dataset

PMMA:蒙特利尔理工学院移动辅助设备数据集
Liu, Qingwu, Saunier, Nicolas, Bilodeau, Guillaume-Alexandre
Abstract
This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment, where volunteers used wheelchairs, canes, and walkers, resulting in nine categories of pedestrians: pedestrians; cane users; two types of walker users (walking or resting); and five types of wheelchair users: wheelchair users, people pushing empty wheelchairs, and three categories for occupied wheelchairs (the entire pushing group, the pusher, and the person seated in the wheelchair). To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at https://doi.org/10.5683/SP3/XJPQUG, and the video processing and model training code is available at https://github.com/DatasetPMMA/PMMA.
Chinese Translation
本研究介绍了一个新的行人检测数据集,名为PMMA,该数据集包含使用移动辅助设备的行人。数据集是在户外环境中收集的,志愿者使用轮椅、拐杖和助行器,形成了九个类别的行人:行人、拐杖使用者、两种类型的助行器使用者(无论是行走还是休息)、五种类型的轮椅使用者,包括轮椅使用者、推空轮椅的人,以及三种类型的推占用轮椅的用户,包括整个推行组、推者和坐在轮椅上的人。为了建立基准,采用了七种目标检测模型(Faster R-CNN、CenterNet、YOLOX、DETR、Deformable DETR、DINO 和 RT-DETR)以及三种跟踪算法(ByteTrack、BOT-SORT 和 OC-SORT),并在MMDetection框架下实施。实验结果表明,YOLOX、Deformable DETR 和 Faster R-CNN 实现了最佳检测性能,而三种跟踪器之间的差异相对较小。PMMA数据集可在https://doi.org/10.5683/SP3/XJPQUG公开获取,视频处理和模型训练代码可在https://github.com/DatasetPMMA/PMMA获取。
cs.CV / 12 / 2602.10265

Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing

基于色度计监督的皮肤色调估计:来自皮肤镜图像的公平性审计
Benčević, Marin, Romić, Krešimir, Tolić, Ivana Hartmann, Galić, Irena
Abstract
Neural-network-based diagnosis from dermatoscopic images is increasingly used for clinical decision support, yet studies report performance disparities across skin tones. Fairness auditing of these models is limited by the lack of reliable skin-tone annotations in public dermatoscopy datasets. We address this gap with neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression, using in-person Fitzpatrick labels and colorimeter measurements as targets. We further leverage extensive pretraining on synthetic and real dermatoscopic and clinical images. The Fitzpatrick model achieves agreement comparable to human crowdsourced annotations, and ITA predictions show high concordance with colorimeter-derived ITA, substantially outperforming pixel-averaging approaches. Applying these estimators to ISIC 2020 and MILK10k, we find that fewer than 1% of subjects belong to Fitzpatrick types V and VI. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements, and it supports growing evidence of clinically relevant performance gaps across skin-tone groups.
Chinese Translation
基于神经网络的皮肤镜图像诊断在临床决策支持中越来越普遍,但研究报告显示不同肤色之间的性能差异。由于公共皮肤镜数据集中缺乏可靠的肤色标注,这些模型的公平性审计受到限制。我们通过神经网络解决了这一问题,该网络通过序数回归预测Fitzpatrick皮肤类型,并通过颜色回归预测个体类型角(ITA),以面对面获得的Fitzpatrick标签和色度计测量作为目标。我们进一步利用在合成和真实皮肤镜及临床图像上的广泛预训练。Fitzpatrick模型达到了与人类众包标注相当的一致性,而ITA预测与色度计推导的ITA高度一致,显著优于像素平均方法。将这些估计器应用于ISIC 2020和MILK10k,我们发现少于1%的受试者属于Fitzpatrick类型V和VI。我们发布代码和预训练模型,作为快速肤色标注和偏见审计的开源工具。据我们所知,这是第一个经过色度计测量验证的皮肤镜肤色估计神经网络,并且支持日益增长的证据,表明不同肤色群体之间存在临床相关的性能差距。
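The Individual Typology Angle regressed above has a standard closed form in CIELAB coordinates, given here for reference (the paper predicts ITA directly from dermatoscopic images, validated against colorimeter-derived values):

```python
import math

def individual_typology_angle(L_star: float, b_star: float) -> float:
    """ITA in degrees from CIELAB lightness L* and yellow-blue b*:
        ITA = arctan((L* - 50) / b*) * 180 / pi
    Higher ITA corresponds to lighter skin. atan2 is used so the value
    stays defined as b* approaches zero (an implementation choice)."""
    return math.degrees(math.atan2(L_star - 50.0, b_star))
```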
cs.CV / 13 / 2602.10278

ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting

ERGO:基于超额风险引导的高保真单目3D高斯点云优化
Ma, Zehua, Li, Hanhui, Xie, Zhenyu, Luo, Xiaonan, Kampffmeyer, Michael, Gao, Feng, Liang, Xiaodan
Abstract
Generating 3D content from a single image remains a fundamentally challenging and ill-posed problem due to the inherent absence of geometric and textural information in occluded regions. While state-of-the-art generative models can synthesize auxiliary views to provide additional supervision, these views inevitably contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction. To effectively harness these imperfect supervisory signals, we propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. Specifically, ERGO decomposes the optimization losses in 3D Gaussian splatting into two components, i.e., excess risk that quantifies the suboptimality gap between current and optimal parameters, and Bayes error that models the irreducible noise inherent in synthesized views. This decomposition enables ERGO to dynamically estimate the view-specific excess risk and adaptively adjust loss weights during optimization. Furthermore, we introduce geometry-aware and texture-aware objectives that complement the excess-risk-derived weighting mechanism, establishing a synergistic global-local optimization paradigm. Consequently, ERGO demonstrates robustness against supervision noise while consistently enhancing both geometric fidelity and textural quality of the reconstructed 3D content. Extensive experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
Chinese Translation
从单幅图像生成3D内容仍然是一个根本性挑战且不适定的问题,因为在遮挡区域缺乏几何和纹理信息。尽管最先进的生成模型能够合成辅助视图以提供额外的监督,但这些视图不可避免地包含几何不一致性和纹理错位,这在3D重建过程中会传播并放大伪影。为了有效利用这些不完美的监督信号,我们提出了一种基于超额风险分解的自适应优化框架,称为ERGO。具体而言,ERGO将3D高斯点云优化损失分解为两个组成部分,即量化当前参数与最优参数之间次优性差距的超额风险,以及建模合成视图中固有不可约噪声的贝叶斯误差。这种分解使得ERGO能够动态估计视图特定的超额风险,并在优化过程中自适应调整损失权重。此外,我们引入了几何感知和纹理感知目标,以补充基于超额风险的加权机制,建立协同的全局-局部优化范式。因此,ERGO在面对监督噪声时表现出鲁棒性,同时持续提升重建3D内容的几何保真度和纹理质量。在Google扫描对象数据集和OmniObject3D数据集上的大量实验表明,ERGO优于现有的最先进方法。
cs.CV / 14 / 2602.10319

A Low-Rank Defense Method for Adversarial Attack on Diffusion Models

一种针对扩散模型对抗攻击的低秩防御方法
Zhu, Jiaxuan, Huang, Siyu
Abstract
Recently, adversarial attacks for diffusion models as well as their fine-tuning process have been developed rapidly. To prevent the abuse of these attack algorithms from affecting the practical application of diffusion models, it is critical to develop corresponding defensive strategies. In this work, we propose an efficient defensive strategy, named Low-Rank Defense (LoRD), to defend the adversarial attack on Latent Diffusion Models (LDMs). LoRD introduces the merging idea and a balance parameter, combined with the low-rank adaptation (LoRA) modules, to detect and defend the adversarial samples. Based on LoRD, we build up a defense pipeline that applies the learned LoRD modules to help diffusion models defend against attack algorithms. Our method ensures that the LDM fine-tuned on both adversarial and clean samples can still generate high-quality images. To demonstrate the effectiveness of our approach, we conduct extensive experiments on facial and landscape images, and our method shows significantly better defense performance compared to the baseline methods.
Chinese Translation
近年来,针对扩散模型及其微调过程的对抗攻击迅速发展。为了防止这些攻击算法的滥用影响扩散模型的实际应用,开发相应的防御策略至关重要。在本研究中,我们提出了一种高效的防御策略,称为低秩防御(Low-Rank Defense, LoRD),以抵御对潜在扩散模型(Latent Diffusion Models, LDMs)的对抗攻击。LoRD引入了合并思想和一个平衡参数,并结合低秩适应(Low-Rank Adaptation, LoRA)模块,以检测和防御对抗样本。基于LoRD,我们建立了一个防御流程,应用学习到的LoRD模块来帮助扩散模型抵御攻击算法。我们的方法确保在对抗样本和干净样本上微调的LDM仍然能够生成高质量的图像。为了验证我们方法的有效性,我们在面部和风景图像上进行了广泛的实验,结果表明我们的方法在防御性能上显著优于基线方法。
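The "merging idea" with a balance parameter described above can be sketched in terms of standard LoRA weight merging. The placement of the balance coefficient is an assumption (the abstract does not give the exact formula); balance = 1 reduces to ordinary LoRA merging:

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, balance: float) -> np.ndarray:
    """Merge a LoRA update into base weights with a balance coefficient:
        W' = W + balance * (alpha / r) * (B @ A)
    W: (d, k) base weights; A: (r, k); B: (d, r); r is the LoRA rank."""
    r = A.shape[0]
    return W + balance * (alpha / r) * (B @ A)
```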
cs.CV / 15 / 2602.10326

Flow Matching with Uncertainty Quantification and Guidance

带有不确定性量化和引导的流匹配
Han, Juyeop, Beyer, Lukas Lao, Karaman, Sertac
Abstract
Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.
Chinese Translation
尽管基于采样的生成模型如流匹配取得了显著成功,但它们仍然可能生成不一致或质量下降的样本。为了评估样本的可靠性并生成更高质量的输出,我们提出了不确定性感知流匹配(UA-Flow),这是流匹配的一个轻量级扩展,能够预测速度场及其异方差不确定性。UA-Flow通过在流动动态中传播速度不确定性来估计每个样本的不确定性。这些不确定性估计作为单个样本的可靠性信号,我们进一步利用它们通过不确定性感知分类器引导和无分类器引导来引导生成。图像生成实验表明,UA-Flow生成的不确定性信号与样本保真度的相关性高于基线方法,并且不确定性引导的采样进一步提高了生成质量。
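A sketch of the heteroscedastic objective implied above: the network predicts a velocity and a per-dimension log-variance, and is trained with the Gaussian negative log-likelihood of the target velocity. The exact parameterization in UA-Flow is an assumption here:

```python
import numpy as np

def heteroscedastic_velocity_nll(v_pred, log_var, v_target):
    """Gaussian NLL of the flow-matching target velocity (constants dropped):
        0.5 * exp(-log_var) * (v_target - v_pred)^2 + 0.5 * log_var
    A large predicted variance discounts the squared error but is itself
    penalized, so log_var learns to track per-sample uncertainty."""
    sq_err = (np.asarray(v_target) - np.asarray(v_pred)) ** 2
    return float(np.mean(0.5 * np.exp(-log_var) * sq_err + 0.5 * log_var))
```

Raising log_var only pays off where the squared error is large, which is what makes the learned variance a per-sample reliability signal.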
cs.CV / 16 / 2602.10343

Conditional Uncertainty-Aware Political Deepfake Detection with Stochastic Convolutional Neural Networks

基于随机卷积神经网络的条件不确定性感知政治深度伪造检测
Gardoş, Rafael-Petruţ
Abstract
Recent advances in generative image models have enabled the creation of highly realistic political deepfakes, posing risks to information integrity, public trust, and democratic processes. While automated deepfake detectors are increasingly deployed in moderation and investigative pipelines, most existing systems provide only point predictions and fail to indicate when outputs are unreliable, an operationally critical limitation in high-stakes political contexts. This work investigates conditional, uncertainty-aware political deepfake detection using stochastic convolutional neural networks within an empirical, decision-oriented reliability framework. Rather than treating uncertainty as a purely Bayesian construct, this work evaluates it through observable criteria, including calibration quality, proper scoring rules, and its alignment with prediction errors under both global and confidence-conditioned analyses. A politically focused binary image dataset is constructed via deterministic metadata filtering from a large public real-synthetic corpus. Two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) are fully fine-tuned for classification. Deterministic inference is compared with single-pass stochastic prediction, Monte Carlo dropout with multiple forward passes, temperature scaling, and ensemble-based uncertainty surrogates. Evaluation reports ROC-AUC, thresholded confusion matrices, calibration metrics, and generator-disjoint out-of-distribution performance. Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. A systematic confidence-band analysis further clarifies when uncertainty provides operational value beyond predicted confidence, delineating both the benefits and limitations of uncertainty-aware deepfake detection in political settings.
Chinese Translation
最近生成图像模型的进展使得高度逼真的政治深度伪造的创建成为可能,这对信息完整性、公众信任和民主过程构成了风险。尽管自动化的深度伪造检测器在内容审核和调查流程中越来越多地被部署,但现有大多数系统仅提供点预测,未能指示输出何时不可靠,这在高风险的政治环境中是一个操作上的关键限制。本研究探讨了使用随机卷积神经网络在经验性、决策导向的可靠性框架内进行条件不确定性感知的政治深度伪造检测。我们并非将不确定性视为纯粹的贝叶斯构造,而是通过可观察的标准进行评估,包括校准质量、适当评分规则,以及其在全局和置信度条件分析下与预测误差的对齐。通过对大型公共真实-合成语料库进行确定性元数据过滤,构建了一个以政治为中心的二元图像数据集。对两个预训练的CNN主干网络(ResNet-18和EfficientNet-B4)进行了全面的分类微调。将确定性推断与单次随机预测、蒙特卡洛丢弃法(Monte Carlo dropout)与多次前向推断、温度缩放和基于集成的不确定性替代品进行了比较。评估报告了ROC-AUC、阈值混淆矩阵、校准指标和生成器不重叠的分布外性能。结果表明,经过校准的概率输出和不确定性估计能够实现风险感知的审核政策。系统的置信区间分析进一步阐明了何时不确定性在超出预测置信度的情况下提供操作价值,划定了在政治环境中不确定性感知深度伪造检测的优势和局限性。
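Two of the uncertainty baselines named in the abstract, temperature scaling and Monte Carlo dropout, are simple enough to sketch directly. The logits below are made-up two-class (real vs. fake) scores for illustration, not outputs of the paper's models:

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temperature_scale(logits, T):
    # Temperature scaling: divide logits by a scalar T fitted on a
    # held-out validation set; T > 1 softens overconfident predictions.
    return softmax(logits / T)

def mc_dropout_uncertainty(stochastic_logits):
    # stochastic_logits: (num_passes, num_classes) array collected from
    # repeated forward passes with dropout left active at inference.
    probs = softmax(stochastic_logits, axis=-1)
    mean_prob = probs.mean(axis=0)
    # Predictive entropy as a simple total-uncertainty score.
    entropy = -(mean_prob * np.log(mean_prob + 1e-12)).sum()
    return mean_prob, entropy

logits = np.array([4.0, 0.5])            # hypothetical overconfident score
print(temperature_scale(logits, T=2.0))  # softened probabilities
passes = np.array([[4.0, 0.5], [1.0, 0.8], [3.0, 0.2]])
print(mc_dropout_uncertainty(passes))
```

High predictive entropy across stochastic passes is the signal a risk-aware moderation policy would route to human review.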
cs.CV / 17 / 2602.10344

Monte Carlo Maximum Likelihood Reconstruction for Digital Holography with Speckle

带有散斑的数字全息的蒙特卡洛最大似然重建
Chen, Xi, Maleki, Arian, Jalali, Shirin
Abstract
In coherent imaging, speckle is statistically modeled as multiplicative noise, posing a fundamental challenge for image reconstruction. While maximum likelihood estimation (MLE) provides a principled framework for speckle mitigation, its application to coherent imaging systems such as digital holography with finite apertures is hindered by the prohibitive cost of high-dimensional matrix inversion, especially at high resolutions. This computational burden has prevented the use of MLE-based reconstruction with physically accurate aperture modeling. In this work, we propose a randomized linear algebra approach that enables scalable MLE optimization without explicit matrix inversions in gradient computation. By exploiting the structural properties of the sensing matrix and using the conjugate gradient method for likelihood gradient evaluation, the proposed algorithm supports accurate aperture modeling without the simplifying assumptions commonly imposed for tractability. We term the resulting method projected gradient descent with Monte Carlo estimation (PGD-MC). The proposed PGD-MC framework (i) demonstrates robustness to diverse and physically accurate aperture models, (ii) achieves substantial improvements in reconstruction quality and computational efficiency, and (iii) scales effectively to high-resolution digital holography. Extensive experiments incorporating three representative denoisers as regularization show that PGD-MC provides a flexible and effective MLE-based reconstruction framework for digital holography with finite apertures, consistently outperforming prior Plug-and-Play model-based iterative reconstruction methods in both accuracy and speed. Our code is available at: https://github.com/Computational-Imaging-RU/MC_Maximum_Likelihood_Digital_Holography_Speckle.
Chinese Translation
在相干成像中,散斑被统计建模为乘法噪声,这对图像重建构成了根本挑战。尽管最大似然估计(MLE)为散斑抑制提供了一个原则性框架,但其在有限孔径的数字全息等相干成像系统中的应用受到高维矩阵求逆的高昂成本的限制,尤其是在高分辨率下。这一计算负担阻碍了基于MLE的重建与物理准确的孔径建模的结合。在本研究中,我们提出了一种随机线性代数方法,使得在梯度计算中无需显式矩阵求逆即可实现可扩展的MLE优化。通过利用传感矩阵的结构特性,并使用共轭梯度法进行似然梯度评估,所提出的算法支持准确的孔径建模,而无需常见的简化假设以便于处理。我们将该方法称为带有蒙特卡洛估计的投影梯度下降(PGD-MC)。所提出的PGD-MC框架(i)对多样且物理准确的孔径模型表现出鲁棒性,(ii) 在重建质量和计算效率上实现了显著提升,(iii) 有效扩展到高分辨率数字全息。大量实验结合三种代表性的去噪器作为正则化表明,PGD-MC为有限孔径的数字全息提供了一个灵活且有效的基于MLE的重建框架,在准确性和速度上始终优于先前的插拔式模型基础迭代重建方法。我们的代码可在以下链接获取:https://github.com/Computational-Imaging-RU/MC_Maximum_Likelihood_Digital_Holography_Speckle.
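The matrix-inversion-free gradient evaluation rests on the standard conjugate gradient method, which solves a symmetric positive-definite system A x = b using only matrix-vector products. A minimal sketch on a toy system (standing in for the much larger holography operator):

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only a
    matrix-vector product `matvec`; no explicit inverse is formed."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy 2x2 SPD system for illustration only.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
print(x)  # satisfies A @ x ~= b
```

In the paper's setting `matvec` would apply the sensing operator and its adjoint, which is what makes accurate aperture modeling tractable at scale.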
cs.CV / 18 / 2602.10364

Comp2Comp: Open-Source Software with FDA-Cleared Artificial Intelligence Algorithms for Computed Tomography Image Analysis

Comp2Comp:具有FDA批准的人工智能算法的开源软件,用于计算机断层扫描图像分析
Rao, Adrit, Jensen, Malte, Fisher, Andrea T., Blankemeier, Louis, Berens, Pauline, Fereydooni, Arash, Lirette, Seth, Alkan, Eren, Kitamura, Felipe C., Chaves, Juan M. Zambrano, Reis, Eduardo, Desai, Arjun, Willis, Marc H., Hom, Jason, Johnston, Andrew, Lenchik, Leon, Boutin, Robert D., Farina, Eduardo M. J. M., Serpa, Augusto S., Takahashi, Marcelo S., Perchik, Jordan, Rothenberg, Steven A., Schroeder, Jamie L., Filice, Ross, Bittencourt, Leonardo K., Trivedi, Hari, van Assen, Marly, Mongan, John, Kallianos, Kimberly, Aalami, Oliver, Chaudhari, Akshay S.
Abstract
Artificial intelligence allows automatic extraction of imaging biomarkers from already-acquired radiologic images. This paradigm of opportunistic imaging adds value to medical imaging without additional imaging costs or patient radiation exposure. However, many open-source image analysis solutions lack rigorous validation while commercial solutions lack transparency, leading to unexpected failures when deployed. Here, we report the development and validation of two of the first fully open-sourced, FDA-510(k)-cleared deep learning pipelines to mitigate both challenges: Abdominal Aortic Quantification (AAQ) and Bone Mineral Density (BMD) estimation are both offered within the Comp2Comp package for opportunistic analysis of computed tomography scans. AAQ segments the abdominal aorta to assess aneurysm size; BMD segments vertebral bodies to estimate trabecular bone density and osteoporosis risk. AAQ-derived maximal aortic diameters were compared against radiologist ground-truth measurements on 258 patient scans enriched for abdominal aortic aneurysms from four external institutions. BMD binary classifications (low vs. normal bone density) were compared against concurrent DXA scan ground truths obtained on 371 patient scans from four external institutions. AAQ had an overall mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm). BMD had a sensitivity of 81.0% (95% CI 74.0-86.8%) and specificity of 78.4% (95% CI 72.3-83.7%). Comp2Comp AAQ and BMD demonstrated sufficient accuracy for clinical use. Open-sourcing these algorithms improves transparency of typically opaque FDA clearance processes, allows hospitals to test the algorithms before cumbersome clinical pilots, and provides researchers with best-in-class methods.
Chinese Translation
人工智能允许从已获取的放射影像中自动提取影像生物标志物。这种机会性影像的范式为医学影像增加了价值,而无需额外的影像成本或患者辐射暴露。然而,许多开源图像分析解决方案缺乏严格的验证,而商业解决方案则缺乏透明度,导致在部署时出现意外失败。在此,我们报告了两个首个完全开源、获得FDA 510(k)批准的深度学习管道的开发和验证,以应对这两个挑战:腹主动脉定量(Abdominal Aortic Quantification, AAQ)和骨密度(Bone Mineral Density, BMD)估计均在Comp2Comp包中提供,用于计算机断层扫描的机会性分析。AAQ对腹主动脉进行分割以评估动脉瘤大小;BMD对椎体进行分割以估计松质骨密度和骨质疏松风险。AAQ得出的最大主动脉直径与来自四个外部机构的258个患者扫描中的放射科医师真实测量值进行了比较,这些扫描中富含腹主动脉瘤。BMD的二元分类(低骨密度与正常骨密度)与来自四个外部机构的371个患者扫描中的同时DXA扫描真实值进行了比较。AAQ的整体平均绝对误差为1.57毫米(95%置信区间1.38-1.80毫米)。BMD的灵敏度为81.0%(95%置信区间74.0-86.8%),特异性为78.4%(95%置信区间72.3-83.7%)。Comp2Comp的AAQ和BMD展示了足够的临床使用准确性。开源这些算法提高了通常不透明的FDA批准过程的透明度,使医院能够在繁琐的临床试点之前测试这些算法,并为研究人员提供最佳实践方法。
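The reported sensitivity and specificity are plain confusion-matrix ratios. The counts below are hypothetical and chosen only to illustrate the computation, not the study's actual data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity (recall on positives) and specificity (recall on
    negatives) from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true low-BMD scans flagged
    specificity = tn / (tn + fp)   # fraction of normal-BMD scans passed
    return sensitivity, specificity

# Hypothetical counts: 120 low-BMD scans (97 flagged), 180 normal (141 passed).
sens, spec = sensitivity_specificity(tp=97, fn=23, tn=141, fp=39)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
```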
cs.CV / 19 / 2602.10425

HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

HII-DPO:通过准确的诱发幻觉的反事实图像消除幻觉
Yang, Yilin, Guo, Zhenghui, Wang, Yuke, Gnawali, Omprakash, Di, Sheng, Zhang, Chengming
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.
Chinese Translation
大型视觉语言模型(VLMs)在多种多模态任务中取得了显著成功,但仍然容易受到根植于固有语言偏见的幻觉影响。尽管近期取得了一些进展,现有的幻觉缓解方法往往忽视了由语言偏见驱动的潜在幻觉模式。在本研究中,我们设计了一种新颖的流程,以准确合成诱发幻觉图像(HIIs)。通过使用合成的HIIs,我们揭示了一种一致的场景条件幻觉模式:即使在移除视觉证据的情况下,模型仍倾向于提及与场景高度典型的物体。为了量化VLMs对这种幻觉模式的敏感性,我们建立了被遮蔽物体幻觉(MOH)基准,以严格评估现有的最先进对齐框架。最后,我们利用HIIs构建高质量的偏好数据集,以实现细粒度对齐。实验结果表明,我们的方法有效地减轻了幻觉,同时保留了模型的整体能力。具体而言,我们的方法在标准幻觉基准上实现了高达38%的改进,超越了当前的最先进水平。
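The "DPO" in HII-DPO refers to Direct Preference Optimization. A per-pair sketch of the standard DPO loss follows; the scalar log-probabilities are made up, and this is the generic objective rather than the paper's exact fine-grained variant:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective for one preference pair: increase the
    policy's margin (relative to a frozen reference model) for the
    preferred, non-hallucinated response over the rejected one."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid

# Hypothetical sequence log-probs under the policy and the reference.
print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

In the paper's pipeline, the rejected responses would be captions produced on Hallucination-Inducing Images, which is what targets the scene-conditioned bias.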
cs.CV / 20 / 2602.10491

Towards Remote Sensing Change Detection with Neural Memory

基于神经记忆的遥感变化检测研究
Yang, Zhenyu, Pei, Gensheng, Yao, Yazhou, Zhou, Tianfei, Ding, Lizhong, Shen, Fumin
Abstract
Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, the Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining 84.36% IoU and 91.52% F1-score on LEVIR-CD, while remaining computationally competitive.
Chinese Translation
遥感变化检测对于环境监测、城市规划及相关应用至关重要。然而,当前的方法往往难以在保持计算效率的同时捕捉长程依赖关系。尽管 Transformers 能有效建模全局上下文,但其二次复杂度带来了可扩展性挑战,而现有的线性注意力方法常常无法捕捉复杂的时空关系。受到 Titans 在语言任务中取得的成功启发,我们提出了 ChangeTitans,一个基于 Titans 的遥感变化检测框架。具体而言,我们提出了 VTitans,这是第一个将神经记忆与分段局部注意力相结合的 Titans 基础视觉骨干网络,从而在减轻计算负担的同时捕捉长程依赖关系。接下来,我们提出了层次化的 VTitans-Adapter,以在不同网络层之间细化多尺度特征。最后,我们引入了 TS-CBAM,一个利用跨时间注意力的双流融合模块,以抑制伪变化并增强检测准确性。在四个基准数据集(LEVIR-CD、WHU-CD、LEVIR-CD+ 和 SYSU-CD)上的实验评估表明,ChangeTitans 达到了最先进的结果,在 LEVIR-CD 上获得了 84.36% 的 IoU 和 91.52% 的 F1-score,同时保持了计算竞争力。
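The headline IoU and F1 numbers are standard binary-mask metrics; for a single positive class they are related by F1 = 2·IoU/(1+IoU). A minimal sketch:

```python
import numpy as np

def iou_f1(pred, gt):
    """Per-image IoU and F1 for binary change masks (boolean arrays)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

# Tiny made-up 2x2 masks for illustration.
pred = np.array([[True, True], [False, False]])
gt = np.array([[True, False], [True, False]])
print(iou_f1(pred, gt))  # one true positive, one FP, one FN
```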
cs.CV / 21 / 2602.10492

End-to-End LiDAR optimization for 3D point cloud registration

端到端激光雷达优化用于三维点云配准
Katyan, Siddhant, Gardner, Marc-André, Lalonde, Jean-François
Abstract
LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.
Chinese Translation
激光雷达(LiDAR)传感器是三维感知的关键模式,但它们通常是独立于下游任务(如点云配准)设计的。传统的配准方法在预先获取的数据集上进行,使用固定的激光雷达配置,这导致数据采集不够优化,并且在采样、噪声过滤和参数调优方面产生了显著的计算开销。在本研究中,我们提出了一种自适应激光雷达感知框架,该框架动态调整传感器参数,联合优化激光雷达采集和配准超参数。通过将配准反馈集成到感知循环中,我们的方法在点密度、噪声和稀疏性之间实现了最佳平衡,从而提高了配准的准确性和效率。在CARLA仿真中的评估表明,我们的方法优于固定参数基线,同时保持了良好的泛化能力,突显了自适应激光雷达在自主感知和机器人应用中的潜力。
cs.CV / 22 / 2602.10495

Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings

多分辨率哈希编码的空间核特征及优化
Dai, Tianxiang, Fan, Jonathan
Abstract
Multi-Resolution Hash Encoding (MHE), the foundational technique behind Instant Neural Graphics Primitives, provides a powerful parameterization for neural fields. However, its spatial behavior lacks rigorous understanding from a physical systems perspective, leading to reliance on heuristics for hyperparameter selection. This work introduces a novel analytical approach that characterizes MHE by examining its Point Spread Function (PSF), which is analogous to the Green's function of the system. This methodology enables a quantification of the encoding's spatial resolution and fidelity. We derive a closed-form approximation for the collision-free PSF, uncovering inherent grid-induced anisotropy and a logarithmic spatial profile. We establish that the idealized spatial bandwidth, specifically the Full Width at Half Maximum (FWHM), is determined by the average resolution, $N_{\text{avg}}$. This leads to a counterintuitive finding: the effective resolution of the model is governed by the broadened empirical FWHM (and therefore $N_{\text{avg}}$), rather than the finest resolution $N_{\max}$, a broadening effect we demonstrate arises from optimization dynamics. Furthermore, we analyze the impact of finite hash capacity, demonstrating how collisions introduce speckle noise and degrade the Signal-to-Noise Ratio (SNR). Leveraging these theoretical insights, we propose Rotated MHE (R-MHE), an architecture that applies distinct rotations to the input coordinates at each resolution level. R-MHE mitigates anisotropy while maintaining the efficiency and parameter count of the original MHE. This study establishes a methodology based on physical principles that moves beyond heuristics to characterize and optimize MHE.
Chinese Translation
多分辨率哈希编码(Multi-Resolution Hash Encoding, MHE)是即时神经图形原语的基础技术,为神经场提供了强大的参数化。然而,从物理系统的角度来看,其空间行为缺乏严格的理解,导致在超参数选择上依赖启发式方法。本研究引入了一种新颖的分析方法,通过考察MHE的点扩散函数(Point Spread Function, PSF)来表征MHE,该函数类似于系统的格林函数。这一方法使得能够量化编码的空间分辨率和保真度。我们推导出无碰撞PSF的闭式近似,揭示了固有的网格诱导各向异性和对数空间轮廓。我们确定了理想化的空间带宽,特别是半高全宽(Full Width at Half Maximum, FWHM),由平均分辨率$N_{\text{avg}}$决定。这导致一个反直觉的发现:模型的有效分辨率由展宽后的经验FWHM(因此由$N_{\text{avg}}$)决定,而非最细分辨率$N_{\text{max}}$;我们证明这种展宽效应源于优化动态。此外,我们分析了有限哈希容量的影响,展示了碰撞如何引入斑点噪声并降低信噪比(Signal-to-Noise Ratio, SNR)。利用这些理论见解,我们提出了旋转多分辨率哈希编码(Rotated MHE, R-MHE),该架构在每个分辨率级别对输入坐标应用不同的旋转。R-MHE在保持原始MHE的效率和参数数量的同时,减轻了各向异性。本研究建立了一种基于物理原理的方法,超越启发式方法,以表征和优化MHE。
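The FWHM used here to quantify spatial bandwidth can be measured numerically from any sampled PSF profile. The sketch below uses a unit-sigma Gaussian, whose analytic FWHM is 2*sqrt(2 ln 2) ~= 2.3548:

```python
import numpy as np

def fwhm(x, y):
    """Full Width at Half Maximum of a sampled single-peaked profile,
    using linear interpolation at the two half-max crossings."""
    half = y.max() / 2.0
    above = np.where(y >= half)[0]
    i0, i1 = above[0], above[-1]

    def cross(a, b):
        # Linear interpolation of the x where y crosses `half` between samples.
        return x[a] + (half - y[a]) * (x[b] - x[a]) / (y[b] - y[a])

    left = x[i0] if i0 == 0 else cross(i0 - 1, i0)
    right = x[i1] if i1 == len(y) - 1 else cross(i1, i1 + 1)
    return right - left

x = np.linspace(-5.0, 5.0, 2001)
y = np.exp(-x**2 / 2)   # unit-sigma Gaussian standing in for a PSF slice
print(fwhm(x, y))       # close to 2*sqrt(2*ln 2)
```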
cs.CV / 23 / 2602.10500

The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

垃圾数据集 (GD):用于自动化垃圾分类的多类别图像基准
Kunwar, Suman
Abstract
This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It is a diverse dataset covering 10 common household waste categories: metal, glass, biological, paper, battery, trash, cardboard, shoes, clothes, and plastic. The dataset comprises 13,348 labeled images collected through multiple methods, including the DWaste mobile app and curated web sources. Methods included rigorous validation through checksums and outlier detection, analysis of class imbalance and visual separability via PCA/t-SNE, and assessment of background complexity using entropy and saliency measures. The dataset was benchmarked using state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101) evaluated on performance metrics and operational carbon emissions. Experiment results indicate EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost. Analysis revealed inherent dataset characteristics including class imbalance, a skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations that require consideration. The main conclusion is that GD provides a valuable, real-world benchmark for waste classification research while highlighting important challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment. The dataset is publicly released to support further research in environmental sustainability applications.
Chinese Translation
本研究介绍了垃圾数据集 (GD),这是一个公开可用的图像数据集,旨在通过机器学习和计算机视觉推动自动化垃圾分类。该数据集涵盖了10种常见的家庭垃圾类别:金属、玻璃、生物、纸张、电池、垃圾、纸板、鞋子、衣物和塑料。数据集包含13,348张标注图像,这些图像通过多种方法收集,包括DWaste移动应用程序和精心策划的网络来源。方法包括通过校验和和异常值检测进行严格验证,通过主成分分析 (PCA)/t-SNE分析类别不平衡和视觉可分离性,以及使用熵和显著性度量评估背景复杂性。该数据集使用最先进的深度学习模型(EfficientNetV2M、EfficientNetV2S、MobileNet、ResNet50、ResNet101)进行了基准测试,评估了性能指标和运营碳排放。实验结果表明,EfficientNetV2S以96.19%的准确率和0.96的F1-score达到了最高性能,尽管其碳成本适中。分析揭示了数据集的固有特征,包括类别不平衡、向高异常类别(塑料、纸板、纸张)的偏斜,以及需要考虑的亮度变化。主要结论是,GD为垃圾分类研究提供了一个有价值的现实世界基准,同时突出了在实际部署中必须解决的重要挑战,如类别不平衡、背景复杂性和模型选择中的环境权衡。该数据集已公开发布,以支持在环境可持续性应用方面的进一步研究。
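The class-imbalance analysis mentioned in the abstract can be summarized with two common statistics, the imbalance ratio (largest class count over smallest) and normalized label entropy. The label list below is hypothetical:

```python
import numpy as np
from collections import Counter

def imbalance_stats(labels):
    """Imbalance ratio and normalized label entropy for a label list.
    Normalized entropy is 1.0 for a perfectly balanced dataset."""
    counts = np.array(sorted(Counter(labels).values(), reverse=True), dtype=float)
    ratio = counts[0] / counts[-1]
    p = counts / counts.sum()
    balance = -(p * np.log(p)).sum() / np.log(len(counts))
    return ratio, balance

# Made-up label distribution for illustration, not GD's actual counts.
labels = ["plastic"] * 500 + ["glass"] * 450 + ["battery"] * 50
print(imbalance_stats(labels))
```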
cs.CV / 24 / 2602.10508

Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation

Med-SegLens:用于可解释医学图像分割的潜在级别模型差异分析
Ahmed, Salma J., Mohammed, Emad A., Bidgoli, Azam Asilian
Abstract
Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
Chinese Translation
现代分割模型在预测性能上表现出色,但仍然在很大程度上不透明,这限制了我们诊断失败、理解数据集偏移或以原则性方式进行干预的能力。我们提出了Med-SegLens,一种模型差异分析框架,通过使用在SegFormer和U-Net上训练的稀疏自编码器,将分割模型的激活分解为可解释的潜在特征。通过在健康、成人、儿童和撒哈拉以南非洲胶质瘤队列中进行跨架构和跨数据集的潜在对齐,我们识别出一组稳定的共享表示,而数据集偏移则是由于对特定人群潜在特征的差异依赖所驱动。我们展示了这些潜在特征作为分割失败的因果瓶颈,并且针对潜在级别的干预可以在不重新训练的情况下纠正错误并改善跨数据集的适应性,在70%的失败案例中恢复性能,并将Dice分数从39.4%提高到74.2%。我们的结果表明,潜在级别模型差异分析为诊断分割模型中的失败和减轻数据集偏移提供了一个实用且机制化的工具。
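The Dice score cited here (39.4% → 74.2%) is the standard overlap metric for segmentation masks:

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    """Dice similarity coefficient for binary masks: twice the overlap
    divided by the total foreground area of both masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

# Tiny made-up masks for illustration.
pred = np.array([1, 1, 0, 0], dtype=bool)
gt = np.array([1, 0, 1, 0], dtype=bool)
print(dice(pred, gt))  # one overlapping voxel out of two per mask
```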
cs.CV / 25 / 2602.10513

1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

1%>100%:具有复杂线性投影优化的高效视觉适配器
Yin, Dongshuo, Yang, Xue, Fan, Deng-Ping, Hu, Shi-Min
Abstract
Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on https://github.com/DongshuoYin/CoLin.
Chinese Translation
部署视觉基础模型通常依赖于高效的适配策略,而传统的完全微调则面临高昂的成本和低效率的问题。虽然增量微调(delta-tuning)在适配过程中已被证明能够有效提升大语言模型(LLMs)的性能和效率,但其优势无法直接转移到视觉基础模型的微调流程中。为了推动视觉任务的适配效率,我们提出了一种具有复杂线性投影优化(Complex Linear Projection Optimization, CoLin)的适配器。在架构方面,我们设计了一种新颖的低秩复杂适配器,仅向主干网络引入约1%的参数。在效率方面,我们理论证明低秩复合矩阵在训练过程中存在严重的收敛问题,并通过量身定制的损失函数来解决这一挑战。在目标检测、分割、图像分类和旋转物体检测(遥感场景)等任务上的大量实验表明,CoLin首次以仅1%的参数超越了完全微调和经典增量微调方法,为视觉基础模型的部署提供了一种新颖而高效的解决方案。我们在 https://github.com/DongshuoYin/CoLin 上发布了代码。
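The "about 1% parameters" figure is easy to sanity-check with a real-valued low-rank down/up projection as a simplified stand-in (CoLin's adapter is complex-valued, which would roughly double the count). The ViT-B-like dimensions below are hypothetical, not the paper's actual configuration:

```python
def adapter_fraction(d_model, rank, n_layers, backbone_params):
    """Parameters a rank-r down/up adapter adds per layer (d*r + r*d),
    as a fraction of the frozen backbone's parameter count."""
    added = n_layers * 2 * d_model * rank
    return added / backbone_params

# Hypothetical ViT-B-like numbers: 12 layers, width 768, ~86M backbone params.
frac = adapter_fraction(d_model=768, rank=8, n_layers=12, backbone_params=86_000_000)
print(f"{frac:.2%} of backbone parameters")  # well under 1%
```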
cs.CV / 26 / 2602.10516

3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

3DXTalker:在表现力丰富的3D会话化身中统一身份、口型同步、情感和空间动态
Wang, Zhongju, Sun, Zhenhong, Wang, Beier, Wang, Yifu, Dong, Daoyi, Mo, Huadong, Li, Hongdong
Abstract
Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics, collectively defining a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, an expressive 3D talking avatar framework built on data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. 3DXTalker enables scalable identity modeling via a 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. Then, we introduce frame-wise amplitude and emotional cues beyond standard speech embeddings, ensuring superior lip synchronization and nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker also enables natural head-pose motion generation while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework and achieves superior performance in 3D talking avatar generation.
Chinese Translation
基于音频驱动的3D会话化身生成在虚拟沟通、数字人类和互动媒体中变得越来越重要,其中化身必须保持身份、与语音同步口型运动、表达情感并展示逼真的空间动态,这共同定义了表现力的更广泛目标。然而,由于训练数据不足、受限的主体身份、狭窄的音频表现和有限的显式可控性,实现这一目标仍然具有挑战性。在本文中,我们提出了3DXTalker,一种通过数据策划的身份建模、丰富的音频表现和空间动态可控性来实现的表现力丰富的3D会话化身。3DXTalker通过2D到3D的数据策划管道和解耦表示实现可扩展的身份建模,缓解数据稀缺问题并改善身份泛化。然后,我们引入了超越标准语音嵌入的逐帧幅度和情感线索,确保优越的口型同步和细腻的表达调节。这些线索通过基于流匹配的变换器统一,以实现连贯的面部动态。此外,3DXTalker还能够生成自然的头部姿态运动,同时支持通过基于提示的条件进行风格化控制。大量实验表明,3DXTalker在统一框架内整合了口型同步、情感表达和头部姿态动态,在3D会话化身生成中实现了卓越的性能。
cs.CV / 27 / 2602.10518

MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

MapVerse:多样化真实世界地图上的地理空间问答基准
Bhat, Sharat, Khandelwal, Harshita, Kataria, Tushar, Gupta, Vivek
Abstract
Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.
Chinese Translation
地图是结构化和上下文知识的强大载体,涵盖了地理、人口统计、基础设施和环境模式。对这些知识进行推理要求模型整合空间关系、视觉线索、现实世界背景和领域特定的专业知识,而当前的大型语言模型(LLMs)和视觉语言模型(VLMs)在这一点上仍然难以保持一致。然而,用于基准测试VLMs在基于地图推理方面的数据集范围狭窄,局限于特定领域,并且严重依赖于人工生成的内容(来自LLMs或基于管道的方法的输出),在评估真实的地理空间推理方面提供的深度有限。为了解决这一问题,我们提出了MapVerse,一个基于真实世界地图的大规模基准。它包含了11,837对人类创作的问题-答案对,涵盖1,025张地图,跨越十个多样化的地图类别和每个类别的多个问题类型。该数据集为评估地图阅读、解释和多模态推理提供了丰富的环境。我们对十个最先进的模型进行了基准测试,以建立基线并量化推理差距。除了整体性能外,我们还进行了细致的分类分析,以评估模型在多个维度上的推理能力,并调查影响推理结果的视觉因素。我们的研究发现,尽管当前的VLMs在分类风格任务上表现竞争力,但无论是开源还是闭源模型在需要复杂空间推理的高级任务上均表现不足。
cs.CV / 28 / 2602.10546

RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images

RealHD:用于鲁棒检测最先进的人工智能生成图像的高质量数据集
Yu, Hanzhe, Ye, Yun, Rong, Jintao, Xuan, Qi, Ma, Chen
Abstract
The rapid advancement of generative AI has raised concerns about the authenticity of digital images, as highly realistic fake images can now be generated at low cost, potentially increasing societal risks. In response, several datasets have been established to train detection models aimed at distinguishing AI-generated images from real ones. However, existing datasets suffer from limited generalization, low image quality, overly simple prompts, and insufficient image diversity. To address these limitations, we propose a high-quality, large-scale dataset comprising over 730,000 images across multiple categories, including both real and AI-generated images. The generated images are synthesized via state-of-the-art methods, including text-to-image generation (guided by over 10,000 carefully designed prompts), image inpainting, image refinement, and face swapping. Each generated image is annotated with its generation method and category. Inpainting images further include binary masks to indicate inpainted regions, providing rich metadata for analysis. Compared to existing datasets, detection models trained on our dataset demonstrate superior generalization capabilities. Our dataset not only serves as a strong benchmark for evaluating detection methods but also contributes to advancing the robustness of AI-generated image detection techniques. Building upon this, we propose a lightweight detection method based on image noise entropy, which transforms the original image into an entropy tensor of Non-Local Means (NLM) noise before classification. Extensive experiments demonstrate that models trained on our dataset achieve strong generalization, and our method delivers competitive performance, establishing a solid baseline for future research. The dataset and source code are publicly available at https://real-hd.github.io.
Chinese Translation
生成性人工智能的快速发展引发了对数字图像真实性的担忧,因为现在可以以低成本生成高度逼真的假图像,这可能增加社会风险。为此,已经建立了多个数据集,以训练检测模型,旨在区分人工智能生成的图像和真实图像。然而,现有数据集存在泛化能力有限、图像质量低、提示过于简单以及图像多样性不足等问题。为了解决这些局限性,我们提出了一个高质量的大规模数据集,包含超过730,000张图像,涵盖多个类别,包括真实图像和人工智能生成的图像。生成的图像是通过最先进的方法合成的,包括文本到图像生成(基于超过10,000个精心设计的提示)、图像修复、图像精细化和人脸替换。每个生成的图像都标注了其生成方法和类别。修复图像还包括二进制掩码,以指示修复区域,为分析提供丰富的元数据。与现有数据集相比,基于我们的数据集训练的检测模型表现出更强的泛化能力。我们的数据集不仅作为评估检测方法的强基准,还促进了人工智能生成图像检测技术的鲁棒性提升。在此基础上,我们提出了一种基于图像噪声熵的轻量级检测方法,该方法在分类之前将原始图像转换为非局部均值(Non-Local Means, NLM)噪声的熵张量。大量实验表明,在我们的数据集上训练的模型实现了强泛化,而我们的方法也展现了竞争力的性能,为未来研究奠定了坚实的基准。数据集和源代码已公开发布,网址为 https://real-hd.github.io。
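The proposed detector maps an image to the entropy of its estimated noise before classification. The sketch below substitutes a 3x3 mean filter for the paper's Non-Local Means denoiser, so it illustrates only the residual-plus-entropy structure, not the actual method:

```python
import numpy as np

def noise_entropy(img, bins=64):
    """Residual-noise entropy feature: subtract a smoothed image (a 3x3
    mean filter stands in here for the NLM denoiser used in the paper)
    and take the Shannon entropy of the residual's histogram."""
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Average of the 9 shifted windows = 3x3 box filter.
    smooth = sum(pad[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    noise = img - smooth
    hist, _ = np.histogram(noise, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
print(noise_entropy(rng.normal(size=(32, 32))))  # textured residual, high entropy
print(noise_entropy(np.ones((32, 32))))          # zero residual, zero entropy
```

The intuition is that real sensor noise and generator artifacts leave residuals with different entropy profiles, giving a lightweight feature for the classifier.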
cs.CV / 29 / 2602.10549

Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

通过文本指导增强弱监督多模态视频异常检测
Sun, Shengyang, Hua, Jiashen, Feng, Junyi, Gong, Xiaojin
Abstract
Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
Chinese Translation
弱监督多模态视频异常检测受到了广泛关注,但文本模态的潜力仍未得到充分挖掘。文本提供了明确的语义信息,可以增强异常特征的表征并减少误报。然而,由于通用语言模型无法捕捉特定于异常的细微差别以及相关描述的稀缺,提取有效的文本特征面临挑战。此外,多模态融合常常存在冗余和不平衡的问题。为了解决这些问题,我们提出了一种新颖的文本指导框架。首先,我们引入了一种基于上下文学习的多阶段文本增强机制,以生成高质量的异常文本样本,从而微调文本特征提取器。其次,我们设计了一个多尺度瓶颈Transformer融合模块,利用压缩的瓶颈令牌逐步整合跨模态的信息,从而减轻冗余和不平衡。我们在UCF-Crime和XD-Violence数据集上的实验表明,该方法达到了最先进的性能。
cs.CV / 30 / 2602.10551

C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

C^2ROPE:用于3D大型多模态模型推理的因果连续旋转位置编码
Ye, Guanting, Zhao, Qiyan, Yu, Wenhao, Zhang, Xiaofeng, Ji, Jianmin, Zhang, Yanyong, Yuen, Ka-Veng
Abstract
Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is available at https://github.com/ErikZ719/C2RoPE.
Chinese Translation
基于大型语言模型(LLMs)的3D大型多模态模型(LMMs)的最新进展已确立了3D视觉特征与LLM表示之间的对齐作为主导范式。然而,继承的旋转位置嵌入(RoPE)在多模态处理上引入了限制。具体而言,应用一维时间位置索引破坏了视觉特征在列维度上的连续性,导致空间局部性丧失。此外,RoPE遵循了时间上更接近的图像标记更具因果关系的先验,导致注意力分配的长期衰减,并使得模型在序列长度增加时逐渐忽视早期的视觉标记。为了解决这些问题,我们提出了C^2RoPE,一种改进的RoPE,明确建模视觉处理中的局部空间连续性和空间因果关系。C^2RoPE为视觉标记引入了一种时空连续位置嵌入机制。它首先将一维时间位置与基于笛卡尔坐标的空间坐标相结合,构建三元组混合位置索引,然后采用频率分配策略在三个索引组件之间编码时空位置信息。此外,我们引入了切比雪夫因果掩蔽,通过计算图像标记在二维空间中的切比雪夫距离来确定因果依赖关系。在包括3D场景推理和3D视觉问答在内的各种基准测试中的评估结果证明了C^2RoPE的有效性。代码可在 https://github.com/ErikZ719/C2RoPE 获取。
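The Chebyshev Causal Masking idea, restricting which image tokens may attend to each other by Chebyshev (L-infinity) distance on the 2D token grid, can be sketched directly. The `radius` threshold is an assumed free parameter here, since the abstract does not specify how the cutoff is chosen:

```python
import numpy as np

def chebyshev_mask(h, w, radius):
    """Attention mask over h*w image tokens: token q may attend to token k
    iff their Chebyshev distance on the 2D grid is at most `radius`."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (h*w, 2)
    # Pairwise L-infinity distance between token coordinates.
    d = np.abs(coords[:, None, :] - coords[None, :, :]).max(-1)
    return d <= radius

m = chebyshev_mask(4, 4, radius=1)
print(m.shape, m.sum())  # (16, 16) boolean mask and its number of allowed pairs
```

Unlike a purely 1D temporal prior, this keeps vertically adjacent tokens connected even though they are far apart in raster order.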
cs.CV / 31 / 2602.10575

MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

MetaphorStar:基于端到端视觉强化学习的图像隐喻理解与推理
Zhang, Chenhao, Niu, Yazhe, Li, Hongsheng
Abstract
Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice and Open-Style Questions, and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves the general understanding ability, especially the complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-sourced all model weights, datasets, and method code at https://metaphorstar.github.io.
Chinese Translation
图像中的隐喻理解仍然是当今人工智能系统面临的一个关键挑战。尽管多模态大型语言模型(MLLMs)在基本的视觉问答(VQA)任务中表现出色,但它们在理解视觉内容中蕴含的细微文化、情感和上下文含义方面始终存在困难。这一困难源于该任务对复杂多跳推理、文化背景和心智理论(Theory of Mind, ToM)能力的需求,而当前模型在这些方面存在不足。为填补这一空白,我们提出了MetaphorStar,这是首个用于图像隐喻任务的端到端视觉强化学习(RL)框架。我们的框架包括三个核心组件:细粒度数据集TFQ-Data、视觉RL方法TFQ-GRPO,以及结构良好的基准TFQ-Bench。我们完全开源的MetaphorStar系列,使用TFQ-GRPO在TFQ-Data上训练,显著提高了图像隐喻基准的平均性能,提升幅度达到82.6%。与20多个主流MLLMs相比,MetaphorStar-32B在多项选择题和开放式问题上达到了最先进的(SOTA)水平,在真伪问题上显著超越了顶尖的闭源模型Gemini-3.0-pro。关键的是,我们的实验表明,学习图像隐喻任务能够提升一般理解能力,尤其是复杂的视觉推理能力。我们进一步提供了模型参数扩展、训练数据扩展以及不同模型架构和训练策略影响的系统分析,展示了我们方法的广泛适用性。我们已在https://metaphorstar.github.io开源所有模型权重、数据集和方法代码。
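TFQ-GRPO builds on the GRPO family of RL algorithms; the core of standard GRPO, normalizing each sampled response's reward against its group's mean and spread, can be sketched as below (the TFQ-specific reward design is not modeled here):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score each sampled response's
    reward within its group, so no learned value baseline is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 rollouts for one prompt: two correct (1.0), two wrong (0.0)
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat their group's average receive positive advantage and are reinforced; the group itself serves as the baseline.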
cs.CV / 32 / 2602.10586

Enhancing Underwater Images via Adaptive Semantic-aware Codebook Learning

通过自适应语义感知代码本学习增强水下图像
Lin, Bosen, Gao, Feng, Yu, Yanwei, Dong, Junyu, Du, Qian
Abstract
Underwater Image Enhancement (UIE) is an ill-posed problem where natural clean references are not available, and the degradation levels vary significantly across semantic regions. Existing UIE methods treat images with a single global model and ignore the inconsistent degradation of different scene components. This oversight leads to significant color distortions and loss of fine details in heterogeneous underwater scenes, especially where degradation varies significantly across different image regions. Therefore, we propose SUCode (Semantic-aware Underwater Codebook Network), which achieves adaptive UIE from semantic-aware discrete codebook representation. Compared with one-shot codebook-based methods, SUCode exploits semantic-aware, pixel-level codebook representation tailored to heterogeneous underwater degradation. A three-stage training paradigm is employed to learn representations of raw underwater image features while avoiding pseudo ground-truth contamination. Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) jointly integrate channel and frequency cues for faithful color restoration and texture recovery. Extensive experiments on multiple benchmarks demonstrate that SUCode achieves state-of-the-art performance, outperforming recent UIE methods on both reference and no-reference metrics. The code will be made publicly available at https://github.com/oucailab/SUCode.
Chinese Translation
水下图像增强(UIE)是一个不适定问题,因为缺乏自然清晰的参考图像,并且不同语义区域的退化程度差异显著。现有的UIE方法采用单一全局模型处理图像,忽视了不同场景组件的退化不一致性。这一忽视导致了显著的颜色失真和异质水下场景中细节的丢失,尤其是在不同图像区域的退化程度差异较大的情况下。因此,我们提出了SUCode(语义感知水下代码本网络),该网络通过语义感知的离散代码本表示实现自适应UIE。与一次性代码本方法相比,SUCode利用针对异质水下退化量身定制的语义感知像素级代码本表示。我们采用三阶段训练范式来学习原始水下图像特征的表示,以避免伪真值(pseudo ground-truth)污染。门控通道注意模块(GCAM)和频率感知特征融合(FAFF)共同整合通道和频率线索,实现真实的颜色恢复和纹理重建。在多个基准测试上的大量实验表明,SUCode实现了最先进的性能,在参考和无参考指标上均优于近期的UIE方法。代码将公开发布于 https://github.com/oucailab/SUCode。
cs.CV / 33 / 2602.10592

Enhancing YOLOv11n for Reliable Child Detection in Noisy Surveillance Footage

增强 YOLOv11n 以在含噪监控视频中可靠检测儿童
Tran, Khanh Linh, Dang, Minh Nguyen, Trong, Thien Nguyen, Quoc, Hung Nguyen, Kieu, Linh Nguyen
Abstract
This paper presents a practical and lightweight solution for enhancing child detection in low-quality surveillance footage, a critical component in real-world missing child alert and daycare monitoring systems. Building upon the efficient YOLOv11n architecture, we propose a deployment-ready pipeline that improves detection under challenging conditions including occlusion, small object size, low resolution, motion blur, and poor lighting commonly found in existing CCTV infrastructures. Our approach introduces a domain-specific augmentation strategy that synthesizes realistic child placements using spatial perturbations such as partial visibility, truncation, and overlaps, combined with photometric degradations including lighting variation and noise. To improve recall of small and partially occluded instances, we integrate Slicing Aided Hyper Inference (SAHI) at inference time. All components are trained and evaluated on a filtered, child-only subset of the Roboflow Daycare dataset. Compared to the baseline YOLOv11n, our enhanced system achieves a mean Average Precision at 0.5 IoU (mAP@0.5) of 0.967 and a mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95) of 0.783, yielding absolute improvements of 0.7 percent and 2.3 percent, respectively, without architectural changes. Importantly, the entire pipeline maintains compatibility with low-power edge devices and supports real-time performance, making it particularly well suited for low-cost or resource-constrained industrial surveillance deployments. The example augmented dataset and the source code used to generate it are available at: https://github.com/html-ptit/Data-Augmentation-YOLOv11n-child-detection
Chinese Translation
本文提出了一种实用且轻量级的解决方案,以增强低质量监控视频中的儿童检测,这是现实世界中失踪儿童警报和托儿所监控系统的关键组成部分。在高效的 YOLOv11n 架构基础上,我们提出了一种即用型管道,旨在改善在包括遮挡、小物体尺寸、低分辨率、运动模糊和通常存在于现有 CCTV 基础设施中的光照不足等挑战性条件下的检测。我们的方法引入了一种特定领域的数据增强策略,通过空间扰动(如部分可见性、截断和重叠)合成逼真的儿童位置,并结合光度退化(包括光照变化和噪声)。为了提高小型和部分遮挡实例的召回率,我们在推理时集成了切片辅助超推理(Slicing Aided Hyper Inference, SAHI)。所有组件均在经过筛选的仅包含儿童的 Roboflow 托儿所数据集子集上进行训练和评估。与基线 YOLOv11n 相比,我们增强的系统在 0.5 IoU 下的平均精度(mAP@0.5)达到 0.967,在 IoU 阈值从 0.5 到 0.95 的平均精度(mAP@0.5:0.95)为 0.783,分别实现了 0.7% 和 2.3% 的绝对提升,而无需进行架构更改。重要的是,整个管道与低功耗边缘设备兼容,并支持实时性能,使其特别适合低成本或资源受限的工业监控部署。示例增强数据集及其生成源代码可在以下链接获取:https://github.com/html-ptit/Data-Augmentation-YOLOv11n-child-detection
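The SAHI step above runs detection on overlapping slices of the full frame and merges the results; the slicing itself reduces to enumerating tile boxes (the tile size and overlap ratio below are illustrative defaults, not the paper's settings):

```python
def slice_boxes(img_w, img_h, tile=256, overlap=0.25):
    """Overlapping tile coordinates (x0, y0, x1, y1) covering the image,
    in the spirit of SAHI's slicing step. Edge tiles are clipped to the
    image bounds so the whole frame is covered exactly once per stride."""
    step = max(1, int(tile * (1 - overlap)))
    boxes = []
    y0 = 0
    while True:
        x0 = 0
        while True:
            boxes.append((x0, y0, min(x0 + tile, img_w), min(y0 + tile, img_h)))
            if x0 + tile >= img_w:
                break
            x0 += step
        if y0 + tile >= img_h:
            break
        y0 += step
    return boxes

tiles = slice_boxes(512, 512, tile=256, overlap=0.25)  # 3 x 3 grid
```

Each tile is then fed to the detector at full resolution, which is what recovers small or partially occluded children that vanish when the whole frame is downscaled.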
cs.CV / 34 / 2602.10593

Fast Person Detection Using YOLOX With AI Accelerator For Train Station Safety

基于YOLOX与AI加速器的快速人员检测用于火车站安全
Achmadiah, Mas Nurul, Setyawan, Novendra, Bryantono, Achmad Arif, Sun, Chi-Chia, Kuo, Wen-Kai
Abstract
Recently, image processing has advanced rapidly and has been applied in many fields, including health, industry, and transportation. In the transportation sector, object detection is widely used to improve safety, for example in traffic monitoring and at passenger crossings in train stations. Accidents still occur in station crossing areas, for instance when passengers carelessly step over the yellow line, so additional safety technology is needed to reduce the number of accidents. This paper focuses on passenger detection at train stations using YOLOX and edge AI accelerator hardware; the performance of the Hailo-8 AI accelerator is compared with the Jetson Orin Nano. The experimental results show that the Hailo-8 AI hardware accelerator achieves higher accuracy than the Jetson Orin Nano (an improvement of over 12%) and lower latency (a reduction of 20 ms).
Chinese Translation
近年来,图像处理技术发展迅速,并广泛应用于健康、工业和交通等多个领域。在交通领域,物体检测被广泛用于提高安全性,例如交通安全监控和火车站乘客过道。车站的轨道交叉区域仍会发生事故,例如乘客越过黄线时疏忽大意,因此需要额外的安全技术来减少事故数量。本文重点研究在火车站使用YOLOX和边缘AI加速器硬件进行乘客检测的应用,并将该AI加速器的性能与Jetson Orin Nano进行比较。实验结果表明,Hailo-8 AI硬件加速器的准确性高于Jetson Orin Nano(提高超过12%),且延迟更低(减少20毫秒)。
cs.CV / 35 / 2602.10619

Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

通过感知和推理增强改善医学视觉强化微调
Yang, Guangjing, Yu, ZhangYuan, Qin, Ziyuan, Song, Xinyuan, Yi, Huahui, Kang, Qingbo, Gao, Jun, Li, Yiyue, Du, Chenlin, Lao, Qicheng
Abstract
While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
Chinese Translation
尽管最近在强化微调(Reinforcement Fine-Tuning, RFT)方面的进展表明,基于规则的奖励机制能够有效地进行大型语言模型的后训练,但其在跨模态、以视觉为中心的领域的扩展仍然很大程度上未被探索。这一局限性在医学影像领域尤为明显,因为有效的性能需要强大的视觉感知和结构化推理。在本研究中,我们通过提出VRFT-Aug,一个针对医学领域量身定制的视觉强化微调框架,来填补这一空白。VRFT-Aug引入了一系列旨在增强感知和推理的训练策略,包括先验知识注入、基于感知的策略优化、医学知识驱动的奖励塑造和行为模仿。这些方法共同旨在稳定和改善RFT过程。通过在多个医学数据集上进行广泛实验,我们展示了我们的方法在性能上始终优于标准监督微调和RFT基线。此外,我们提供了经验基础的见解和实用的训练启发式,这些可以推广到其他医学影像任务。我们希望这项工作为开发可靠且具备推理能力的高风险医学应用模型的持续努力提供可操作的指导和新的灵感。
cs.CV / 36 / 2602.10624

A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

用于零样本临床协作与皮肤病学自动概念发现的视觉-语言基础模型
Yan, Siyuan, Li, Xieji, Mo, Dan, Tschandl, Philipp, Jiang, Yiwen, Wang, Zhonghua, Hu, Ming, Ju, Lie, Vico-Alonso, Cristina, Zheng, Yizhen, Liu, Jiahe, Zhou, Juexiao, Chello, Camilla, Cheung, Jen G., Anriot, Julien, Thomas, Luc, Primiero, Clare, Tan, Gin, Ng, Aik Beng, See, Simon, Tang, Xiaoying, Ip, Albert, Liao, Xiaoyang, Bowling, Adrian, Haskett, Martin, Zhao, Shuang, Janda, Monika, Soyer, H. Peter, Mar, Victoria, Kittler, Harald, Ge, Zongyuan
Abstract
Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders disentangle clinically meaningful concepts without supervision, outperforming predefined-vocabulary approaches and enabling targeted suppression of artifact-induced biases, which enhances robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
Chinese Translation
医学基础模型在受控基准测试中显示出良好的前景,但广泛应用仍受到对特定任务微调依赖的限制。在此,我们介绍了DermFM-Zero,这是一种通过掩蔽潜在建模和对比学习在超过400万多模态数据点上训练的皮肤病学视觉-语言基础模型。我们在涵盖零样本诊断和多模态检索的20个基准测试中评估了DermFM-Zero,在无需任务特定适应的情况下取得了最先进的性能。我们进一步在三项涉及超过1100名临床医生的跨国读者研究中评估了其零样本能力。在初级保健环境中,AI辅助使全科医生在98种皮肤病上的鉴别诊断准确率几乎翻倍。在专科环境中,该模型在多模态皮肤癌评估中显著超越了获得认证的皮肤科医生。在协作工作流程中,AI辅助使非专家的表现超过了未受助的专家,同时改善了管理的适当性。最后,我们展示了DermFM-Zero的潜在表示是可解释的:稀疏自编码器以无监督方式解开了临床上有意义的概念,其表现优于预定义词汇的方法,并能够针对性地抑制由伪影引起的偏差,在无需重新训练的情况下增强了鲁棒性。这些发现表明,基础模型可以提供有效、安全和透明的零样本临床决策支持。
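The sparse-autoencoder probing of the latent space can be illustrated with a minimal forward pass; the latent width, expansion factor, tied decoder, and the suppress-one-unit rule below are all assumptions for illustration, not DermFM-Zero's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen model's d-dim latents; the real latent width
# and the SAE expansion factor k are assumptions here.
d, k = 8, 32
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)
b_enc = np.zeros(k)
W_dec = W_enc.T.copy()  # tied decoder weights, a common SAE choice

def sae_forward(x):
    """One SAE pass: non-negative ReLU code plus linear reconstruction.
    Sparsity comes from an L1 penalty on z during training (not run here);
    each active unit of z is a candidate 'concept'."""
    z = np.maximum(0.0, x @ W_enc + b_enc)
    return z, z @ W_dec

x = rng.normal(size=(d,))
z, x_hat = sae_forward(x)

# Targeted suppression of one concept: zero its unit before decoding,
# analogous to suppressing an artifact-linked direction without retraining.
z_supp = z.copy()
z_supp[int(z.argmax())] = 0.0
x_supp = z_supp @ W_dec
```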
cs.CV / 37 / 2602.10630

Eliminating VAE for Fast and High-Resolution Generative Detail Restoration

消除变分自编码器以实现快速高分辨率生成细节恢复
Wang, Yan, Zhao, Shijie, Li, Junlin, Zhang, Li
Abstract
Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit at slow inference and high demand on devices. To accelerate inference, recent works like GenDR adopt step distillation to reduce the step number to one. However, the memory budget still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, turning the latent-based GenDR into the pixel-space GenDR-Pix. However, upscaling with x8 pixel-shuffle may induce repeated-pattern artifacts. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize the outliers of amplitude. To improve inference performance, we integrate a padding-based self-ensemble with classifier-free guidance to improve inference-time scaling. Experimental results show that GenDR-Pix achieves a 2.8x speedup and 60% memory saving compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR methods. Notably, GenDR-Pix can restore a 4K image in only 1 second within 6 GB of memory.
Chinese Translation
扩散模型在现实世界超分辨率(SR)任务中取得了显著突破,尽管推理速度较慢且对设备要求较高。为了加速推理,最近的工作如GenDR采用步骤蒸馏将步骤数减少到一步。然而,内存限制仍然制约了最大处理尺寸,因此需要对高分辨率图像进行逐块恢复。通过对流程的分析,我们发现变分自编码器(VAE)是延迟和内存的瓶颈。为了彻底解决这个问题,我们利用像素(反)洗牌操作消除VAE,将基于潜变量的GenDR转变为像素空间的GenDR-Pix。然而,使用x8的像素洗牌上采样可能会导致重复模式的伪影。为了减轻失真,我们提出了一种多阶段对抗蒸馏方法,逐步去除编码器和解码器。具体而言,我们利用前一阶段模型的生成特征来指导对抗性判别。此外,我们提出随机填充以增强生成特征并避免判别器崩溃。我们还引入了一种掩蔽傅里叶空间损失,以惩罚幅度的异常值。为了提高推理性能,我们将基于填充的自集成与无分类器引导相结合,以改善推理扩展。实验结果表明,与GenDR相比,GenDR-Pix实现了2.8倍的加速和60%的内存节省,且视觉质量几乎没有下降,超越了其他一步扩散SR方法。值得注意的是,GenDR-Pix仅需1秒和6GB显存即可恢复4K图像。
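The pixel-(un)shuffle operations that replace the VAE are the standard space-to-depth / depth-to-space rearrangements; a minimal NumPy version following the usual channel-ordering convention (as in PyTorch's `PixelUnshuffle`) is:

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Space-to-depth: (C, H, W) -> (C*r*r, H/r, W/r); lossless and
    parameter-free, unlike a learned VAE encoder."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(c * r * r, h // r, w // r)

def pixel_shuffle(x, r):
    """Depth-to-space: exact inverse of pixel_unshuffle."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c // (r * r), h * r, w * r)

img = np.arange(3 * 8 * 8, dtype=np.float64).reshape(3, 8, 8)
latent = pixel_unshuffle(img, 8)          # (192, 1, 1): x8 "latent"
restored = pixel_shuffle(latent, 8)       # exact round-trip
```

Because the pair is exactly invertible, no learned encoder/decoder (and hence no VAE) is needed to move between pixel space and a spatially downsampled representation; the cost is that naive x8 shuffling can leave repeated-pattern artifacts, which is what the paper's adversarial distillation targets.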
cs.CV / 38 / 2602.10639

VideoSTF: Stress-Testing Output Repetition in Video Large Language Models

VideoSTF:视频大型语言模型输出重复的压力测试
Cao, Yuxin, Song, Wei, Xu, Shangzhi, Xue, Jingling, Dong, Jin Song
Abstract
Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.
Chinese Translation
视频大型语言模型(VideoLLMs)最近在视频理解任务中取得了强劲的表现。然而,我们发现了一种先前未被充分探讨的生成失败:严重的输出重复,模型陷入自我强化的重复短语或句子的循环中。这种失败模式并未被现有的VideoLLM基准所捕捉,后者主要关注任务准确性和事实正确性。我们引入了VideoSTF,这是第一个系统性测量和压力测试VideoLLMs输出重复的框架。VideoSTF通过三种互补的基于n-gram的指标形式化了重复,并提供了一个包含10,000个多样化视频的标准化测试平台,以及一个受控时间变换的库。使用VideoSTF,我们对10个先进的VideoLLMs进行了广泛的测试、时间压力测试和对抗性利用。我们发现输出重复现象普遍存在,并且对视频输入的时间扰动高度敏感。此外,我们展示了简单的时间变换可以在黑箱设置中有效诱导重复退化,暴露输出重复作为一种可利用的安全漏洞。我们的结果揭示了输出重复作为现代VideoLLMs中的一个基本稳定性问题,并促使对视频语言系统进行稳定性意识评估。我们的评估代码和脚本可在以下网址获取:https://github.com/yuxincao22/VideoSTF_benchmark。
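A simple n-gram repetition score of the kind VideoSTF formalizes is the fraction of duplicated n-grams in the output (the framework's three exact metrics may differ from this illustrative one):

```python
def ngram_repetition(text, n=3):
    """Fraction of n-grams that are duplicates: 0.0 for fully novel text,
    approaching 1.0 when the model loops on the same phrase."""
    toks = text.split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

score_loop = ngram_repetition(("the cat sits " * 10).strip())
score_clean = ngram_repetition("a quick brown fox jumps over the lazy dog")
```

A degenerate loop scores near 1.0 because only a handful of distinct trigrams cycle, while normal prose scores near 0.0.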
cs.CV / 39 / 2602.10659

Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

多模态先验增强的文本驱动3D人-物交互生成
Wang, Yin, Zhang, Ziyao, Leng, Zhiying, Liu, Haitian, Li, Frederick W. B., Li, Mu, Liang, Xiaohui
Abstract
We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion paradigm, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
Chinese Translation
我们针对文本驱动的3D人-物交互(HOI)运动生成这一具有挑战性的任务进行研究。现有方法主要依赖于直接的文本到HOI映射,但由于显著的跨模态差距,存在三个主要限制:(Q1)次优的人类运动,(Q2)不自然的物体运动,以及(Q3)人类与物体之间的交互较弱。为了解决这些挑战,我们提出了MP-HOI,这是一个基于四个核心见解的新框架:(1)多模态数据先验:我们利用来自大型多模态模型的多模态数据(文本、图像、姿态/物体)作为先验,以指导HOI生成,从而解决数据建模中的Q1和Q2。(2)增强的物体表示:我们通过结合几何关键点、接触特征和动态属性来改进现有的物体表示,使其能够表达丰富的物体特征,从而解决数据表示中的Q2。(3)多模态感知的专家混合(MoE)模型:我们提出了一种感知模态的MoE模型,用于有效的多模态特征融合,这解决了特征融合中的Q1和Q2。(4)带有交互监督的级联扩散:我们设计了一个级联扩散框架,在专门的监督下逐步细化人-物交互特征,从而解决交互细化中的Q3。全面的实验表明,MP-HOI在生成高保真和细粒度的HOI运动方面优于现有方法。
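A generic softmax-gated mixture-of-experts fusion, of which MP-HOI's modality-aware MoE is a more specialized variant, can be sketched as follows; the dimensions and linear experts are placeholders:

```python
import numpy as np

def moe_fuse(token, experts, gate_w):
    """Softmax-gated MoE: a gating network scores each expert from the
    input token, and the output is the gate-weighted sum of expert outputs."""
    logits = token @ gate_w                       # (n_experts,)
    g = np.exp(logits - logits.max())
    g /= g.sum()                                  # softmax gate weights
    outs = np.stack([f(token) for f in experts])  # (n_experts, d)
    return g @ outs, g

rng = np.random.default_rng(2)
d, n = 6, 3                                       # toy dims: 3 "modality" experts
gate_w = rng.normal(size=(d, n))
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n)]
x = rng.normal(size=(d,))
fused, gate = moe_fuse(x, experts, gate_w)
```

The gate lets each token lean on the expert best matched to its dominant modality, which is the intuition behind modality-aware routing.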
cs.CV / 40 / 2602.10660

AurigaNet: A Real-Time Multi-Task Network for Enhanced Urban Driving Perception

AurigaNet:一种增强城市驾驶感知的实时多任务网络
Ghasemzadeh, Kiarash, Dehghani, Sedigheh
Abstract
Self-driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi-task learning has emerged as a powerful approach to address complex problems in driving perception. Multi-task networks offer several advantages, including increased computational efficiency, real-time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi-task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end-to-end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real-time performance. These results underscore AurigaNet's potential as a robust and efficient solution for autonomous driving perception systems. The code can be found here https://github.com/KiaRational/AurigaNet.
Chinese Translation
自动驾驶汽车在减少交通事故、缓解拥堵和提升城市出行方面具有重要潜力。然而,为自主车辆开发可靠的人工智能系统仍然是一个重大挑战。在过去十年中,多任务学习作为解决驾驶感知复杂问题的有效方法逐渐崭露头角。多任务网络具有多个优势,包括提高计算效率、实时处理能力、优化资源利用和改善泛化能力。在本研究中,我们提出了AurigaNet,一种先进的多任务网络架构,旨在推动自主驾驶感知的边界。AurigaNet整合了三个关键任务:物体检测、车道检测和可行驶区域实例分割。该系统使用以多样化驾驶条件而闻名的BDD100K数据集进行训练和评估。AurigaNet的关键创新包括其端到端实例分割能力,显著提高了自动驾驶车辆路径估计的准确性和效率。实验结果表明,AurigaNet在可行驶区域分割中实现了85.2%的IoU,领先于其最接近的竞争对手0.7%。在车道检测中,AurigaNet实现了60.8%的IoU,超过其他模型30%以上。此外,该网络在交通物体检测中实现了 47.6% 的 mAP@0.5:0.95,超出下一个领先模型2.9%。此外,我们通过在嵌入式设备如Jetson Orin NX上部署AurigaNet,验证了其实际可行性,展现了竞争力的实时性能。这些结果强调了AurigaNet作为自主驾驶感知系统的强大而高效解决方案的潜力。代码可在此找到:https://github.com/KiaRational/AurigaNet。
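The IoU numbers quoted above are intersection-over-union between predicted and ground-truth masks:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two boolean masks: the metric
    behind the drivable-area and lane-detection figures above."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

pred = np.zeros((4, 4), bool); pred[:2, :] = True   # predicted: top half
gt = np.zeros((4, 4), bool); gt[:, :2] = True       # ground truth: left half
iou = mask_iou(pred, gt)                            # 4 / 12 = 1/3
```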
cs.CV / 41 / 2602.10662

Dynamic Frequency Modulation for Controllable Text-driven Image Generation

可控文本驱动图像生成的动态频率调制
Shi, Tiandong, Zhao, Ling, Qi, Ji, Ma, Jiayi, Peng, Chengli
Abstract
The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, whose performance heavily depends on appropriate selection, leading to suboptimal stability. This paper tries to solve the aforementioned problem from a frequency perspective and analyzes the impact of the frequency spectrum of noisy latent variables on the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the structure framework consistency while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
Chinese Translation
文本引导的扩散模型的成功确立了一种新的图像生成范式,该范式通过对文本提示的迭代优化驱动。然而,修改原始文本提示以实现预期的语义调整往往会导致意想不到的全局结构变化,从而干扰用户意图。现有方法依赖于经验特征图的选择进行干预,其性能在很大程度上依赖于适当的选择,导致稳定性不佳。本文试图从频率的角度解决上述问题,并分析噪声潜变量的频谱对生成过程中结构框架和细粒度纹理的层次性出现的影响。我们发现,低频成分主要负责在早期生成阶段建立结构框架。随着时间的推移,它们的影响减弱,转而由高频成分合成细粒度纹理。基于此,我们提出了一种无训练的频率调制方法,该方法利用具有动态衰减的频率依赖加权函数。该方法在保持结构框架一致性的同时,允许有针对性的语义修改。通过直接操控噪声潜变量,所提方法避免了对内部特征图的经验选择。大量实验表明,所提方法显著优于当前最先进的方法,在保持结构和实现语义更新之间达成了有效平衡。
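The frequency-dependent weighting with dynamic decay can be illustrated on a 2D latent; the hard low-pass band, the exponential decay, and the cutoff/decay constants below are illustrative assumptions rather than the paper's exact weighting function:

```python
import numpy as np

def freq_modulate(z_edit, z_src, t, T, cutoff=0.25, decay=4.0):
    """Training-free frequency blend: copy the source latent's
    low-frequency band (structure) into the edited latent with a weight
    alpha(t) that decays over the denoising schedule."""
    F_e, F_s = np.fft.fft2(z_edit), np.fft.fft2(z_src)
    h, w = z_edit.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low = (np.abs(fy) < cutoff) & (np.abs(fx) < cutoff)  # low-freq band
    alpha = np.exp(-decay * t / T)                       # dynamic decay
    F_out = np.where(low, alpha * F_s + (1 - alpha) * F_e, F_e)
    return np.real(np.fft.ifft2(F_out))

rng = np.random.default_rng(1)
z_e, z_s = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
early = freq_modulate(z_e, z_s, t=0, T=50)   # structure locked to source
late = freq_modulate(z_e, z_s, t=50, T=50)   # constraint mostly decayed
```

Early in sampling the edited latent inherits the source's low-frequency structure almost entirely; as t grows the constraint decays, while high-frequency texture components are never touched.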
cs.CV / 42 / 2602.10663

AMAP-APP: Efficient Segmentation and Morphometry Quantification of Fluorescent Microscopy Images of Podocytes

AMAP-APP:足细胞荧光显微图像的高效分割与形态计量量化
Fatehi, Arash, Unnersjö-Jess, David, Butt, Linus, Moreau, Noémie, Benzing, Thomas, Bozek, Katarzyna
Abstract
Background: Automated podocyte foot process quantification is vital for kidney research, but the established "Automatic Morphological Analysis of Podocytes" (AMAP) method is hindered by high computational demands, a lack of a user interface, and Linux dependency. We developed AMAP-APP, a cross-platform desktop application designed to overcome these barriers. Methods: AMAP-APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined Region of Interest (ROI) algorithm to improve precision. Validation involved 365 mouse and human images (STED and confocal), benchmarking performance against the original AMAP via Pearson correlation and Two One-Sided T-tests (TOST). Results: AMAP-APP achieved a 147-fold increase in processing speed on consumer hardware. Morphometric outputs (area, perimeter, circularity, and slit diaphragm density) showed high correlation (r>0.90) and statistical equivalence (TOST P<0.05) to the original method. Additionally, the new ROI algorithm demonstrated superior accuracy compared to the original, showing reduced deviation from manual delineations. Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry. By eliminating the need for high-performance computing clusters and providing a user-friendly interface for Windows, macOS, and Linux, it enables widespread adoption in nephrology research and potential clinical diagnostics.
Chinese Translation
背景:自动化足细胞足突量化对肾脏研究至关重要,但现有的“足细胞自动形态分析”(AMAP)方法受到高计算需求、缺乏用户界面和对Linux依赖的限制。我们开发了AMAP-APP,这是一款旨在克服这些障碍的跨平台桌面应用程序。方法:AMAP-APP通过用经典图像处理替代高强度实例分割来优化效率,同时保留原始的语义分割模型。它引入了一种改进的感兴趣区域(ROI)算法,以提高精度。验证涉及365幅小鼠和人类图像(STED和共聚焦),通过Pearson相关性和双单侧T检验(TOST)对原始AMAP进行性能基准测试。结果:AMAP-APP在消费级硬件上实现了处理速度的147倍提升。形态计量输出(面积、周长、圆度和裂孔膜密度)与原始方法表现出高相关性(r>0.90)和统计等效性(TOST P<0.05)。此外,新ROI算法相比于原始方法显示出更高的准确性,减少了与手动轮廓划定的偏差。结论:AMAP-APP使基于深度学习的足细胞形态计量变得更加普及。通过消除对高性能计算集群的需求,并为Windows、macOS和Linux提供用户友好的界面,它促进了在肾脏病研究和潜在临床诊断中的广泛应用。
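Circularity, one of the morphometric outputs validated above, is the standard shape factor 4πA/P², equal to 1 for a perfect circle and smaller for elongated shapes:

```python
import math

def circularity(area, perimeter):
    """4*pi*A / P^2: 1.0 for a perfect circle, < 1.0 otherwise."""
    return 4.0 * math.pi * area / perimeter ** 2

r = 3.0
circle = circularity(math.pi * r * r, 2.0 * math.pi * r)  # exactly 1.0
square = circularity(1.0, 4.0)                            # pi/4 ~ 0.785
```

In practice, area and perimeter are measured on each segmented foot process mask; the dimensionless ratio is what makes the metric comparable across magnifications.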
cs.CV / 43 / 2602.10675

TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning

TwiFF(未来帧思维):一个用于动态视觉推理的大规模数据集
Liu, Junhua, Wang, Zhangcheng, Han, Zhike, Wang, Ningli, Liang, Guotao, Kuang, Kun
Abstract
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which fully validates its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at https://github.com/LiuJunhua02/TwiFF.
Chinese Translation
视觉思维链(Visual Chain-of-Thought, VCoT)作为一种有前景的范式,通过将视觉感知融入中间推理步骤,增强了多模态推理。然而,现有的 VCoT 方法大多局限于静态场景,难以捕捉诸如指令、预测和相机运动等任务所需的时间动态。为了解决这一问题,我们提出了 TwiFF-2.7M,这是第一个基于270万个视频片段构建的大规模时间定位 VCoT 数据集,专门设计用于动态视觉问答。与此同时,我们引入了 TwiFF-Bench,这是一个包含 1,078 个样本的高质量评估基准,评估推理轨迹的合理性和开放式动态环境中最终答案的正确性。在此基础上,我们提出了 TwiFF 模型,这是一个统一的模型,协同利用预训练的视频生成和图像理解能力,生成时间上连贯的视觉推理线索——迭代生成未来的动作帧和文本推理。大量实验表明,TwiFF 在动态推理任务上显著优于现有的 VCoT 方法和文本思维链基线,充分验证了其在动态场景下视觉问答的有效性。我们的代码和数据可在 https://github.com/LiuJunhua02/TwiFF 获取。
cs.CV / 44 / 2602.10687

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

OmniVL-Guard:通过平衡强化学习实现统一的视觉-语言伪造检测与定位
Shen, Jinjie, Wu, Jing, Wang, Yaxiong, Cheng, Lechao, Tang, Shengeng, Hui, Tianrui, Pu, Nan, Zhong, Zhun
Abstract
Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
Chinese Translation
现有的伪造检测方法通常局限于单模态或双模态设置,无法处理现实世界中普遍存在的交织文本、图像和视频。为了解决这一问题,本文旨在开发一个统一的框架,以实现全面的视觉-语言伪造检测与定位。在这个统一的设置中,不同模态之间的相互作用以及同时检测和定位的双重需求构成了一个关键的“困难偏差”问题:较简单的真实性分类任务往往主导梯度,导致在多任务优化过程中细粒度定位的表现不佳。为了解决这一挑战,我们提出了OmniVL-Guard,一个用于全面视觉-语言伪造检测与定位的平衡强化学习框架。特别地,OmniVL-Guard包含两个核心设计:自我演化的链式推理生成(Self-Evolving CoT Generation)和自适应奖励缩放策略优化(Adaptive Reward Scaling Policy Optimization, ARSPO)。自我演化的链式推理生成合成高质量的推理路径,有效克服了冷启动挑战。在此基础上,自适应奖励缩放策略优化(ARSPO)动态调节奖励规模和任务权重,确保平衡的联合优化。大量实验表明,OmniVL-Guard显著优于最先进的方法,并在跨领域场景中展现出零样本的强健泛化能力。
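The balancing idea behind ARSPO can be sketched as per-task reward standardization, so that the easy veracity-classification reward cannot dominate the gradient scale; the real ARSPO schedule (dynamic reward scales plus task weights) is more elaborate than this sketch:

```python
def scale_rewards(task_rewards, eps=1e-8):
    """Z-score each task's reward batch independently, putting the easy
    classification task and the hard grounding task on the same scale
    (an illustrative stand-in for ARSPO's adaptive scaling)."""
    scaled = {}
    for task, rs in task_rewards.items():
        mean = sum(rs) / len(rs)
        std = (sum((r - mean) ** 2 for r in rs) / len(rs)) ** 0.5
        scaled[task] = [(r - mean) / (std + eps) for r in rs]
    return scaled

# Veracity rewards are large and saturated; grounding rewards are small.
out = scale_rewards({"veracity": [1.0, 1.0, 0.0, 0.0],
                     "grounding": [0.2, 0.1, 0.15, 0.05]})
```

After scaling, both tasks contribute comparably sized learning signals, which is the property the "difficulty bias" problem calls for.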
cs.CV / 45 / 2602.10698

AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

AugVLA-3D:基于深度驱动的特征增强用于视觉-语言-动作模型
Rao, Zhifeng, Chen, Wenlong, Xie, Lei, Hua, Xia, Yin, Dongfu, Tian, Zhen, Yu, F. Richard
Abstract
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLM trained using 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
Chinese Translation
视觉-语言-动作(VLA)模型最近在机器人感知和控制方面取得了显著进展,但大多数现有方法主要依赖于使用2D图像训练的视觉-语言模型(VLM),这限制了它们在复杂3D环境中的空间理解和动作基础。为了解决这一限制,我们提出了一种新颖的框架,将深度估计集成到VLA模型中,以丰富3D特征表示。具体而言,我们采用了一种名为VGGT的深度估计基线,从标准RGB输入中提取几何感知的3D线索,从而高效利用现有的大规模2D数据集,同时隐式恢复3D结构信息。为了进一步增强这些深度派生特征的可靠性,我们引入了一个新的模块,称为动作助手,该模块通过动作先验约束学习到的3D表示,并确保其与下游控制任务的一致性。通过将增强的3D特征与传统的2D视觉标记融合,我们的方法显著提高了VLA模型的泛化能力和鲁棒性。实验结果表明,所提出的方法不仅增强了在几何模糊场景中的感知能力,还提高了动作预测的准确性。本研究突显了基于深度驱动的数据增强和辅助专家监督在弥合机器人系统中2D观察与3D感知决策之间差距的潜力。
cs.CV / 46 / 2602.10704

(MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization

(MGS)²-网:统一微几何尺度与宏几何结构以实现跨视角地理定位
Li, Minglei, He, Mengfan, Chen, Chao, Meng, Ziyang
Abstract
Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)², a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)² achieves state-of-the-art performance, recording a Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: https://github.com/GabrielLi1473/MGS-Net.
Chinese Translation
跨视角地理定位(CVGL)对于在缺乏全球导航卫星系统(GNSS)的情况下进行无人机导航至关重要,但在倾斜航拍视图与正射卫星参考之间存在剧烈的几何失配时,仍然表现脆弱。现有方法主要在二维流形内操作,忽视了视图依赖的垂直立面(宏结构)和尺度变化(微尺度)所导致的特征对齐严重失真。为了解决这一问题,我们提出了(MGS)²,一个基于几何的框架。我们创新的核心是宏几何结构过滤(MGSF)模块。与对噪声敏感的逐像素匹配不同,MGSF利用扩张几何梯度物理地过滤掉高频立面伪影,同时增强视图不变的水平面,直接应对领域转移。为了确保这一结构过滤的稳健输入,我们明确地引入了微几何尺度适应(MGSA)模块。MGSA利用深度先验通过多分支特征融合动态校正尺度差异。此外,设计了一种几何-外观对比蒸馏(GACD)损失,以严格区分倾斜遮挡。大量实验表明,(MGS)²实现了最先进的性能,在University-1652上记录了97.5%的Recall@1,在SUES-200上为97.02%。此外,该框架在几何模糊性方面表现出优越的跨数据集泛化能力。代码可在以下链接获取:https://github.com/GabrielLi1473/MGS-Net。
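The facade-suppression idea behind MGSF can be caricatured in a few lines, assuming a dilated finite-difference gradient and a hard threshold (both our simplifications, not the paper's implementation):

```python
import numpy as np

def mgsf_filter(feat, dilation=1, thresh=0.5):
    """Toy sketch of Macro-Geometric Structure Filtering.

    The abstract says MGSF uses dilated geometric gradients to suppress
    high-frequency facade artifacts; we approximate that here with a
    dilated finite-difference gradient magnitude and a binary mask that
    keeps only low-gradient (roughly planar, view-invariant) regions.
    Threshold and padding choices are illustrative assumptions.
    """
    # Dilated forward differences (wrap-around padding for simplicity)
    gx = np.abs(feat - np.roll(feat, dilation, axis=1))
    gy = np.abs(feat - np.roll(feat, dilation, axis=0))
    grad = gx + gy
    mask = (grad < thresh).astype(feat.dtype)  # keep low-gradient regions
    return feat * mask

img = np.zeros((8, 8))
img[:, 4:] = 1.0          # a sharp vertical "facade" edge
out = mgsf_filter(img)    # edge column suppressed, flat regions preserved
```

A real implementation would operate on learned feature maps rather than raw intensities and use a soft, learnable weighting instead of a hard mask.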
cs.CV / 47 / 2602.10710

FGAA-FPN: Foreground-Guided Angle-Aware Feature Pyramid Network for Oriented Object Detection

FGAA-FPN:面向物体检测的前景引导角度感知特征金字塔网络
Ma, Jialin
Abstract
With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.
Chinese Translation
随着高分辨率遥感和航空影像的日益普及,面向物体检测已成为地理信息更新、海洋监测和灾害响应的关键能力。然而,由于背景杂乱、尺度变化剧烈和方向变化大,这一任务仍然面临挑战。现有方法主要通过特征金字塔网络的多尺度特征融合或利用注意力机制进行上下文建模来提高性能,但往往缺乏明确的前景建模,并未充分利用几何方向先验,从而限制了特征的可区分性。为克服这些局限性,我们提出了FGAA-FPN,一种面向物体检测的前景引导角度感知特征金字塔网络。FGAA-FPN基于层次功能分解构建,考虑了金字塔层级间的不同空间分辨率和语义抽象,从而增强了多尺度表示。具体而言,前景引导特征调制模块在弱监督下学习前景显著性,以增强物体区域并抑制低级特征中的背景干扰。同时,角度感知多头注意力模块编码相对方向关系,以引导高层语义特征之间的全局交互。在DOTA v1.0和DOTA v1.5上的大量实验表明,FGAA-FPN实现了最先进的结果,分别达到了75.5%和68.3%的mAP。
cs.CV / 48 / 2602.10720

Ecological mapping with geospatial foundation models

基于地理空间基础模型的生态映射
Mahlasi, Craig, Baloyi, Gciniwe S., Gaffoor, Zaheed, Klein, Levente, Jones, Anne, Vos, Etienne, Muszynski, Michal, Dawson, Geoffrey, Watson, Campbell
Abstract
Geospatial foundation models (GFMs) are a fast-emerging paradigm for various geospatial tasks, such as ecological mapping. However, the utility of GFMs has not been fully explored for high-value use cases. This study aims to explore the utility, challenges and opportunities associated with the application of GFMs for ecological uses. In this regard, we fine-tune several pretrained AI models, namely Prithvi-EO-2.0 and TerraMind, across three use cases, and compare them with a baseline ResNet-101 model. First, we demonstrate TerraMind's LULC generation capabilities; we then explore the utility of the GFMs in forest functional trait mapping and peatland detection. In all experiments, the GFMs outperform the baseline ResNet models. In general, TerraMind marginally outperforms Prithvi. However, with additional modalities TerraMind significantly outperforms the baseline ResNet and Prithvi models. Nonetheless, consideration should be given to the divergence of input data from pretrained modalities. We note that these models would benefit from higher resolution and more accurate labels, especially for use cases where pixel-level dynamics need to be mapped.
Chinese Translation
地理空间基础模型(GFMs)是一个快速发展的范式,适用于各种地理空间任务,如生态映射。然而,GFMs在高价值应用场景中的实用性尚未得到充分探讨。本研究旨在探索GFMs在生态应用中的实用性、挑战和机遇。在这方面,我们对多个预训练的人工智能模型进行了微调,即Prithvi-E0-2.0和TerraMind,并在三个应用案例中进行比较,同时与基线模型ResNet-101进行对比。首先,我们展示了TerraMind在土地利用/覆盖(LULC)生成方面的能力。最后,我们探讨了GFMs在森林功能性状映射和泥炭地检测中的应用价值。在所有实验中,GFMs的表现均优于基线ResNet模型。总体而言,TerraMind的表现略优于Prithvi。然而,随着额外模态的引入,TerraMind的表现显著优于基线ResNet和Prithvi模型。尽管如此,仍需考虑输入数据与预训练模态之间的差异。我们注意到,这些模型将受益于更高的分辨率和更准确的标签,特别是在需要进行像素级动态映射的应用场景中。
cs.CV / 49 / 2602.10722

A Diffusion-Based Generative Prior Approach to Sparse-view Computed Tomography

基于扩散的生成先验方法在稀疏视角计算机断层扫描中的应用
Evangelista, Davide, Cascarano, Pasquale, Piccolomini, Elena Loli
Abstract
The reconstruction of X-ray CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, the use of deep generative models in this context holds great interest and potential. In the Deep Generative Prior (DGP) framework, a diffusion-based generative model is combined with an iterative optimization algorithm to reconstruct CT images from sinograms acquired under sparse geometries, maintaining the explainability of a model-based approach while introducing the generative power of a neural network. Several aspects of these frameworks can be further investigated to improve reconstruction quality, such as image generation, the model, and the iterative algorithm used to solve the minimization problem, for which we propose modifications with respect to existing approaches. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.
Chinese Translation
从稀疏或有限角度几何形状重建X射线CT图像是一项极具挑战性的任务。数据的缺乏通常会导致重建图像中出现伪影,甚至可能导致物体变形。因此,在这一背景下使用深度生成模型引起了极大的兴趣,并具有潜在的成功可能性。在深度生成先验(Deep Generative Prior, DGP)框架中,结合基于扩散的生成模型与迭代优化算法,用于从在稀疏几何下获取的正弦图重建CT图像,以保持基于模型的方法的可解释性,同时引入神经网络的生成能力。因此,在这些框架内还有多个方面可以进一步研究,以提高重建质量,例如图像生成、模型以及用于解决最小化问题的迭代算法,我们针对现有方法提出了修改建议。尽管在高度稀疏的几何条件下获得的结果非常有希望,但显然在这一方向上仍需进一步研究。
cs.CV / 50 / 2602.10728

OccFace: Unified Occlusion-Aware Facial Landmark Detection with Per-Point Visibility

OccFace:统一的考虑遮挡的人脸关键点检测框架,具备逐点可见性
Xiang, Xinhao, Li, Zhengxin, Dhakad, Saurav, Bancroft, Theo, Zhang, Jiawei, Li, Weiyang
Abstract
Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting the per-point visibility from which downstream applications can benefit. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, [email protected], and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.
Chinese Translation
在遮挡情况下,准确的人脸关键点检测仍然具有挑战性,尤其是对于外观变化大和因旋转引起自遮挡的人脸。现有的检测器通常在隐式处理遮挡的同时定位关键点,而未能预测逐点可见性,这对下游应用是有益的。我们提出了OccFace,一个针对通用人类面孔的遮挡感知框架,包括人类、风格化角色和其他非人类设计。OccFace采用统一的密集100点布局和基于热图的主干网络,并增加了一个遮挡模块,该模块通过结合局部证据与跨关键点上下文,联合预测关键点坐标和逐点可见性。可见性监督将手动标签与基于关键点的掩膜混合,从掩膜热图重叠中推导出伪可见性。我们还创建了一个遮挡感知评估套件,报告可见与遮挡关键点的NME,并使用Occ AP、[email protected]和ROC-AUC对可见性进行基准测试,同时提供一个标注有100点关键点和逐点可见性的数据集。实验表明,在外部遮挡和大幅头部旋转下,模型的鲁棒性得到了改善,尤其是在遮挡区域,同时保持了对可见关键点的准确性。
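The pseudo-visibility labeling from mask-heatmap overlap can be sketched directly; the exact rule is not given in the abstract, so the mass-fraction criterion and threshold below are our assumptions:

```python
import numpy as np

def pseudo_visibility(heatmap, occluder_mask, overlap_thresh=0.5):
    """Derive a per-landmark pseudo visibility label (illustrative).

    Assumed rule: a landmark is labeled occluded (0.0) if the fraction
    of its heatmap mass that falls inside the occluder mask exceeds a
    threshold, and visible (1.0) otherwise.
    """
    mass = heatmap.sum()
    if mass == 0:
        return 1.0  # no evidence either way: default to visible
    overlap = (heatmap * occluder_mask).sum() / mass
    return 0.0 if overlap > overlap_thresh else 1.0

hm = np.zeros((4, 4)); hm[0, 0] = 1.0      # landmark heatmap peak
mask = np.zeros((4, 4)); mask[:2, :2] = 1  # synthetic occluder over the peak
vis = pseudo_visibility(hm, mask)          # labeled occluded
```

In the paper's setting these pseudo labels are mixed with manual annotations to supervise the visibility head.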
cs.CV / 51 / 2602.10744

Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning

基于无内容多模型导向表示学习的自监督图像超分辨率质量评估
Majlessi, Kian, Soltani, Amir Masoud, Mahdavi, Mohammad Ebrahim, Gourrier, Aurelien, Adibi, Peyman
Abstract
Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, these distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR-model-oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To support unsupervised pretext training, we constructed a new dataset, SRMORSS; it includes a wide range of SR algorithms applied to numerous real LR images, which addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3RIQA consistently outperforms most relevant state-of-the-art metrics.
Chinese Translation
应用于现实世界低分辨率(LR)图像的超分辨率(SR)通常会导致复杂且不规则的退化,这源于自然场景获取的固有复杂性。与在明确场景下创建的合成LR图像所产生的SR伪影相比,这些失真是高度不可预测的,并且在不同的现实生活环境中显著变化。因此,评估从现实LR获得的SR图像(SR-IQA)的质量仍然是一个具有挑战性且未得到充分研究的问题。在本研究中,我们提出了一种针对这种高度不适定现实环境的无参考SR-IQA方法。所提出的方法使得在数据稀缺领域的现实SR应用中实现领域自适应的图像质量评估(IQA)。我们假设超分辨率图像中的退化在很大程度上依赖于底层SR算法,而不仅仅由图像内容决定。为此,我们引入了一种自监督学习(SSL)策略,该策略首先在预训练阶段预训练多个SR模型导向的表示。我们的对比学习框架从由同一SR模型生成的图像中形成正样本对,从由不同方法生成的图像中形成负样本对,这与图像内容无关。所提出的方法S3 RIQA进一步结合了针对性预处理,以提取互补的质量信息,并引入辅助任务以更好地处理与不同SR缩放因子相关的各种退化特征。为此,我们构建了一个新的数据集SRMORSS,以支持无监督的预训练;该数据集包括应用于众多真实LR图像的广泛SR算法,填补了现有数据集的空白。在真实SR-IQA基准上的实验表明,S3 RIQA在大多数相关的最先进指标中始终表现优于其他方法。
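The model-oriented pair construction described above is content-independent and easy to state; a sketch (the SR model names below are made up for illustration):

```python
import numpy as np

def pair_labels(model_ids):
    """Build contrastive pair labels from SR-model provenance.

    Per the abstract: positives are images produced by the same SR
    model, negatives come from different models, regardless of image
    content. Returns a boolean matrix, True where a pair is positive.
    """
    ids = np.asarray(model_ids)
    return ids[:, None] == ids[None, :]

# Four super-resolved crops from three hypothetical SR models
labels = pair_labels(["esrgan", "esrgan", "swinir", "realsr"])
```

These labels would then drive a standard contrastive objective (e.g. InfoNCE) over the encoder's embeddings.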
cs.CV / 52 / 2602.10745

Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data

用于高光谱数据回归的光谱-空间对比学习框架
Dhaini, Mohamad, Honeine, Paul, Berar, Maxime, Van Exem, Antonin
Abstract
Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, there is still a shortage in studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks for hyperspectral data, in a model-agnostic design allowing to enhance backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.
Chinese Translation
对比学习在表示学习中取得了巨大的成功,尤其是在图像分类任务中。然而,针对回归任务的研究仍然较为匮乏,特别是在高光谱数据的应用方面。本文提出了一种用于高光谱数据回归任务的光谱-空间对比学习框架,该框架采用模型无关的设计,能够增强如3D卷积网络和基于变换器(transformer)的网络等主干网络。此外,我们提供了一系列与增强高光谱数据相关的变换方法。在合成和真实数据集上的实验表明,所提出的框架和变换显著提高了所有研究的主干模型的性能。
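The abstract does not enumerate its collection of hyperspectral augmentations; one plausible transformation of the kind it describes is whole-band dropout, sketched here as an illustrative example (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def band_dropout(cube, p=0.2, rng=rng):
    """Randomly zero out entire spectral bands (illustrative augmentation).

    cube: (H, W, B) hyperspectral patch. Dropping whole bands prevents
    the encoder from over-relying on any single wavelength, which is
    the kind of spectral perturbation a contrastive framework needs.
    """
    keep = rng.random(cube.shape[-1]) >= p
    return cube * keep[None, None, :]

patch = np.ones((4, 4, 10))  # toy patch with 10 spectral bands
aug = band_dropout(patch)
```

Spatial augmentations (crops, flips) would be applied alongside spectral ones to form the two views of each contrastive pair.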
cs.CV / 53 / 2602.10757

Text-to-Vector Conversion for Residential Plan Design

住宅平面设计的文本到向量转换
Bazhenov, Egor, Kasai, Stepan, Shalamov, Viacheslav, Efimova, Valeria
Abstract
Computer graphics, comprising both raster and vector components, is a fundamental part of modern science, industry, and digital communication. While raster graphics offer ease of use, their pixel-based structure limits scalability. Vector graphics, defined by mathematical primitives, provide scalability without quality loss; however, they are more complex to produce. For design and architecture, the versatility of vector graphics is paramount, despite its computational demands. This paper introduces a novel method for generating vector residential plans from textual descriptions. Our approach surpasses existing solutions by approximately 5% in CLIPScore-based visual quality, benefiting from its inherent handling of right angles and flexible settings. Additionally, we present a new algorithm for vectorizing raster plans into structured vector images, whose CLIPScore is about 4% higher than that of competing methods.
Chinese Translation
计算机图形学,包括光栅和矢量组件,是现代科学、工业和数字通信的基础部分。尽管光栅图形易于使用,但其基于像素的结构限制了可扩展性。矢量图形由数学原语定义,提供了无质量损失的可扩展性,然而,其生成过程更为复杂。在设计和建筑领域,矢量图形的多功能性至关重要,尽管其计算需求较高。本文介绍了一种从文本描述生成矢量住宅平面的新方法。我们的方案在基于 CLIPScore 的视觉质量上超越了现有解决方案约 5%,得益于其固有的直角处理能力和灵活的设置。此外,我们还提出了一种新的算法,用于将光栅平面向量化为结构化的矢量图像。这些图像的 CLIPscore 比其他图像高出约 4%。
cs.CV / 54 / 2602.10764

Dual-End Consistency Model

双端一致性模型
Dong, Linwei, Guo, Ruoyu, Bai, Ge, Yuan, Zehuan, Luo, Yawei, Zou, Changqing
Abstract
The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first conduct an analysis of these two limitations: training instability originates from loss divergence induced by the unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on this analysis, we propose the Dual-End Consistency Model (DE-CM), which selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CM objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
Chinese Translation
缓慢的迭代采样特性仍然是扩散和基于流的生成模型实际部署的主要瓶颈。尽管一致性模型(Consistency Models, CMs)代表了一种基于蒸馏的高效生成的最先进方法,但其大规模应用仍然受到两个关键问题的限制:训练不稳定性和采样不灵活性。现有方法试图通过架构调整或正则化目标来缓解这些问题,但忽视了对轨迹选择的关键依赖。在本研究中,我们首先分析了这两个限制:训练不稳定性源于不稳定的自监督项引起的损失发散,而采样不灵活性则源于误差积累。基于这些见解和分析,我们提出了双端一致性模型(Dual-End Consistency Model, DE-CM),该模型选择重要的子轨迹簇以实现稳定和有效的训练。DE-CM对PF-ODE轨迹进行分解,并选择三个关键的子轨迹作为优化目标。具体而言,我们的方法利用连续时间CMs目标实现少步蒸馏,并利用流匹配作为边界正则化器以稳定训练过程。此外,我们提出了一种新颖的噪声到噪声(Noise-to-Noisy, N2N)映射,可以将噪声映射到任何点,从而缓解第一步中的误差积累。大量实验结果表明我们方法的有效性:在ImageNet 256x256数据集上,它在一步生成中达到了1.70的最先进FID得分,超越了现有基于CM的一步方法。
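The flow-matching boundary regularizer mentioned above is, in its generic conditional form (DE-CM's exact objective is not reproduced here), a regression onto the constant velocity of a linear noise-to-data path:

```python
import numpy as np

def flow_matching_loss(v_pred_fn, x0, x1, t):
    """Conditional flow matching on a linear path (generic sketch).

    x_t = (1 - t) * x0 + t * x1, with target velocity x1 - x0.
    The loss is the mean squared error between the predicted velocity
    at (x_t, t) and that target.
    """
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return float(np.mean((v_pred_fn(xt, t) - target) ** 2))

x0 = np.zeros(4)                    # noise sample
x1 = np.ones(4)                     # data sample
perfect_v = lambda x, t: x1 - x0    # the ideal velocity field for this pair
loss = flow_matching_loss(perfect_v, x0, x1, 0.3)
```

In DE-CM this term anchors the trajectory boundaries while the continuous-time CM objective handles few-step distillation.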
cs.CV / 55 / 2602.10771

From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?

从驾驶到骑行:自主驾驶视觉语言模型是否能够推广到骑行者辅助的空间感知与规划?
Nakka, Krishna Kanth, Nakka, Vedasri
Abstract
Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.
Chinese Translation
骑行者在城市交通中经常遇到安全关键的情况,这突显了需要辅助系统来支持安全和知情的决策。最近,视觉语言模型(VLMs)在自主驾驶基准测试中表现出色,表明它们在一般交通理解和导航相关推理方面的潜力。然而,现有的评估主要以车辆为中心,未能从骑行者的视角评估感知和推理。为了解决这一问题,我们引入了CyclingVQA,这是一个旨在从骑行者的角度探测感知、时空理解和交通规则与车道推理的诊断基准。对31个以上的近期VLM进行评估,这些模型涵盖了通用型、空间增强型和专门针对自主驾驶的模型,我们发现当前模型表现出令人鼓舞的能力,同时也揭示了在骑行者中心的感知和推理方面明显的改进空间,特别是在解释骑行者特定的交通提示和将标志与正确的导航车道关联方面。值得注意的是,几个专门针对驾驶的模型在性能上不及强大的通用VLM,表明从车辆中心训练到骑行者辅助场景的迁移有限。最后,通过系统的错误分析,我们识别出重复的失败模式,以指导更有效的骑行者辅助智能系统的开发。
cs.CV / 56 / 2602.10799

RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation

RSHallu:针对遥感多模态大语言模型的双模式幻觉评估与领域定制缓解
Zhou, Zihui, Feng, Yong, Chen, Yanying, Duan, Guofan, Song, Zhenxi, Zhou, Mingliang, Jia, Weijia
Abstract
Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
Chinese Translation
多模态大语言模型(MLLMs)在遥感(RS)领域的应用日益广泛,并在遥感视觉基础(RSVG)、遥感视觉问答(RSVQA)和多模态对话等任务中表现出色。然而,幻觉现象,即与输入的遥感图像不一致的响应,严重阻碍了它们在高风险场景(如紧急管理和农业监测)中的应用,并在遥感领域尚未得到充分探索。在本研究中,我们提出了RSHallu,这是一个系统性的研究,包含三个主要成果:(1)我们通过一个面向遥感的分类法对遥感幻觉进行了形式化定义,并引入了图像级幻觉,以捕捉超越对象中心错误(如模态、分辨率和场景级语义)的遥感特定不一致性;(2)我们构建了一个幻觉基准RSHalluEval(2,023个问答对),并实现了双模式检查,支持高精度的云审计和通过在RSHalluCheck数据集(15,396个问答对)上微调的紧凑检查器进行低成本的可重复本地检查;(3)我们引入了一个领域定制的数据集RSHalluShield(30,000个问答对),以便于训练友好的缓解,并进一步提出了无训练的即插即用策略,包括解码时的logit修正和遥感感知提示。在代表性的遥感多模态大语言模型中,我们的缓解措施在统一协议下将无幻觉率提高了多达21.63个百分点,同时在下游遥感任务(RSVQA/RSVG)中保持了竞争力的表现。代码和数据集将会发布。
cs.CV / 57 / 2602.10806

DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples

DMP-3DAD:通过真实深度图投影进行跨类别3D异常检测,使用少量正常样本
Wang, Zi, Hotta, Katsuya, Kamide, Koichiro, Zou, Yawen, Qin, Jianjian, Zhang, Chao, Yu, Jun
Abstract
Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, requiring no fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under the few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.
Chinese Translation
跨类别异常检测针对3D点云的目标是仅使用少量正常样本来判断一个未见物体是否属于目标类别。现有的大多数方法依赖于特定类别的训练,这限制了它们在少样本场景中的灵活性。本文提出了DMP-3DAD,一个基于多视角真实深度图投影的无训练框架,用于跨类别3D异常检测。具体而言,通过将点云转换为固定集的真实深度图像,我们的方法利用冻结的CLIP视觉编码器提取多视角表示,并通过加权特征相似性进行异常检测,这不需要任何微调或类别依赖的适应。对ShapeNetPart数据集的广泛实验表明,DMP-3DAD在少样本设置下实现了最先进的性能。结果表明,所提出的方法为实际的跨类别3D异常检测提供了一个简单而有效的解决方案。
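The weighted multi-view similarity scoring can be sketched on pre-computed view features (the best-match aggregation and weighting scheme below are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def anomaly_score(query_views, normal_views, view_weights):
    """Weighted multi-view feature-similarity anomaly score (sketch).

    query_views: (V, D) per-view embeddings of the query object
    (e.g. CLIP features of its rendered depth maps); normal_views is a
    list of such arrays for the few normal support samples. The score
    is one minus the weighted mean of the best per-view cosine
    similarity over the support set.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    sims = []
    for v, w in enumerate(view_weights):
        best = max(cos(query_views[v], n[v]) for n in normal_views)
        sims.append(w * best)
    return 1.0 - sum(sims) / sum(view_weights)

q = np.eye(3)               # toy: 3 views, 3-dim features
normals = [np.eye(3)]       # one normal sample with identical views
score = anomaly_score(q, normals, view_weights=[1.0, 1.0, 1.0])
```

A query identical to a normal sample scores near zero; dissimilar views push the score toward one.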
cs.CV / 58 / 2602.10809

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

DeepImageSearch:基于上下文感知的视觉历史图像检索的多模态智能体基准测试
Deng, Chenlong, Deng, Mengjie, Wu, Junjie, Zeng, Dun, Wang, Teng, Xie, Qingsong, Huang, Jiadeng, Ma, Shengjie, Zhang, Changwang, Wang, Zhaoxiang, Wang, Jun, Zhu, Yutao, Dou, Zhicheng
Abstract
Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
Chinese Translation
现有的多模态检索系统在语义匹配方面表现出色,但隐含地假设查询图像的相关性可以孤立地进行测量。这一范式忽视了现实视觉流中固有的丰富依赖关系,信息分布在时间序列中,而不是局限于单一快照。为了解决这一问题,我们提出了DeepImageSearch,这是一种新的智能体范式,将图像检索重新定义为自主探索任务。模型必须对原始视觉历史进行多步推理和规划,以根据隐含的上下文线索定位目标。我们构建了DISBench,这是一个基于互联视觉数据的挑战性基准。为了解决创建依赖上下文的查询的可扩展性挑战,我们提出了一种人机协作管道,利用视觉-语言模型挖掘潜在的时空关联,有效地在人工验证之前卸载密集的上下文发现。此外,我们使用配备精细工具和双重记忆系统的模块化智能体框架构建了一个稳健的基线,以支持长时间导航。大量实验表明,DISBench对最先进的模型提出了重大挑战,突显了将智能体推理纳入下一代检索系统的必要性。
cs.CV / 59 / 2602.10815

Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

为什么强化学习(RL)的泛化能力优于监督微调(SFT)?基于数据的视觉-语言模型(VLM)后训练视角
Lu, Aojun, Feng, Tao, Yuan, Hangjie, Li, Wei, Sun, Yanan
Abstract
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
Chinese Translation
通过后训练对大规模视觉-语言模型(VLM)的适应揭示了显著的泛化差距:使用强化学习(RL)微调的模型在分布外(OOD)性能上始终优于使用监督微调(SFT)训练的模型。本文提出了一种基于数据的解释,认为RL的泛化优势源于一种隐含的数据过滤机制,该机制本质上优先考虑中等难度的训练样本。为了验证这一假设,我们系统地评估了不同难度水平训练数据集上SFT模型的OOD泛化能力。我们的结果确认数据难度是一个关键因素,表明在困难样本上训练显著降低了OOD性能。基于这一发现,我们引入了困难样本策划的SFT(DC-SFT),这是一种明确根据样本难度过滤训练集的简单方法。实验表明,DC-SFT不仅显著提升了OOD泛化能力,超越了标准SFT的表现,还超过了基于RL的训练性能,同时提供了更大的稳定性和计算效率。该研究为VLM中的OOD泛化差距提供了基于数据的解释,并建立了一条更高效的实现稳健泛化的路径。代码可在 https://github.com/byyx666/DC-SFT 获取。
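The DC-SFT idea reduces to a one-line filter once each sample carries a difficulty proxy (e.g. the model's failure rate on it); the thresholds below are placeholders, not the paper's values:

```python
def dc_sft_filter(samples, lo=0.2, hi=0.8):
    """Difficulty-Curated SFT filtering (sketch; thresholds assumed).

    The abstract argues that training on hard samples degrades OOD
    generalization and that RL implicitly prioritizes medium-difficulty
    data, so we keep only the middle difficulty band for SFT.
    """
    return [s for s in samples if lo <= s["difficulty"] <= hi]

data = [
    {"id": 0, "difficulty": 0.05},  # too easy: dropped
    {"id": 1, "difficulty": 0.5},   # medium: kept
    {"id": 2, "difficulty": 0.95},  # too hard: dropped
]
kept = dc_sft_filter(data)
```

Standard SFT then runs on `kept` only, with no change to the training objective itself.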
cs.CV / 60 / 2602.10818

Resource-Efficient RGB-Only Action Recognition for Edge Deployment

面向边缘部署的资源高效RGB-only动作识别
Yoon, Dongsik, Kim, Jongeun, Lee, Dayeon
Abstract
Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.
Chinese Translation
在边缘设备上进行动作识别面临着延迟、内存、存储和功耗等严格的限制。虽然诸如骨骼和深度信息等辅助模态可以提升识别性能,但它们通常需要额外的传感器或计算成本高昂的姿态估计流程,从而限制了在边缘环境中的实际应用。在本研究中,我们提出了一种紧凑的仅基于RGB的网络,旨在实现高效的设备端推理。我们的方法基于增强了时间位移(Temporal Shift)的X3D风格主干网络,并进一步引入了选择性时间适应和无参数注意力机制。在NTU RGB+D 60和120基准测试上的大量实验表明了良好的准确性与效率平衡。此外,在Jetson Orin Nano上的部署级分析验证了与现有基于RGB的动作识别技术相比,具有更小的设备端占用和更实际的资源利用率。
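The Temporal Shift mechanism the backbone builds on can be sketched generically (fold ratio and shapes are illustrative; real implementations shift channels of 4D video feature maps in place):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Temporal Shift Module, as popularized by TSM (generic sketch).

    x: (T, C) per-frame features. A fraction of channels is shifted one
    step backward in time, another fraction one step forward, and the
    rest stay put, giving temporal mixing at zero extra FLOPs.
    """
    t, c = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift toward the past
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift toward the future
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

x = np.arange(4 * 8, dtype=float).reshape(4, 8)  # 4 frames, 8 channels
y = temporal_shift(x)
```

Because the shift is a pure memory operation, it adds temporal modeling to a 2D/3D backbone at negligible edge-device cost.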
cs.CV / 61 / 2602.10825

Flow caching for autoregressive video generation

自回归视频生成的流缓存
Ma, Yuexiao, Zheng, Xuzhe, Xu, Jing, Xu, Xiwei, Ling, Feng, Zheng, Xiawu, Kuang, Huafeng, Li, Huixia, Wang, Xing, Xiao, Xuefeng, Chao, Fei, Ji, Rongrong
Abstract
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames-an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation-establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
Chinese Translation
自回归模型通常基于Transformer架构,代表了一种强大的范式,通过顺序合成内容生成超长视频。然而,这一顺序生成过程出了名地缓慢。尽管缓存策略已被证明对加速传统视频扩散模型有效,但现有方法假设所有帧的去噪均匀——这一假设在自回归模型中失效,因为不同的视频块在相同时间步展现出不同的相似性模式。本文提出了FlowCache,这是第一个专门为自回归视频生成设计的缓存框架。我们的关键见解是,每个视频块应保持独立的缓存策略,从而在每个时间步对哪些块需要重新计算进行细粒度控制。我们引入了一种块级缓存策略,动态适应每个块独特的去噪特性,并辅以联合重要性-冗余优化的KV缓存压缩机制,保持固定的内存界限,同时保留生成质量。我们的方法在MAGI-1上实现了2.38倍的显著加速,在SkyReels-V2上实现了6.7倍的加速,且质量下降微乎其微(VBench:分别增加0.87和减少0.79)。这些结果表明,FlowCache成功释放了自回归模型在实时超长视频生成中的潜力,为大规模高效视频合成建立了新的基准。代码可在https://github.com/mikeallen39/FlowCache获取。
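A chunk-wise cache policy of the kind the abstract describes can be sketched as follows (the drift criterion and threshold are our assumptions, not FlowCache's actual policy):

```python
import numpy as np

def chunk_cache_decisions(prev_feats, curr_feats, tol=0.05):
    """Per-chunk recompute-or-reuse decision (illustrative sketch).

    Each video chunk gets an independent caching decision, since chunks
    exhibit different similarity patterns at the same timestep. Here a
    chunk is recomputed only if its features drifted by more than `tol`
    in relative L2 norm since the last computed step.
    """
    decisions = []
    for p, c in zip(prev_feats, curr_feats):
        drift = np.linalg.norm(c - p) / (np.linalg.norm(p) + 1e-8)
        decisions.append("recompute" if drift > tol else "reuse")
    return decisions

prev = [np.ones(8), np.ones(8)]
curr = [np.ones(8), np.ones(8) * 1.5]   # only chunk 1 changed noticeably
decisions = chunk_cache_decisions(prev, curr)
```

A uniform policy would recompute both chunks or neither; the per-chunk rule is what lets stable chunks skip work while fast-changing ones stay accurate.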
cs.CV / 62 / 2602.10858

Hyperspectral Smoke Segmentation via Mixture of Prototypes

基于原型混合的高光谱烟雾分割
Yao, Lujian, Zhao, Haitao, Kong, Xianghai, Xu, Yuhan
Abstract
Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotations protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) Band split for spectral isolation, (2) Prototype-based spectral representation for diverse patterns, and (3) Dual-level router for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.
Chinese Translation
烟雾分割对于野火管理和工业安全应用至关重要。传统的基于可见光的方法由于光谱信息不足而面临局限,尤其是在云干扰和半透明烟雾区域的处理上存在困难。为了解决这些挑战,我们引入高光谱成像用于烟雾分割,并首次提出高光谱烟雾分割数据集(HSSDataset),该数据集包含从20个真实场景中收集的超过18,000帧的精心标注样本,采用多对一标注协议。然而,不同的光谱波段在空间区域中表现出不同的区分能力,因此需要自适应波段加权策略。我们将其分解为三个技术挑战:光谱交互污染、有限的光谱模式建模和复杂的加权路由问题。我们提出了一种原型混合(Mixture of Prototypes, MoP)网络,具有:(1)用于光谱隔离的波段分离,(2)基于原型的光谱表示以适应多样化模式,以及(3)用于自适应空间感知波段加权的双层路由器。我们进一步构建了一个多光谱数据集(MSSDataset),包含RGB-红外图像。大量实验验证了在高光谱和多光谱模式下的优越性能,为基于光谱的烟雾分割建立了新的范式。
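The band-split-plus-weighting idea can be sketched with numpy. The grouping, the pooled summaries, and the `router_w` mixing matrix are all hypothetical stand-ins for the paper's prototypes and dual-level router, shown only to make the adaptive spatial band weighting concrete:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def band_group_weighting(cube, n_groups, router_w):
    """Split a (bands, H, W) hyperspectral cube into contiguous band groups,
    then fuse the per-group summaries with spatially adaptive softmax
    weights. `router_w` is a hypothetical (G, G) mixing matrix standing in
    for the dual-level router."""
    bands, h, w = cube.shape
    groups = cube.reshape(n_groups, bands // n_groups, h, w)
    pooled = groups.mean(axis=1)                    # (G, H, W) group summaries
    logits = np.einsum('gk,khw->ghw', router_w, pooled)
    weights = softmax(logits, axis=0)               # sums to 1 at every pixel
    return (pooled * weights).sum(axis=0), weights  # fused (H, W) response
```

Because the weights are computed per pixel, each spatial region can favor the band group that discriminates smoke best there.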
cs.CV / 63 / 2602.10875

Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis

Stride-Net:面向公平性的解耦表示学习在胸部X光诊断中的应用
Rashid, Darakshan, Imam, Raza, Mahapatra, Dwarikanath, Lall, Brejesh
Abstract
Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at https://github.com/Daraksh/Fairness_StrideNet.
Chinese Translation
用于胸部X光分类的深度神经网络在平均性能上表现优异,但在特定人口子群体中往往表现不佳,这引发了对临床安全性和公平性的严重关注。现有的去偏见方法通常在不同数据集上产生不一致的改进,或者通过降低整体诊断效用来实现公平,将公平视为事后约束,而非学习表示的固有属性。在本研究中,我们提出了Stride-Net(通过解耦和可学习掩模与嵌入对齐的敏感属性韧性学习),这是一个面向公平性的框架,旨在为胸部X光分析学习疾病区分性且与人口统计无关的表示。Stride-Net在补丁级别操作,使用可学习的基于步幅的掩模选择与标签对齐的图像区域,同时通过对抗混淆损失抑制敏感属性信息。为了将表示锚定在临床语义中并抑制捷径学习,我们进一步通过群体最优传输强制图像特征与基于BioBERT的疾病标签嵌入之间的语义对齐。我们在MIMIC-CXR和CheXpert基准上评估了Stride-Net,涵盖种族和交叉种族-性别子群体。在包括ResNet和视觉变换器在内的多种架构中,Stride-Net始终改善了公平性指标,同时匹配或超越了基线准确性,实现了比以往去偏见方法更有利的准确性-公平性权衡。我们的代码可在https://github.com/Daraksh/Fairness_StrideNet获取。
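One common form of an adversarial confusion loss is cross-entropy against a uniform target over the sensitive-attribute classes; whether Stride-Net uses exactly this form is an assumption, but it illustrates the mechanism of suppressing demographic information:

```python
import numpy as np

def confusion_loss(logits):
    """Cross-entropy between the sensitive-attribute classifier's softmax
    output and a uniform target; minimized (at log K) exactly when the
    classifier cannot tell the K subgroups apart."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p).mean()   # mean over samples and the K classes
```

Minimizing this loss through the feature extractor pushes representations toward demographic invariance while the disease head is trained normally.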
cs.CV / 64 / 2602.10880

Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation

图表规范:在图表到代码生成中激励 VLM 推理的结构化表示
He, Minggui, Dai, Mingchen, Zhang, Jian, Liu, Yilun, Tao, Shimin, Zeng, Pufan, Yoshie, Osamu, Ieiri, Yuya
Abstract
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper
Chinese Translation
视觉-语言模型(VLMs)在从图表图像生成绘图代码方面展现了良好的前景,但实现结构保真性仍然具有挑战性。现有方法主要依赖于监督微调,促使表面层次的标记模仿,而非忠实建模底层图表结构,这常常导致幻觉或语义不一致的输出。我们提出了图表规范(Chart Specification),一种结构化的中间表示,旨在将训练从文本模仿转向语义基础的监督。图表规范过滤语法噪声,以构建结构平衡的训练集,并支持 Spec-Align 奖励,提供关于结构正确性的细粒度、可验证反馈,从而使强化学习能够强制执行一致的绘图逻辑。在三个公共基准上的实验表明,我们的方法始终优于先前的方法。在仅使用 3000 个训练样本的情况下,我们实现了强大的数据效率,在复杂基准上超越领先基线高达 61.7%,并且在 4000 个样本的情况下,在所有评估指标上建立了新的最先进结果。总体而言,我们的结果表明,精确的结构监督为高保真图表到代码生成提供了一条高效的途径。代码和数据集可在以下链接获取:https://github.com/Mighten/chart-specification-paper
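A verifiable structural reward can be sketched as field-level agreement between a predicted and a reference chart specification. The field names below are illustrative, not the paper's actual schema:

```python
def spec_align_reward(pred_spec, ref_spec):
    """Toy structural reward: the fraction of reference chart-spec fields
    (chart type, axes, series, ...) that the predicted spec reproduces
    exactly, giving fine-grained rather than all-or-nothing feedback."""
    hits = sum(pred_spec.get(k) == v for k, v in ref_spec.items())
    return hits / len(ref_spec)
```

Unlike token-level imitation, each field is independently checkable, which is what makes the reward usable for reinforcement learning.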
cs.CV / 65 / 2602.10884

ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving

ResWorld:用于端到端自动驾驶的时间残差世界模型
Zhang, Jinqing, Fu, Zehua, Xu, Zelin, Dai, Wenying, Liu, Qingjie, Wang, Yunhong
Abstract
The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
Chinese Translation
世界模型对驾驶场景的全面理解能力显著提高了端到端自动驾驶框架的规划准确性。然而,静态区域的冗余建模以及与轨迹缺乏深度交互,阻碍了世界模型充分发挥其效能。本文提出了时间残差世界模型(Temporal Residual World Model, TR-World),重点关注动态物体建模。通过计算场景表示的时间残差,可以在不依赖于检测和跟踪的情况下提取动态物体的信息。TR-World仅将时间残差作为输入,从而更精确地预测动态物体的未来空间分布。通过将预测与当前鸟瞰视图(Bird's Eye View, BEV)特征中包含的静态物体信息相结合,可以获得准确的未来BEV特征。此外,我们提出了未来引导轨迹精炼(Future-Guided Trajectory Refinement, FGTR)模块,该模块在先前轨迹(从当前场景表示预测)和未来BEV特征之间进行交互。该模块不仅可以利用未来道路条件来精炼轨迹,还为未来BEV特征提供稀疏的时空监督,以防止世界模型崩溃。在nuScenes和NAVSIM数据集上进行的综合实验表明,我们的方法,即ResWorld,达到了最先进的规划性能。代码可在https://github.com/mengtan00/ResWorld.git获取。
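The temporal-residual idea can be sketched directly on BEV feature maps. The thresholding rule and the composition step below are simplified assumptions about how static and dynamic content might be recombined, not ResWorld's exact operators:

```python
import numpy as np

def temporal_residual(bev_t, bev_prev, tau=1e-3):
    """Residual between consecutive (C, H, W) BEV feature maps; cells whose
    features barely change are treated as static, the rest as evidence of
    dynamic objects (no detector or tracker involved)."""
    res = bev_t - bev_prev
    dynamic_mask = np.abs(res).max(axis=0) > tau     # (H, W)
    return res, dynamic_mask

def compose_future_bev(bev_t, predicted_dynamic, dynamic_mask):
    """Keep static content from the current BEV and overwrite only the
    dynamic cells with the world model's prediction."""
    future = bev_t.copy()
    future[:, dynamic_mask] = predicted_dynamic[:, dynamic_mask]
    return future
```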
cs.CV / 66 / 2602.10940

FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

FastUSP:一种用于分布式扩散模型推理的多级协同加速框架
Li, Guandong
Abstract
Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbf{FastUSP}, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent \textbf{1.12$\times$--1.16$\times$} end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves \textbf{1.09$\times$} speedup on 2 GPUs; on 4--8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30$\times$--1.46$\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead -- rather than communication latency -- is the primary bottleneck on modern high-bandwidth GPU interconnects.
Chinese Translation
大规模扩散模型如FLUX(120亿参数)和Stable Diffusion 3(80亿参数)需要多GPU并行处理以实现高效推理。统一序列并行性(Unified Sequence Parallelism, USP)结合了Ulysses和Ring注意机制,已成为分布式注意计算的最先进方法。然而,现有的USP实现存在显著的低效问题,包括过高的内核启动开销和次优的计算-通信调度。本文提出了FastUSP,一个多级优化框架,集成了编译级优化(使用CUDA图的图编译和计算-通信重排序)、通信级优化(FP8量化的集体通信)和操作级优化(带双缓冲的流水线Ring注意)。我们在FLUX(120亿)和Qwen-Image模型上评估了FastUSP,使用2、4和8个NVIDIA RTX 5090 GPU。在FLUX上,FastUSP实现了相对于基线USP的一致的1.12×–1.16×端到端加速,其中编译级优化贡献了主要的改进。在Qwen-Image上,FastUSP在2个GPU上实现了1.09×的加速;在4到8个GPU上,我们发现了一个与Ring注意机制的PyTorch Inductor兼容性限制,阻止了编译优化,而基线USP的性能则扩展到2-GPU性能的1.30×–1.46×。我们进一步提供了分布式扩散推理性能特征的详细分析,揭示了内核启动开销而非通信延迟是现代高带宽GPU互连的主要瓶颈。
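The numerics behind pipelined Ring attention can be shown with the online-softmax recurrence: attention is accumulated one KV chunk at a time, so each chunk can be transferred from a neighbouring GPU while the previous one is computed. This single-process numpy sketch shows only the math, not FastUSP's distributed double-buffering:

```python
import numpy as np

def ring_attention(q, kv_chunks):
    """Attention for one query block accumulated over KV chunks with the
    online-softmax recurrence used by Ring attention. In a real ring, each
    (k, v) pair would arrive from the neighbouring device per step."""
    m = np.full(q.shape[0], -np.inf)              # running row max
    denom = np.zeros(q.shape[0])                  # running softmax denominator
    acc = np.zeros((q.shape[0], kv_chunks[0][1].shape[1]))
    for k, v in kv_chunks:
        s = q @ k.T                               # scores for this chunk
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        denom = denom * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ v
        m = m_new
    return acc / denom[:, None]
```

The result is bitwise-equivalent (up to floating point) to full softmax attention over the concatenated KV, which is what makes chunked scheduling safe.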
cs.CV / 67 / 2602.10943

Towards Learning a Generalizable 3D Scene Representation from 2D Observations

从2D观察中学习可泛化的3D场景表示
Gromniak, Martin, Habekost, Jan-Gerrit, Kamp, Sebastian, Magg, Sven, Wermter, Stefan
Abstract
We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves 26mm reconstruction error, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.
Chinese Translation
我们提出了一种可泛化的神经辐射场(Generalizable Neural Radiance Field)方法,用于从自我中心的机器人观察中预测3D工作空间的占用情况。与之前在相机中心坐标下操作的方法不同,我们的模型在全局工作空间框架中构建占用表示,使其能够直接应用于机器人操作。该模型整合了灵活的源视图,并且能够在没有场景特定微调的情况下对未见过的物体排列进行泛化。我们在一个类人机器人上演示了该方法,并将预测的几何形状与3D传感器的真实值进行了评估。在40个真实场景上训练后,我们的模型在重建误差上达到了26毫米,包括被遮挡区域,验证了其在超越传统立体视觉方法的情况下推断完整3D占用的能力。
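The distinction between camera-centric and global workspace coordinates comes down to applying the camera extrinsics when querying occupancy; a minimal sketch, with rotation `R` and translation `t` assumed known from calibration:

```python
import numpy as np

def cam_to_world(points_cam, R, t):
    """Map (N, 3) camera-frame points into the fixed workspace frame via
    the camera extrinsics, so occupancy is expressed in global coordinates
    rather than per-camera ones."""
    return points_cam @ R.T + t
```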
cs.CV / 68 / 2602.10967

Healthy Harvests: A Comparative Look at Guava Disease Classification Using InceptionV3

健康收成:基于 InceptionV3 的番石榴疾病分类比较研究
Ghosh, Samanta, Anika, Shaila Afroz, Ahmed, Umma Habiba, Alam, B. M. Shahria, Noor, Mohammad Tahmid, Niloy, Nishat Tasnim
Abstract
Guava fruits often suffer from many diseases, which can harm fruit quality and crop yield. Early identification is important for minimizing damage and ensuring fruit health. This study focuses on three categories for classifying diseases: Anthracnose, Fruit flies, and Healthy fruit. The dataset used in this study was collected from Mendeley Data and contains 473 original images of guava, varying in size and format. The original images were resized to 256x256 pixels in RGB color mode for consistency. Data augmentation was then applied to improve the dataset by generating variations of the original images, yielding an augmented dataset of 3784 images produced with advanced preprocessing techniques. Two deep learning models were implemented to classify the images. The InceptionV3 model is well known for its advanced framework, applying multiple convolutional filters to extract different features effectively. The ResNet50 model, in turn, helps to train deeper networks by using residual learning. The InceptionV3 model achieved an impressive accuracy of 98.15%, and ResNet50 reached 94.46%. Data mixing methods such as CutMix and MixUp were applied to enhance the models' robustness. Confusion matrices were used to evaluate the overall performance of both InceptionV3 and ResNet50. Additionally, SHAP analysis is used to improve interpretability, helping to identify the parts of the image most significant for the model's prediction. This study aims to highlight how advanced models enhan
Chinese Translation
番石榴果实常常受到多种疾病的影响,这会损害果实质量和产量。早期识别对于减少损害和确保果实健康至关重要。本研究聚焦于三种不同类别的疾病分类,分别为炭疽病、果蝇和健康果实。研究中使用的数据集来自 Mendeley Data,包含 473 张原始番石榴图像,这些图像在大小和格式上各不相同。原始数据集被调整为 256x256 像素的 RGB 颜色模式,以提高一致性。随后,应用数据增强过程,通过生成原始图像的变体来改善数据集。增强后的数据集包含 3784 张图像,采用了先进的预处理技术。实现了两个深度学习模型来对图像进行分类。InceptionV3 模型因其先进的框架而广为人知,能够有效地应用多个卷积滤波器以获取不同特征。另一方面,ResNet50 模型通过使用残差学习来帮助训练更深的网络。InceptionV3 模型达到了 98.15% 的令人印象深刻的准确率,而 ResNet50 的准确率为 94.46%。应用了 CutMix 和 MixUp 等数据混合方法来增强模型的鲁棒性。混淆矩阵用于评估 InceptionV3 和 ResNet50 的整体模型性能。此外,使用 SHAP 分析来提高可解释性,帮助找到对模型预测重要的图像部分。本研究旨在强调先进模型如何增强
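MixUp, one of the two data-mixing methods the abstract names, is standard enough to sketch exactly: a convex combination of two images and their one-hot labels with a Beta-distributed ratio (the `alpha=0.4` default here is a common choice, not necessarily the study's setting):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """Standard MixUp: convex combination of two images and their one-hot
    labels with a mixing ratio drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

CutMix differs only in that the mixing happens over a rectangular image patch rather than the whole pixel grid.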
cs.CV / 69 / 2602.10978

VFGS-Net: Frequency-Guided State-Space Learning for Topology-Preserving Retinal Vessel Segmentation

VFGS-Net:基于频率引导的状态空间学习用于拓扑保持的视网膜血管分割
Song, Ruiqi, Liu, Lei, Zhang, Ya-Nan, Wang, Chao, Li, Xiaoning, Mu, Nan
Abstract
Accurate retinal vessel segmentation is a critical prerequisite for quantitative analysis of retinal images and computer-aided diagnosis of vascular diseases such as diabetic retinopathy. However, the elongated morphology, wide scale variation, and low contrast of retinal vessels pose significant challenges for existing methods, making it difficult to simultaneously preserve fine capillaries and maintain global topological continuity. To address these challenges, we propose the Vessel-aware Frequency-domain and Global Spatial modeling Network (VFGS-Net), an end-to-end segmentation framework that seamlessly integrates frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial state-space modeling within a unified architecture. Specifically, VFGS-Net employs a dual-path feature convolution module to jointly capture fine-grained local textures and multi-scale contextual semantics. A novel vessel-aware frequency-domain channel attention mechanism is introduced to adaptively reweight spectral components, thereby enhancing vessel-relevant responses in high-level features. Furthermore, at the network bottleneck, we propose a bidirectional asymmetric Mamba2-based spatial modeling block to efficiently capture long-range spatial dependencies and strengthen the global continuity of vascular structures. Extensive experiments on four publicly available retinal vessel datasets demonstrate that VFGS-Net achieves competitive or superior performance compared to state-of-the-art methods. Notably, our model consistently improves segmentation accuracy for fine vessels, complex branching patterns, and low-contrast regions, highlighting its robustness and clinical potential.
Chinese Translation
准确的视网膜血管分割是视网膜图像定量分析和计算机辅助诊断血管疾病(如糖尿病视网膜病变)的关键前提。然而,视网膜血管的细长形态、广泛的尺度变化和低对比度给现有方法带来了重大挑战,使得同时保持细小毛细血管和维持全局拓扑连续性变得困难。为了解决这些挑战,我们提出了血管感知频域和全局空间建模网络(VFGS-Net),这是一个端到端的分割框架,能够在统一架构中无缝集成频率感知特征增强、双路径卷积表示学习和双向不对称空间状态空间建模。具体而言,VFGS-Net采用双路径特征卷积模块共同捕捉细粒度的局部纹理和多尺度的上下文语义。我们引入了一种新颖的血管感知频域通道注意机制,以自适应地重新加权光谱分量,从而增强高层特征中与血管相关的响应。此外,在网络瓶颈处,我们提出了一种基于双向不对称Mamba2的空间建模模块,以高效捕捉长距离空间依赖关系并增强血管结构的全局连续性。在四个公开可用的视网膜血管数据集上的广泛实验表明,VFGS-Net在性能上与最先进的方法相比具有竞争力或优越性。值得注意的是,我们的模型在细小血管、复杂分支模式和低对比度区域的分割准确性上始终有所提高,突显了其稳健性和临床潜力。
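The core of a frequency-domain channel attention can be sketched as reweighting each channel's Fourier spectrum and transforming back. This is a much-simplified stand-in for VFGS-Net's vessel-aware mechanism: the gains here are a fixed per-channel vector, whereas the paper's are predicted adaptively:

```python
import numpy as np

def frequency_channel_attention(x, gains):
    """Reweight each channel's 2D Fourier spectrum of a (C, H, W) feature
    map with a per-channel gain and transform back; gains == 1 is the
    identity."""
    spec = np.fft.fft2(x, axes=(-2, -1))          # (C, H, W) complex spectrum
    spec *= gains[:, None, None]                  # per-channel spectral gain
    return np.fft.ifft2(spec, axes=(-2, -1)).real
```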
cs.CV / 70 / 2602.10985

DFIC: Towards a balanced facial image dataset for automatic ICAO compliance verification

DFIC:朝着平衡的面部图像数据集以实现自动ICAO合规性验证
Gonçalves, Nuno, Nunes, Diogo, Guerra, Carla, Marcos, João
Abstract
Ensuring compliance with ISO/IEC and ICAO standards for facial images in machine-readable travel documents (MRTDs) is essential for reliable identity verification, but current manual inspection methods are inefficient in high-demand environments. This paper introduces the DFIC dataset, a novel comprehensive facial image dataset comprising around 58,000 annotated images and 2706 videos of more than 1000 subjects, that cover a broad range of non-compliant conditions, in addition to compliant portraits. Our dataset provides a more balanced demographic distribution than the existing public datasets, with one partition that is nearly uniformly distributed, facilitating the development of automated ICAO compliance verification methods. Using DFIC, we fine-tuned a novel method that heavily relies on spatial attention mechanisms for the automatic validation of ICAO compliance requirements, and we have compared it with the state-of-the-art aimed at ICAO compliance verification, demonstrating improved results. DFIC dataset is now made public (https://github.com/visteam-isr-uc/DFIC) for the training and validation of new models, offering an unprecedented diversity of faces, that will improve both robustness and adaptability to the intrinsically diverse combinations of faces and props that can be presented to the validation system. These results emphasize the potential of DFIC to enhance automated ICAO compliance methods but it can also be used in many other applications that aim to improve the security, privacy, and fairness of facial recognition systems.
Chinese Translation
确保机器可读旅行证件(MRTDs)中面部图像符合ISO/IEC和ICAO标准对于可靠的身份验证至关重要,但当前的人工检查方法在高需求环境中效率低下。本文介绍了DFIC数据集,这是一个新颖的综合面部图像数据集,包含来自1000多名受试者的约58,000张标注图像和2706段视频,除合规肖像外,还涵盖了广泛的非合规条件。我们的数据集提供了比现有公共数据集更为平衡的人口分布,其中一个分区几乎均匀分布,促进了自动ICAO合规性验证方法的开发。利用DFIC,我们微调了一种新方法,该方法在自动验证ICAO合规性要求时严重依赖空间注意机制,并与当前最先进的ICAO合规性验证方法进行了比较,显示出改进的结果。DFIC数据集现已公开(https://github.com/visteam-isr-uc/DFIC),供新模型的训练和验证,提供了前所未有的面孔多样性,将提高对验证系统所呈现的面孔和道具的内在多样组合的鲁棒性和适应性。这些结果强调了DFIC在增强自动ICAO合规性方法方面的潜力,但它也可以用于许多其他旨在提高面部识别系统的安全性、隐私性和公平性的应用。
cs.CV / 71 / 2602.10994

Interpretable Vision Transformers in Image Classification via SVDA

通过 SVDA 实现图像分类中的可解释视觉变换器
Arampatzakis, Vasileios, Pavlidis, George, Mitianoudis, Nikolaos, Papamarkos, Nikos
Abstract
Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
Chinese Translation
视觉变换器(ViTs)在图像分类中取得了最先进的性能,但其注意机制往往不够透明,并表现出密集且非结构化的行为。在本研究中,我们将之前提出的基于奇异值分解的注意力机制(SVD-Inspired Attention, SVDA)适配到 ViT 架构中,引入了一种几何基础的公式,增强了可解释性、稀疏性和谱结构。我们应用可解释性指标——最初与 SVDA 一起提出——来监测训练过程中的注意力动态,并评估学习到的表示的结构特性。在四个广泛使用的基准数据集(CIFAR-10、FashionMNIST、CIFAR-100 和 ImageNet-100)上的实验评估表明,SVDA 一直能够产生更可解释的注意力模式,而不牺牲分类准确性。尽管当前框架提供的是描述性见解而非处方性指导,但我们的结果确立了 SVDA 作为分析和开发计算机视觉中结构化注意力模型的全面且信息丰富的工具的地位。本研究为未来在可解释人工智能、谱诊断和基于注意力的模型压缩方面的进展奠定了基础。
cs.CV / 72 / 2602.11004

Enhancing Predictability of Multi-Tenant DNN Inference for Autonomous Vehicles' Perception

增强多租户深度神经网络推理在自动驾驶车辆感知中的可预测性
Liu, Liangkai, Shin, Kang G., Lee, Jinkyu, Yang, Chengmo, Shi, Weisong
Abstract
Autonomous vehicles (AVs) rely on sensors and deep neural networks (DNNs) to perceive their surrounding environment and make maneuver decisions in real time. However, achieving real-time DNN inference in the AV's perception pipeline is challenging due to the large gap between the computation requirement and the AV's limited resources. Most, if not all, existing studies focus on optimizing the DNN inference time to achieve faster perception by compressing the DNN model with pruning and quantization. In contrast, we present a Predictable Perception system with DNNs (PP-DNN) that reduces the amount of image data to be processed while maintaining the same level of accuracy for multi-tenant DNNs by dynamically selecting critical frames and regions of interest (ROIs). PP-DNN is based on our key insight that critical frames and ROIs for AVs vary with the AV's surrounding environment. However, it is challenging to identify and use critical frames and ROIs in multi-tenant DNNs for predictable inference. Given image-frame streams, PP-DNN leverages an ROI generator to identify critical frames and ROIs based on the similarities of consecutive frames and traffic scenarios. PP-DNN then leverages a FLOPs predictor to predict multiply-accumulate operations (MACs) from the dynamic critical frames and ROIs. The ROI scheduler coordinates the processing of critical frames and ROIs with multiple DNN models. Finally, we design a detection predictor for the perception of non-critical frames. We have implemented PP-DNN in an ROS-based AV pipeline and evaluated it with the BDD100K and the nuScenes dataset. PP-DNN is observed to significantly enhance perception predictability, increasing the number of fusion frames by up to 7.3x, reducing the fusion delay by >2.6x and fusion-delay variations by >2.3x, improving detection completeness by 75.4% and the cost-effectiveness by up to 98% over the baseline.
Chinese Translation
自动驾驶车辆(AVs)依赖传感器和深度神经网络(DNNs)实时感知周围环境并做出操控决策。然而,由于计算需求与自动驾驶车辆有限资源之间存在巨大差距,实现实时DNN推理在自动驾驶车辆的感知管道中面临挑战。现有研究大多集中于通过剪枝和量化压缩DNN模型来优化DNN推理时间,以实现更快的感知。相比之下,我们提出了一种基于DNN的可预测感知系统(PP-DNN),通过动态选择关键帧和感兴趣区域(ROIs),在保持多租户DNN相同准确度的同时减少待处理的图像数据量。PP-DNN基于我们的关键洞察,即自动驾驶车辆的关键帧和ROIs会随着周围环境的变化而变化。然而,在多租户DNN中识别和使用关键帧及ROIs以实现可预测推理是具有挑战性的。在给定的图像帧流中,PP-DNN利用ROIs生成器根据连续帧和交通场景的相似性来识别关键帧和ROIs。然后,PP-DNN利用FLOPs预测器预测来自动态关键帧和ROIs的乘加操作(MACs)。ROIs调度器协调多个DNN模型对关键帧和ROIs的处理。最后,我们设计了一个检测预测器用于非关键帧的感知。我们在基于ROS的自动驾驶车辆管道中实现了PP-DNN,并使用BDD100K和nuScenes数据集进行了评估。PP-DNN显著增强了感知的可预测性,融合帧数量最多提高了7.3倍,融合延迟减少超过2.6倍,融合延迟变化减少超过2.3倍,检测完整性提高了75.4%,成本效益提高了高达98%相较于基线。
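Critical-frame selection from consecutive-frame similarity can be sketched with a simple change threshold. The mean-absolute-difference criterion and `tau` below are illustrative assumptions; PP-DNN's generator also conditions on the traffic scenario:

```python
import numpy as np

def select_critical_frames(frames, tau=0.05):
    """Mark a frame as critical when it differs enough from the last
    critical frame (mean absolute pixel change above tau); in-between
    frames would be handled by the cheaper detection predictor."""
    critical = [0]                      # the first frame is always processed
    ref = frames[0]
    for i, f in enumerate(frames[1:], start=1):
        if np.abs(f - ref).mean() > tau:
            critical.append(i)
            ref = f
    return critical
```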
cs.CV / 73 / 2602.11005

Interpretable Vision Transformers in Monocular Depth Estimation via SVDA

通过SVDA实现单目深度估计中的可解释视觉变换器
Arampatzakis, Vasileios, Pavlidis, George, Mitianoudis, Nikolaos, Papamarkos, Nikos
Abstract
Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
Chinese Translation
单目深度估计是计算机视觉中的一个核心问题,广泛应用于机器人技术、增强现实(AR)和自动驾驶等领域。然而,驱动现代变换器架构的自注意力机制仍然不够透明。我们将受奇异值分解(SVD)启发的注意力(SVD-Inspired Attention, SVDA)引入到密集预测变换器(Dense Prediction Transformer, DPT)中,提供了密集预测任务中注意力的首个谱结构化公式。SVDA通过将可学习的对角矩阵嵌入到归一化的查询-键交互中,将方向对齐与谱调制解耦,从而实现内在可解释的注意力图,而非事后近似。对KITTI和NYU-v2数据集的实验表明,SVDA在保持或略微提高预测准确性的同时,仅增加了少量计算开销。更重要的是,SVDA解锁了六个谱指标,量化了熵、秩、稀疏性、对齐性、选择性和鲁棒性。这些指标揭示了在训练过程中注意力组织的跨数据集和深度方向的一致模式,这些见解在标准变换器中是无法获得的。通过将注意力的角色从不透明机制转变为可量化描述符,SVDA重新定义了单目深度估计中的可解释性,并为透明的密集预测模型开辟了一条原则性途径。
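The SVDA formulation the abstract describes — a learnable diagonal matrix embedded into normalized query-key interactions — can be sketched in a few lines. Exact normalization and scaling details are assumptions here:

```python
import numpy as np

def svda_attention(q, k, v, sigma):
    """SVD-inspired attention sketch: row-normalize queries and keys, then
    modulate their interaction with a learnable diagonal (spectral gains
    sigma) before the softmax."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=1, keepdims=True)
    scores = (qn * sigma) @ kn.T          # equals qn @ diag(sigma) @ kn.T
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p @ v
```

Because `sigma` acts per feature dimension, its learned values are directly inspectable, which is where the paper's spectral indicators (entropy, rank, sparsity, ...) get their grip.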
cs.CV / 74 / 2602.11007

LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

LaSSM:通过局部聚合和状态空间模型实现高效的语义-空间查询解码用于3D实例分割
Yao, Lei, Wang, Yi, Cui, Yawen, Liu, Moyun, Chau, Lap-Pui
Abstract
Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating associated coordinates information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first place on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at https://github.com/RayYoh/LaSSM.
Chinese Translation
基于查询的3D场景实例分割从点云中取得了显著的性能。然而,现有方法由于点云的稀疏特性而面临查询初始化困境,并依赖于计算密集型的注意力机制在查询解码器中。因此,我们提出了LaSSM,优先考虑简单性和效率,同时保持竞争力的性能。具体而言,我们提出了一种分层语义-空间查询初始化器,通过考虑语义线索和空间分布从超点中推导查询集,实现全面的场景覆盖和加速收敛。我们进一步提出了一种坐标引导的状态空间模型(SSM)解码器,逐步细化查询。该新型解码器具有局部聚合方案,限制模型专注于几何一致区域,以及一个空间双路径SSM模块,通过整合相关坐标信息捕捉查询集中的潜在依赖关系。我们的设计使得实例预测高效,避免了噪声信息的引入,并减少了冗余计算。LaSSM在最新的ScanNet++ V2排行榜中名列第一,超越了之前最佳方法2.5%的mAP,仅需1/3的FLOPs,展示了其在挑战性大规模场景实例分割中的优越性。LaSSM在ScanNet、ScanNet200、S3DIS和ScanNet++ V1基准测试中也以较低的计算成本实现了竞争力的性能。大量消融研究和定性结果验证了我们设计的有效性。代码和权重可在https://github.com/RayYoh/LaSSM获取。
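One classical way to obtain the spatially well-spread query set the initializer aims for is greedy farthest-point sampling over superpoint centroids; whether LaSSM uses FPS specifically is an assumption, and the semantic-cue half of the initializer is omitted here:

```python
import numpy as np

def farthest_point_sampling(points, n_queries):
    """Greedy farthest-point sampling over (N, D) centroids: repeatedly
    pick the point furthest from everything chosen so far, giving
    comprehensive spatial coverage of the scene."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    while len(chosen) < n_queries:
        idx = int(d.argmax())
        chosen.append(idx)
        d = np.minimum(d, np.linalg.norm(points - points[idx], axis=1))
    return chosen
```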
cs.CV / 75 / 2602.11024

Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

链式视觉推理用于密集手术器械计数
Bhyri, Rishikesh, Quaranto, Brian R, Seger, Philip J, Tung, Kaity, Fox, Brendan, Yang, Gene, Schwaitzberg, Steven D., Yuan, Junsong, Xi, Nan, Kim, Peter C W
Abstract
Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodality Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.
Chinese Translation
在手术室中准确计数手术器械是确保手术期间患者安全的关键前提。尽管近年来大型视觉语言模型和智能代理的进展显著,但在器械紧密聚集的密集场景中,准确计数这些器械仍然极具挑战性。为了解决这一问题,我们提出了链式视觉推理(Chain-of-Look),这是一种新颖的视觉推理框架,通过强制建立结构化的视觉链来模仿人类的顺序计数过程,而不是依赖于无序的经典物体检测。该视觉链引导模型沿着一致的空间轨迹进行计数,从而提高复杂场景中的准确性。为了进一步增强视觉链的物理合理性,我们引入了邻近损失函数,该函数明确建模了密集打包手术器械固有的空间约束。我们还提出了SurgCount-HD,这是一个包含1,464张高密度手术器械图像的新数据集。大量实验表明,我们的方法在密集手术器械计数这一具有挑战性的任务中,优于现有的最先进方法(例如,CountGD、REC)以及多模态大型语言模型(例如,Qwen、ChatGPT)。
cs.CV / 76 / 2602.11066

PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation

PuriLight:一种轻量级洗牌与净化框架用于单目深度估计
Chen, Yujie, Zhang, Li, Chu, Xiaomeng, Zhang, Tian
Abstract
We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Codes will be available at https://github.com/ishrouder/PuriLight.
Chinese Translation
我们提出了PuriLight,一个轻量级且高效的自监督单目深度估计框架,以应对计算效率和细节保留的双重挑战。尽管最近在自监督深度估计方面的进展减少了对真实标签监督的依赖,但现有方法仍受制于两类问题:要么架构笨重而影响实用性,要么模型虽轻量却牺牲了结构精度。这两种限制突显了开发轻量级但结构精确架构的迫切需求。我们的框架通过一个三阶段架构解决了这些限制,包含三个新颖的模块:用于局部特征提取的洗牌膨胀卷积(Shuffle-Dilation Convolution, SDC)模块、用于层次特征增强的旋转自适应核注意力(Rotation-Adaptive Kernel Attention, RAKA)模块,以及用于全局特征净化的深频信号净化(Deep Frequency Signal Purification, DFSP)模块。通过有效的协作,这些模块使PuriLight能够实现轻量级和准确的特征提取与处理。大量实验表明,PuriLight在保持卓越计算效率的同时,以最少的训练参数实现了最先进的性能。代码将发布在 https://github.com/ishrouder/PuriLight。
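The "shuffle" ingredient of a Shuffle-Dilation Convolution presumably builds on the ShuffleNet-style channel shuffle, which is cheap enough for lightweight models; how SDC combines it with dilation is not specified in the abstract, so only the standard shuffle op is sketched:

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle on a (C, H, W) tensor: interleave
    channels across groups via reshape-transpose-reshape, so subsequent
    grouped convolutions can exchange information."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))
```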
cs.CV / 77 / 2602.11073

Chatting with Images for Introspective Visual Thinking

与图像对话以进行内省视觉思维
Wu, Junfei, Guan, Jian, Liu, Qiang, Wu, Shu, Wang, Liang, Wu, Wei, Tan, Tieniu
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently the proposal of ''thinking with images'' attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment - particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ''chatting with images'', a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and trained it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
Chinese Translation
当前的大型视觉-语言模型(LVLMs)通常依赖于基于单次视觉编码的文本推理,这往往导致细粒度视觉信息的丧失。最近提出的“用图像思考”试图通过外部工具或代码操控图像来缓解这一限制;然而,所产生的视觉状态往往与语言语义的基础联系不足,影响了有效的跨模态对齐——尤其是在需要跨越遥远区域或多幅图像推理视觉语义或几何关系时。为了解决这些挑战,我们提出了“与图像对话”,这是一个将视觉操控重新框定为语言引导特征调制的新框架。在表达性语言提示的指导下,该模型动态地对多个图像区域进行联合重编码,从而实现语言推理与视觉状态更新之间的更紧密耦合。我们在ViLaVT中实例化了这一范式,ViLaVT是一种新型的LVLM,配备了专为此类交互式视觉推理设计的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程进行训练,以促进有效的推理行为。在八个基准测试中的广泛实验表明,ViLaVT在复杂的多图像和基于视频的空间推理任务上取得了显著且一致的改进,尤其是在这些任务上表现出明显的提升。
cs.CV / 78 / 2602.11086

First International StepUP Competition for Biometric Footstep Recognition: Methods, Results and Remaining Challenges

首届国际StepUP生物特征脚步识别竞赛:方法、结果与尚存挑战
Larracy, Robyn, MacDonald, Eve, Phinyomark, Angkoon, Rezaei, Saeid, Laghaei, Mahdi, Hajighasem, Ali, Tabor, Aaron, Scheme, Erik
Abstract
Biometric footstep recognition, based on a person's unique pressure patterns under their feet during walking, is an emerging field with growing applications in security and safety. However, progress in this area has been limited by the lack of large, diverse datasets necessary to address critical challenges such as generalization to new users and robustness to shifts in factors like footwear or walking speed. The recent release of the UNB StepUP-P150 dataset, the largest and most comprehensive collection of high-resolution footstep pressure recordings to date, opens new opportunities for addressing these challenges through deep learning. To mark this milestone, the First International StepUP Competition for Biometric Footstep Recognition was launched. Competitors were tasked with developing robust recognition models using the StepUP-P150 dataset that were then evaluated on a separate, dedicated test set designed to assess verification performance under challenging variations, given limited and relatively homogeneous reference data. The competition attracted global participation, with 23 registered teams from academia and industry. The top-performing team, Saeid_UCC, achieved the best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy. Overall, the competition showcased strong solutions, but persistent challenges in generalizing to unfamiliar footwear highlight a critical area for future work.
Chinese Translation
基于个体行走时足底独特压力模式的生物特征脚步识别是一个新兴领域,在安全与防护方面的应用日益增长。然而,该领域的进展受到缺乏大型多样化数据集的限制,这些数据集对于解决如新用户的泛化能力和对鞋类或行走速度变化的鲁棒性等关键挑战至关重要。最近发布的UNB StepUP-P150数据集是迄今为止最大、最全面的高分辨率足底压力记录集合,为通过深度学习应对这些挑战开辟了新机遇。为了纪念这一里程碑,首届国际StepUP生物特征脚步识别竞赛应运而生。参赛者的任务是使用StepUP-P150数据集开发鲁棒的识别模型,这些模型随后在一个专门设计的测试集上进行评估,以在参考数据有限且相对同质的条件下考察其在具有挑战性变化下的验证性能。此次竞赛吸引了来自学术界和工业界的全球参与,共有23支注册团队。表现最佳的团队Saeid_UCC采用生成奖励机器(GRM)优化策略,达到了10.77%的最佳等错误率(EER)。总体而言,竞赛展示了强有力的解决方案,但在对不熟悉的鞋类进行泛化方面的持续挑战凸显了未来工作的关键领域。
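The competition's headline metric, equal error rate (EER), is the operating point where the false accept rate equals the false reject rate. A minimal, self-contained sketch of estimating an EER from verification scores (the scores below are toy values, not competition data):

```python
def equal_error_rate(genuine, impostor):
    """Find the threshold where false-accept and false-reject rates meet.

    genuine:  similarity scores for same-person trials (higher = more similar)
    impostor: similarity scores for different-person trials
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best = (1.0, None)  # (|FAR - FRR| gap, EER estimate)
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Toy scores: genuine pairs score high, impostors low, with some overlap.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.5, 0.35, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
```

Production systems interpolate the ROC/DET curve rather than sweeping raw scores, but the crossing-point idea is the same.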
cs.CV / 79 / 2602.11105

FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference

FastFlow:通过多臂老虎机推理加速生成式流匹配模型
Bajpai, Divya Jyoti, Bhardwaj, Dhruv, Roy, Soumya, Duseja, Tejas, Agarwal, Harsh, Sandansing, Aashay, Hanawal, Manjesh Kumar
Abstract
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at https://github.com/Div290/FastFlow.
Chinese Translation
流匹配模型在图像和视频生成中提供了最先进的保真度,但其固有的顺序去噪过程使其速度较慢。现有的加速方法如蒸馏、轨迹截断和一致性方法都是静态的,需重新训练,并且往往无法跨任务泛化。我们提出了FastFlow,一个即插即用的自适应推理框架,旨在加速流匹配模型中的生成过程。FastFlow识别出对去噪路径仅产生微小调整的去噪步骤,并在不使用用于速度预测的完整神经网络模型的情况下对其进行近似。该近似利用来自先前预测的有限差分速度估计,以高效外推未来状态,从而在去噪路径上以零计算成本实现更快的推进。这使得可以跳过中间步骤的计算。我们将在需要完整模型计算之前可以安全跳过多少步骤的决策建模为一个多臂老虎机(multi-armed bandit)问题。该老虎机学习最佳跳过策略,以平衡速度与性能。FastFlow与现有管道无缝集成,并在图像生成、视频生成和编辑任务中具有良好的泛化能力。实验表明,速度提升超过2.6倍,同时保持高质量输出。本工作的源代码可在 https://github.com/Div290/FastFlow 找到。
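The skipping mechanism described in the abstract is easy to illustrate. In the sketch below, an analytic velocity field stands in for the flow network; every other step reuses a finite-difference estimate built from the two most recent states instead of calling the "network", which is the zero-cost approximation FastFlow describes. The bandit that chooses how many steps to skip is omitted here (`skip` is fixed), and the 1-D ODE is an invented stand-in:

```python
def expensive_velocity(x, t):
    # Stand-in for the flow network's velocity prediction (here v = -x,
    # whose exact flow is x(t) = x0 * exp(-t)).
    return -x

def integrate_with_skips(x0, n_steps=20, skip=1):
    """Euler integration that calls the 'network' only every (skip+1) steps.

    Between full evaluations, the velocity is extrapolated from the finite
    difference of the two most recent states instead of a full model call.
    """
    dt = 1.0 / n_steps
    x, prev_x = x0, None
    calls = 0
    for i in range(n_steps):
        if prev_x is None or i % (skip + 1) == 0:
            v = expensive_velocity(x, i * dt)  # full model call
            calls += 1
        else:
            v = (x - prev_x) / dt              # zero-cost extrapolation
        prev_x = x
        x = x + dt * v
    return x, calls

x_final, n_calls = integrate_with_skips(1.0, n_steps=20, skip=1)
```

With `skip=1`, half the model calls are saved while the trajectory stays close to the full-evaluation one, which is the speed/quality trade-off the bandit would tune online.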
cs.CV / 80 / 2602.11117

HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion

HairWeaver:基于少量样本的逼真发丝运动合成与仿真到现实引导的视频扩散
Chang, Di, Hou, Ji, Bozic, Aljaz, Neuberger, Assaf, Juefei-Xu, Felix, Maury, Olivier, Lin, Gene Wei-Chin, Stuyck, Tuur, Roble, Doug, Soleymani, Mohammad, Grabli, Stephane
Abstract
We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject's photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.
Chinese Translation
我们提出了HairWeaver,这是一个基于扩散的管道,能够为单个人物图像赋予真实且富有表现力的发丝动态。尽管现有方法成功地控制了身体姿势,但它们对发丝的控制能力不足,因此未能捕捉到复杂的发丝运动,导致动画显得僵硬且不真实。HairWeaver通过两个专门模块克服了这一限制:一个是Motion-Context-LoRA,用于整合运动条件;另一个是Sim2Real-Domain-LoRA,用于在不同数据域中保持主体的逼真外观。这些轻量级组件旨在引导视频扩散骨干,同时保持其核心生成能力。通过在一个由CG模拟器生成的动态人类运动的专门数据集上进行训练,HairWeaver实现了对发丝运动的精细控制,并最终学会生成对运动自然响应的高度逼真的发丝。全面的评估表明,我们的方法设定了新的技术标准,能够生成具有动态细节的栩栩如生的人类发丝动画。
cs.CV / 81 / 2602.11124

PhyCritic: Multimodal Critic Models for Physical AI

PhyCritic:用于物理人工智能的多模态评判模型
Xiong, Tianyi, Wang, Shihao, Liu, Guilin, Dong, Yi, Li, Ming, Huang, Heng, Kautz, Jan, Yu, Zhiding
Abstract
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
Chinese Translation
随着大型多模态模型的快速发展,可靠的评判和批评模型已成为开放式评估和偏好对齐的关键,能够提供成对偏好、数值评分和解释性理由,以评估模型生成的响应。然而,现有的评判模型主要在一般视觉领域(如图像描述或图像问答)中进行训练,导致涉及感知、因果推理和规划的物理人工智能任务在很大程度上未得到充分探索。我们提出了PhyCritic,这是一种针对物理人工智能优化的多模态评判模型,通过两阶段的RLVR(可验证奖励的强化学习)管道实现:首先是增强物理导向的感知和推理的物理技能热身阶段,其次是自我参考评判微调阶段,在此阶段中,评判模型生成自己的预测作为内部参考,然后再对候选响应进行判断,从而提高判断的稳定性和物理正确性。在物理和通用多模态评判基准测试中,PhyCritic相较开源基线实现了显著的性能提升,并且在作为策略模型应用时,进一步改善了物理基础任务中的感知和推理能力。
cs.CV / 82 / 2602.11146

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

超越基于视觉语言模型的奖励:扩散原生潜在奖励建模
Liu, Gongye, Yang, Bo, Zhi, Yida, Zhong, Zhizhou, Ke, Lei, Deng, Didan, Gao, Han, Huang, Yongxiang, Zhang, Kaihao, Fu, Hongbo, Luo, Wenhan
Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
Chinese Translation
扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又具计算效率的奖励函数。视觉语言模型(VLMs)已成为主要的奖励提供者,利用其丰富的多模态先验来指导对齐。然而,它们的计算和内存成本可能相当高,通过像素空间奖励优化潜在扩散生成器会引入领域不匹配,从而使对齐变得复杂。本文提出了DiNa-LRM,一种扩散原生潜在奖励模型,直接在噪声扩散状态上进行偏好学习。我们的方法引入了一种噪声校准的Thurstone似然,具有依赖于扩散噪声的不确定性。DiNa-LRM利用一个预训练的潜在扩散主干网络,配备时间步条件的奖励头,并支持推理时的噪声集成,为测试时扩展(test-time scaling)与鲁棒奖励提供了一种扩散原生机制。在图像对齐基准测试中,DiNa-LRM显著超越现有的基于扩散的奖励基线,并在计算成本仅为其一小部分的情况下,达到了与最先进的VLMs相竞争的性能。在偏好优化中,我们证明了DiNa-LRM改善了偏好优化动态,使模型对齐变得更快且更具资源效率。
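The "noise-calibrated Thurstone likelihood" can be read as a probit preference model whose uncertainty grows with the diffusion noise level, so comparisons made on noisier latents carry a weaker training signal. A sketch under that reading (the sigma values and reward gap are invented for illustration, not the paper's schedule):

```python
import math

def thurstone_log_likelihood(r_winner, r_loser, sigma_t):
    """Thurstone preference likelihood: P(winner > loser) = Phi(dr / sigma_t).

    sigma_t is assumed to grow with the diffusion noise level, down-weighting
    judgments made on noisier states.
    """
    z = (r_winner - r_loser) / sigma_t
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return math.log(max(phi, 1e-12))

# The same reward gap is less informative at high noise: the likelihood sits
# closer to 0.5, so its log is more negative and its gradient weaker.
low_noise = thurstone_log_likelihood(1.0, 0.0, sigma_t=0.5)
high_noise = thurstone_log_likelihood(1.0, 0.0, sigma_t=2.0)
```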
cs.CV / 83 / 2602.11154

SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos

SurfPhase:从稀疏视频重建两相流中的三维界面动力学
Gao, Yue, Yu, Hong-Xing, Chang, Sanghyeon, Fu, Qianxi, Zhu, Bo, Won, Yoonjin, Niebles, Juan Carlos, Wu, Jiajun
Abstract
Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: https://yuegao.me/SurfPhase.
Chinese Translation
两相流中的界面动力学控制着动量、热量和质量传递,但实验测量仍然困难。传统技术在移动界面附近面临固有限制,而现有的神经渲染方法则针对具有弥散边界的单相流,无法处理尖锐且可变形的液-蒸气界面。我们提出了SurfPhase,一种从稀疏相机视角重建三维界面动力学的新模型。我们的方法结合了动态高斯面元(dynamic Gaussian surfels)与带符号距离函数(signed distance function)形式化,以确保几何一致性,并利用视频扩散模型合成新视角视频,以从稀疏观测中优化重建。我们在一个新的高速池沸腾视频数据集上进行了评估,展示了仅使用两个相机视角就能实现高质量的视角合成和速度估计。项目网站:https://yuegao.me/SurfPhase。
人工智能 (Artificial Intelligence)
18
cs.AI / 1 / 2602.10324

Discovering Differences in Strategic Behavior Between Humans and LLMs

发现人类与大型语言模型(LLMs)之间战略行为的差异
Wang, Caroline, Kasenberg, Daniel, Stachenfeld, Kim, Castro, Pablo Samuel
Abstract
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.
Chinese Translation
随着大型语言模型(LLMs)在社会和战略场景中的广泛应用,理解它们的行为与人类行为之间的差异及其原因变得至关重要。尽管行为博弈理论(BGT)提供了分析行为的框架,但现有模型并未完全捕捉人类的特异性行为或像LLMs这样的黑箱非人类代理的行为。我们采用AlphaEvolve这一前沿程序发现工具,直接从数据中发现可解释的人类和LLM行为模型,从而使得对驱动人类和LLM行为的结构性因素的开放式发现成为可能。我们对迭代的剪刀石头布游戏的分析表明,前沿的LLMs能够表现出比人类更深层次的战略行为。这些结果为理解驱动人类与LLM在战略互动中行为差异的结构性差异奠定了基础。
cs.AI / 2 / 2602.10367

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

LiveMedBench:一个面向大型语言模型、采用自动评分标准评估的无污染医疗基准
Yan, Zhiling, Song, Dingjie, Fang, Zhe, Ji, Yisheng, Li, Xiang, Li, Quanzheng, Sun, Lichao
Abstract
The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that weekly harvests real-world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
Chinese Translation
在高风险临床环境中部署大型语言模型(LLMs)需要严格可靠的评估。然而,现有的医疗基准仍然是静态的,存在两个关键限制:(1)数据污染,测试集意外泄露到训练语料中,导致性能估计膨胀;(2)时间错位,未能捕捉医学知识的快速演变。此外,目前针对开放式临床推理的评估指标往往依赖于浅层的词汇重叠(例如,ROUGE)或主观的LLM作为评判者的评分,这两者都不足以验证临床正确性。为了解决这些问题,我们引入了LiveMedBench,这是一个持续更新、无污染且基于评分标准的基准,每周从在线医学社区收集真实的临床案例,确保与模型训练数据严格的时间分离。我们提出了一个多代理临床策展框架,用于过滤原始数据噪声,并根据循证医学原则验证临床完整性。为了评估,我们开发了一个基于自动评分标准的评估框架,将医生的响应分解为细化的、特定案例的标准,与LLM作为评判者相比,能够实现与专家医生的显著更强一致性。迄今为止,LiveMedBench包含2,756个跨越38个医学专业和多种语言的真实案例,配有16,702个独特的评估标准。对38个LLM的广泛评估表明,即使是表现最佳的模型也仅达到39.2%,而84%的模型在截止后案例上表现下降,确认了普遍存在的数据污染风险。错误分析进一步确定了上下文应用而非事实知识是主要瓶颈,35-48%的失败源于无法将医学知识调整为患者特定的约束。
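At its simplest, rubric-based evaluation scores a response as the fraction of case-specific criteria it satisfies. The sketch below substitutes a keyword check for the per-criterion LLM judgment the benchmark actually uses, so it stays self-contained; the clinical strings are invented examples:

```python
def rubric_score(response, criteria):
    """Fraction of rubric criteria a free-text answer satisfies.

    Real graders ask an LLM whether each criterion is met; this toy version
    checks for the criterion phrase directly in the response text.
    """
    met = sum(1 for c in criteria if c.lower() in response.lower())
    return met / len(criteria)

score = rubric_score(
    "Start empiric antibiotics and order blood cultures before dosing.",
    ["blood cultures", "empiric antibiotics", "renal dose adjustment"],
)
```

Decomposing grading into granular binary criteria is what makes the score auditable: each miss points at a specific omission rather than a single opaque number.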
cs.AI / 3 / 2602.10458

Found-RL: foundation model-enhanced reinforcement learning for autonomous driving

Found-RL:基础模型增强的自主驾驶强化学习
Qu, Yansong, Sheng, Zihao, Huang, Zilin, Chen, Jiancong, Luo, Yuhao, Wang, Tianyi, Feng, Yiheng, Labi, Samuel, Chen, Sikai
Abstract
Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms: Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can achieve near-VLM performance compared with billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
Chinese Translation
强化学习(RL)已成为端到端自主驾驶(AD)的主导范式。然而,RL在复杂场景中存在样本效率低下和缺乏语义可解释性的问题。基础模型,特别是视觉-语言模型(VLMs),能够通过提供丰富的上下文感知知识来缓解这些问题,但其高推理延迟阻碍了在高频率RL训练循环中的部署。为了解决这一问题,我们提出了Found-RL,一个利用基础模型高效增强自动驾驶强化学习的平台。其核心创新是异步批量推理框架,该框架将重型VLM推理与仿真循环解耦,有效解决了延迟瓶颈,以支持实时学习。我们引入了多样的监督机制:价值-边际正则化(VMR)和优势加权动作引导(AWAG),以有效地将专家级VLM动作建议提炼为RL策略。此外,我们采用高吞吐量的CLIP进行密集奖励塑造。我们通过条件对比动作对齐(Conditional Contrastive Action Alignment)解决了CLIP的动态盲点,该方法以离散化的速度/指令作为提示条件,并从上下文特定的动作锚评分中产生归一化的、基于边际的奖励加成。Found-RL提供了一个端到端的管道,用于微调VLM的集成,并表明轻量级RL模型在保持实时推理(约500 FPS)的同时,可以实现接近十亿参数级VLM的性能。代码、数据和模型将公开发布在 https://github.com/ys-qu/found-rl。
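The "normalized, margin-based bonus from context-specific action-anchor scoring" plausibly reduces to: embed the current frame, score it against a text anchor per candidate action, and reward the margin of the taken action over its best competitor. A toy sketch with hand-made vectors standing in for CLIP embeddings; the squashing function and anchors are assumptions for illustration, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def margin_bonus(frame_emb, anchor_embs, taken):
    """Margin between the taken action's anchor similarity and the best
    competing anchor, squashed into [0, 1].

    anchor_embs would be CLIP text embeddings of prompts conditioned on the
    discretized speed/command; here they are hypothetical vectors.
    """
    sims = [cosine(frame_emb, a) for a in anchor_embs]
    margin = sims[taken] - max(s for i, s in enumerate(sims) if i != taken)
    return 0.5 * (1.0 + math.tanh(margin))  # normalize to [0, 1]

anchors = [[1.0, 0.1], [0.0, 1.0]]       # e.g., "keep lane" vs "turn left"
bonus_good = margin_bonus([1.0, 0.0], anchors, taken=0)
bonus_bad = margin_bonus([1.0, 0.0], anchors, taken=1)
```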
cs.AI / 4 / 2602.10467

MERIT Feedback Elicits Better Bargaining in LLM Negotiators

MERIT反馈促使大型语言模型谈判者更好地进行谈判
Oh, Jihwan, Aghazada, Murad, Shin, Yooju, Yun, Se-Young, Kim, Taehyeon
Abstract
Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present a utility-feedback-centric framework. Our contributions are: (i) AgoraBench, a new benchmark spanning nine challenging settings (e.g., deception, monopoly) that supports diverse strategy modeling; (ii) human-aligned, economically grounded metrics derived from utility theory. This is operationalized via agent utility, negotiation power, and acquisition ratio that implicitly measure how well the negotiation aligns with human preference and (iii) a human-preference-grounded dataset with a learning pipeline that strengthens LLMs' bargaining ability through both prompting and finetuning. Empirical results indicate that baseline LLM strategies often diverge from human preferences, while our mechanism substantially improves negotiation performance, yielding deeper strategic behavior and stronger opponent awareness.
Chinese Translation
谈判通常被视为一个逻辑领域,而非艺术或直觉的问题,然而大型语言模型(LLMs)在这一领域仍然面临挑战,主要由于其战略深度有限以及难以适应复杂的人类因素。目前的基准测试很少捕捉到这一局限性。为了解决这一问题,我们提出了一种以效用反馈为中心的框架。我们的贡献包括:(i)AgoraBench,一个新的基准,涵盖九个具有挑战性的场景(例如,欺骗、垄断),支持多样化的战略建模;(ii)基于效用理论的人类对齐、经济学基础的指标。这通过代理效用、谈判权力和获取比率来实现,隐含地衡量谈判与人类偏好的对齐程度;(iii)一个基于人类偏好的数据集及学习管道,通过提示和微调增强LLMs的谈判能力。实证结果表明,基线LLM策略通常与人类偏好存在偏差,而我们的机制显著提高了谈判表现,产生了更深层次的战略行为和更强的对手意识。
cs.AI / 5 / 2602.10485

Abstraction Generation for Generalized Planning with Pretrained Large Language Models

基于预训练大型语言模型的广义规划抽象生成
Cui, Zhenhe, Xia, Huaxiang, Shen, Hangjun, Luo, Kailun, He, Yong, Liang, Wei
Abstract
Qualitative Numerical Planning (QNP) serves as an important abstraction model for generalized planning (GP), which aims to compute general plans that solve multiple instances at once. Recent works show that large language models (LLMs) can function as generalized planners. This work investigates whether LLMs can serve as QNP abstraction generators for GP problems and how to fix abstractions via automated debugging. We propose a prompt protocol: input a GP domain and training tasks to LLMs, prompting them to generate abstract features and further abstract the initial state, action set, and goal into QNP problems. An automated debugging method is designed to detect abstraction errors, guiding LLMs to fix abstractions. Experiments demonstrate that when properly guided by automated debugging, some LLMs can generate useful QNP abstractions.
Chinese Translation
定性数值规划(QNP)作为广义规划(GP)的一种重要抽象模型,旨在计算能够同时解决多个实例的一般计划。近期研究表明,大型语言模型(LLMs)可以作为广义规划者。本研究探讨了LLMs是否能够作为QNP抽象生成器来解决GP问题,以及如何通过自动调试修正抽象。我们提出了一种提示协议:将GP领域和训练任务输入LLMs,提示它们生成抽象特征,并进一步将初始状态、动作集合和目标抽象为QNP问题。设计了一种自动调试方法,以检测抽象错误,引导LLMs修正抽象。实验表明,在自动调试的适当指导下,某些LLMs能够生成有用的QNP抽象。
cs.AI / 6 / 2602.10583

Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets

跨度流动:通过 GFlowNets 将语言模型推广到动态跨度词汇
Xue, Bo, Song, Yunchong, Shao, Fanghao, Zhu, Xuekai, Chen, Lin, Fu, Luoyi, Wang, Xinbing, Lin, Zhouhan
Abstract
Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a tree-structured state space when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, lacking explicit modeling of the directed acyclic graph (DAG) state space. This leads to restricted exploration of compositional paths and is biased toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficient exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose Flow of SpanS (FoSS), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5% over Transformer on text generation and achieves 3.5% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.
Chinese Translation
标准的自回归语言模型从固定词汇中逐个生成文本,视令牌采样为一种动作时会诱导出树状状态空间,这限制了灵活性和表现力。近期的研究通过采样检索到的文本跨度引入了动态词汇,但忽视了同一句子可以由不同长度的跨度组成,缺乏对有向无环图(DAG)状态空间的明确建模。这导致了组合路径的探索受到限制,并且偏向于所选择的路径。生成流网络(GFlowNets)在高效探索和推广状态空间方面具有强大的能力,特别是对于具有 DAG 结构的状态空间。然而,之前基于 GFlowNets 的语言模型在令牌级别操作,仍然局限于树状空间,限制了其潜力。在本研究中,我们提出了跨度流动(Flow of SpanS,FOSS),这是一个原则性的 GFlowNets 框架,用于跨度生成。FOSS 通过灵活地分割检索到的文本构建动态跨度词汇,确保了 DAG 结构的状态空间,使 GFlowNets 能够探索多样的组合路径并提高泛化能力。通过专门的奖励模型,FOSS 生成多样化的高质量文本。从实证上看,FOSS 在文本生成上相较于 Transformer 提高了高达 12.5% 的 MAUVE 分数,并在知识密集型任务上取得了 3.5% 的提升,始终优于最先进的方法。规模实验进一步表明,FOSS 从更大的模型、更多的数据和更丰富的检索语料中获益,保持了其相对于强基线的优势。
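The DAG structure arises because the same sentence admits several span segmentations that pass through shared token positions. Counting those compositional paths makes the tree-vs-DAG distinction concrete; the span vocabulary below is a toy example, not FoSS's retrieved corpus:

```python
def segmentation_paths(tokens, span_vocab):
    """Count distinct span compositions of a token sequence.

    Each token boundary is a DAG node; an edge i -> j exists when
    tokens[i:j] is an available span. Multiple paths reaching the same
    node are exactly the sharing a tree-structured space cannot express.
    """
    n = len(tokens)
    paths = [0] * (n + 1)
    paths[0] = 1  # empty prefix
    for i in range(n):
        for j in range(i + 1, n + 1):
            if " ".join(tokens[i:j]) in span_vocab:
                paths[j] += paths[i]  # DAG dynamic program
    return paths[n]

vocab = {"the", "cat", "sat", "the cat", "cat sat"}
n_paths = segmentation_paths(["the", "cat", "sat"], vocab)
# the|cat|sat, the cat|sat, the|cat sat -- three paths to one sentence
```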
cs.AI / 7 / 2602.10598

Neuro-symbolic Action Masking for Deep Reinforcement Learning

深度强化学习中的神经符号动作屏蔽
Han, Shuai, Dastani, Mehdi, Wang, Shihan
Abstract
Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations and manually specified action masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models, which are consistent with given domain constraints of high-dimensional states, in a minimally supervised manner during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves sample efficiency of the DRL agent while substantially reducing constraint violations.
Chinese Translation
深度强化学习(DRL)在训练和执行过程中可能会探索不可行的动作。现有方法假设存在一个符号基础函数,该函数将高维状态映射到一致的符号表示,并使用手动指定的动作屏蔽技术来限制动作。在本文中,我们提出了神经符号动作屏蔽(Neuro-symbolic Action Masking, NSAM),这是一个新颖的框架,能够在DRL过程中以最小监督的方式自动学习与高维状态的给定领域约束一致的符号模型。基于学习到的状态符号模型,NSAM学习动作屏蔽,以排除不可行的动作。NSAM实现了符号推理与深度策略优化的端到端集成,其中符号基础的改进与策略学习相互促进。我们在多个具有约束的领域上评估了NSAM,实验结果表明,NSAM显著提高了DRL代理的样本效率,同时大幅减少了约束违例。
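The masking mechanism itself is standard: infeasible actions have their logits sent to negative infinity before the softmax, so they receive exactly zero probability. NSAM's contribution is learning which actions to mask from a learned symbolic model; the sketch below uses a hand-written feasibility mask in its place:

```python
import math

def masked_policy(logits, feasible):
    """Renormalize a policy over feasible actions only.

    Infeasible actions get probability exactly zero by setting their logits
    to -inf before the softmax (math.exp(-inf) evaluates to 0.0).
    """
    masked = [l if ok else float("-inf") for l, ok in zip(logits, feasible)]
    m = max(masked)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

# Action 1 is ruled infeasible (in NSAM, by the learned symbolic model).
probs = masked_policy([2.0, 1.0, 0.5], feasible=[True, False, True])
```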
cs.AI / 8 / 2602.10625

To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks

思考还是不思考,这是大规模推理模型在心智理论任务中的问题
Gong, Nanxu, Li, Haotian, Dong, Sixun, Lian, Jianxun, Fu, Yanjie, Xie, Xing
Abstract
Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it is still underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy significantly drops as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, option matching shortcut: when multiple choice options are removed, reasoning models improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two intervention approaches: Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention to further verify and mitigate the problems. With all results, our study highlights the advancement of LRMs in formal reasoning (e.g., math, code) cannot be fully transferred to ToM, a typical task in social reasoning. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.
Chinese Translation
心智理论(Theory of Mind, ToM)评估模型是否能够推断隐藏的心理状态,如信念、欲望和意图,这对于自然的社会互动至关重要。尽管近年来大规模推理模型(Large Reasoning Models, LRMs)的进展推动了数学和编码中的逐步推理,但尚未深入探讨这种优势是否能够转移到社会认知技能上。我们对九种先进的大型语言模型(Large Language Models, LLMs)进行了系统研究,比较了推理模型与非推理模型在三个代表性ToM基准测试上的表现。结果显示,推理模型并不总是优于非推理模型,有时表现甚至更差。细致的分析揭示了三个见解。首先,慢思考崩溃:随着回答的延长,准确性显著下降,而更大的推理预算则会损害表现。其次,适度和自适应推理有利于表现:限制推理长度可以减轻失败,而不同的成功模式则表明动态适应的必要性。第三,选项匹配捷径:当多个选择选项被移除时,推理模型的表现显著改善,表明它们依赖于选项匹配而非真正的推理。我们还设计了两种干预方法:慢到快(Slow-to-Fast, S2F)自适应推理和思考匹配(Think-to-Match, T2M)捷径预防,以进一步验证和缓解这些问题。综合所有结果,我们的研究强调,LRMs在形式推理(例如数学、代码)方面的进展无法完全转移到ToM,这是社会推理中的典型任务。我们得出结论,实现稳健的ToM需要开发超越现有推理方法的独特能力。
cs.AI / 9 / 2602.10635

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

OmniSapiens:一种通过关注异质性的相对策略优化进行社会行为处理的基础模型
Ong, Keane, Boughorbel, Sabri, Xiao, Luwei, Ekbote, Chanakya, Dai, Wei, Qu, Ao, Wu, Jingyao, Mao, Rui, Hoque, Ehsan, Cambria, Erik, Mengaldo, Gianmarco, Liang, Paul Pu
Abstract
To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
Chinese Translation
为了开发具有社会智能的人工智能,现有的方法通常孤立地建模人类行为维度(例如,情感、认知或社会属性)。尽管这种方法有其用处,但特定任务的建模往往会增加训练成本,并限制在不同行为环境中的泛化能力。最近的推理强化学习(RL)方法促进了在多个行为任务中训练单一统一模型,但并未明确解决在不同异质行为数据之间的学习问题。为了解决这一空白,我们引入了关注异质性的相对策略优化(Heterogeneity-Aware Relative Policy Optimization, HARPO),这是一种平衡异质任务和样本学习的强化学习方法。通过调节优势,HARPO确保在策略优化过程中没有单一任务或样本产生不成比例的影响。利用HARPO,我们开发并发布了Omnisapiens-7B 2.0,这是一个用于社会行为处理的基础模型。与现有的行为基础模型相比,Omnisapiens-7B 2.0在行为任务中实现了最强的性能,在多任务和保留设置中分别提高了高达+16.85%和+9.37%的表现,同时产生了更明确和稳健的推理痕迹。我们还将HARPO与最近的强化学习方法进行了验证,结果显示其在行为任务中实现了最一致的强性能。
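One simple way to keep any single task from dominating the policy update is to standardize advantages within each task before mixing them, so tasks with large reward magnitudes contribute the same-sized signal as tasks with small ones. This is an illustrative guess at the "modulating advantages" idea, not HARPO's exact formula; the task names and rewards are invented:

```python
from statistics import mean, pstdev

def per_task_standardized_advantages(rewards_by_task):
    """Center and scale rewards within each task group.

    After standardization, every task contributes zero-mean, unit-variance
    advantages, regardless of its native reward scale.
    """
    advantages = {}
    for task, rewards in rewards_by_task.items():
        mu = mean(rewards)
        sd = pstdev(rewards) or 1.0  # guard against zero variance
        advantages[task] = [(r - mu) / sd for r in rewards]
    return advantages

adv = per_task_standardized_advantages({
    "empathy": [0.1, 0.2, 0.3],    # small reward scale
    "intent":  [10.0, 20.0, 30.0], # 100x larger scale
})
```

Despite the 100x difference in raw reward magnitude, both tasks now emit identical advantage profiles.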
cs.AI / 10 / 2602.10699

Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

把搜索开销花在有回报之处:面向生成式推荐的价值引导结构化采样与优化
Jiang, Jie, Huang, Yangru, Wang, Zeyu, Wang, Changping, Xiong, Yuling, Zhang, Jun, Yu, Huan
Abstract
Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
Chinese Translation
通过自回归模型进行生成推荐将检索与排序统一为一个单一的条件生成框架。然而,使用强化学习(Reinforcement Learning, RL)对这些模型进行微调时,常常面临根本的概率-奖励不匹配问题。传统的以似然为主导的解码(例如,束搜索)对局部可能的前缀表现出短视偏见,这导致了两个关键失败:(1)探索不足,在低概率分支中高奖励项被过早修剪且很少被采样;(2)优势压缩,共享高概率前缀的轨迹获得高度相关的奖励,且组内方差低,从而为RL提供了弱的比较信号。为了解决这些挑战,我们提出了V-STAR,一个基于价值引导的采样与树结构优势强化框架。V-STAR通过两个协同组件形成自我演化循环。首先,开发了价值引导高效解码(Value-Guided Efficient Decoding, VED),用于识别关键节点并选择性地加深高潜力前缀。这在不进行全面树搜索的情况下提高了探索效率。其次,我们提出了Sibling-GRPO,利用诱导的树拓扑计算兄弟相对优势,并将学习信号集中在关键分支决策上。在离线和在线数据集上的广泛实验表明,V-STAR在严格的延迟约束下超越了最先进的基线,提供了更优的准确性和候选集多样性。
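As a rough illustration of the sibling-relative advantage idea in this abstract, the toy sketch below centers (and scales) each trajectory's reward against its siblings under the same parent prefix, so that only decisive branching decisions carry a learning signal. The tree layout, reward values, and normalization choice are all hypothetical; the paper's actual Sibling-GRPO operates on the tree induced by value-guided decoding.

```python
# Hypothetical sketch of sibling-relative advantages (Sibling-GRPO-style):
# trajectories sharing a parent prefix are compared only against their
# siblings, concentrating the learning signal on the branching decision.
from statistics import mean, pstdev

def sibling_advantages(tree):
    """tree maps a parent prefix to {child_token: reward}.
    Returns per-(parent, child) advantages centered within each sibling group."""
    advantages = {}
    for parent, children in tree.items():
        rewards = list(children.values())
        mu = mean(rewards)
        sigma = pstdev(rewards)
        for child, r in children.items():
            # Center (and scale, when siblings differ) within the group only.
            advantages[(parent, child)] = (r - mu) / sigma if sigma > 0 else 0.0
    return advantages
```

Note how a branch whose children all receive the same reward contributes zero advantage everywhere, mirroring the "advantage compression" failure the paper targets.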
cs.AI / 11 / 2602.10802

Integrating Generative AI-enhanced Cognitive Systems in Higher Education: From Stakeholder Perceptions to a Conceptual Framework considering the EU AI Act

在高等教育中整合生成性人工智能增强的认知系统:从利益相关者的看法到考虑欧盟人工智能法案的概念框架
Chen, Da-Lun, Balasubramanian, Prasasthy, Lovén, Lauri, Pirttikangas, Susanna, Sauvola, Jaakko, Kostakos, Panagiotis
Abstract
Many staff and students in higher education have adopted generative artificial intelligence (GenAI) tools in their work and study. GenAI is expected to enhance cognitive systems by enabling personalized learning and streamlining educational services. However, stakeholders' perceptions of GenAI in higher education remain divided, shaped by cultural, disciplinary, and institutional contexts. In addition, the EU AI Act requires universities to ensure regulatory compliance when deploying cognitive systems. These developments highlight the need for institutions to engage stakeholders and tailor GenAI integration to their needs while addressing concerns. This study investigates how GenAI is perceived within the disciplines of Information Technology and Electrical Engineering (ITEE). Using a mixed-method approach, we surveyed 61 staff and 37 students at the Faculty of ITEE, University of Oulu. The results reveal both shared and discipline-specific themes, including strong interest in programming support from GenAI and concerns over response quality, privacy, and academic integrity. Drawing from these insights, the study identifies a set of high-level requirements and proposes a conceptual framework for responsible GenAI integration. Disciplinary-specific requirements reinforce the importance of stakeholder engagement when integrating GenAI into higher education. The high-level requirements and the framework provide practical guidance for universities aiming to harness GenAI while addressing stakeholder concerns and ensuring regulatory compliance.
Chinese Translation
许多高等教育的教职员工和学生在他们的工作和学习中采用了生成性人工智能(GenAI)工具。GenAI预计将通过实现个性化学习和简化教育服务来增强认知系统。然而,利益相关者对高等教育中GenAI的看法仍然存在分歧,这种看法受到文化、学科和机构背景的影响。此外,欧盟人工智能法案要求大学在部署认知系统时确保合规性。这些发展突显了机构需要与利益相关者互动,并根据他们的需求量身定制GenAI的整合,同时解决相关问题。本研究调查了GenAI在信息技术与电气工程(ITEE)学科中的看法。我们采用混合方法,对奥卢大学ITEE学院的61名教职员工和37名学生进行了调查。结果揭示了共享主题和学科特定主题,包括对GenAI编程支持的强烈兴趣以及对响应质量、隐私和学术诚信的担忧。基于这些见解,本研究确定了一组高级别需求,并提出了一个负责任的GenAI整合概念框架。学科特定需求强调了在将GenAI整合到高等教育中时与利益相关者互动的重要性。这些高级别需求和框架为希望利用GenAI同时解决利益相关者关切并确保合规性的大学提供了实用指导。
cs.AI / 12 / 2602.10814

See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch

观察、规划、拼接:评估 Scratch 中的多模态 GUI 代理
Zhang, Xingyi, Ye, Yulei, Huang, Kaifeng, Li, Wenhao, Wang, Xiangfeng
Abstract
Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning--acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
Chinese Translation
基于积木的编程环境如 Scratch 在低代码教育中扮演着核心角色,但评估 AI 代理通过图形用户界面(GUI)构建程序的能力仍然未被充分探索。我们引入了 ScratchWorld,这是一个用于评估多模态 GUI 代理在 Scratch 中进行构建程序任务的基准测试。ScratchWorld 基于使用-修改-创建(Use-Modify-Create)教学框架,包含 83 个精心策划的任务,涵盖四个不同的问题类别:创建(Create)、调试(Debug)、扩展(Extend)和计算(Compute)。为了严格诊断代理失败的原因,该基准采用两种互补的交互模式:原始模式(primitive mode)要求进行细粒度的拖放操作,以直接评估视觉运动控制,而复合模式(composite mode)则使用高级语义 API 将程序推理与 GUI 执行分离。为了确保评估的可靠性,我们提出了一种基于执行的评估协议,通过在浏览器环境中的运行时测试验证构建的 Scratch 程序的功能正确性。针对最先进的多模态语言模型和 GUI 代理的广泛实验揭示了显著的推理-行动差距,突显了尽管规划能力强大,但在细粒度 GUI 操作中仍面临持续挑战。
cs.AI / 13 / 2602.10845

SynergyKGC: Reconciling Topological Heterogeneity in Knowledge Graph Completion via Topology-Aware Synergy

SynergyKGC:通过拓扑感知协同解决知识图谱补全中的拓扑异质性
Zou, Xuecheng, Tang, Yu, Wang, Bingbing
Abstract
Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre-trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical "structural resolution mismatch," failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross-Modal Synergy Expert via relation-aware cross-attention and semantic-intent-driven gating. By coupling a density-dependent Identity Anchoring strategy with a Double-tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non-homogeneous structured data.
Chinese Translation
知识图谱补全(KGC)根本上依赖于预训练实体语义与异质拓扑结构的协调融合,以促进稳健的关系推理。然而,现有范式面临着一个关键的“结构解析不匹配”问题,未能调和不同图密度下的多样化表示需求,这导致在密集聚类中出现结构噪声干扰,而在稀疏区域则出现灾难性的表示崩溃。我们提出了SynergyKGC,这是一个自适应框架,通过关系感知的交叉注意力和基于语义意图的门控,将传统的邻居聚合推进到一个主动的跨模态协同专家。通过将密度依赖的身份锚定策略与双塔一致性架构相结合,SynergyKGC有效地调和了拓扑异质性,同时确保了训练和推理阶段的表示稳定性。在两个公共基准上的系统评估验证了我们方法在显著提高KGC命中率方面的优越性,为非均质结构数据中韧性信息集成的普遍原则提供了实证依据。
cs.AI / 14 / 2602.10885

Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics

通过自我演变的评分标准强化思维链推理
Sheng, Leheng, Ma, Wenchang, Hong, Ruixin, Wang, Xiang, Zhang, An, Chua, Tat-Seng
Abstract
Despite chain-of-thought (CoT) playing crucial roles in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling efforts, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation efforts and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances the outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.
Chinese Translation
尽管思维链(Chain-of-Thought, CoT)在大型语言模型(LLM)推理中发挥着至关重要的作用,但直接奖励其效果却很困难:训练奖励模型需要大量的人力标注工作,而静态奖励模型在应对不断变化的思维链分布和奖励破解方面也面临挑战。这些问题促使我们寻求一种无需人工标注且能够逐步演变的自主思维链奖励方法。受到近期自我演变训练方法的启发,我们提出了RLCER(Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics),该方法通过自我提出和自我演变的评分标准来奖励思维链,从而增强以结果为中心的可验证奖励强化学习(RLVR)。我们展示了自我提出和自我演变的评分标准即使在没有结果奖励的情况下也能提供可靠的思维链监督信号,使得RLCER在性能上优于以结果为中心的RLVR。此外,当作为提示中的线索使用时,这些自我提出的评分标准进一步提高了推理时的性能。
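As a loose sketch of rubric-based CoT rewarding, the snippet below scores a chain of thought by the fraction of rubric checks it passes. In RLCER the rubrics are self-proposed and self-evolving; the hand-written string predicates here are purely illustrative stand-ins.

```python
# Minimal sketch of rubric-based CoT rewarding, loosely in the spirit of
# RLCER. The rubric predicates below are invented for illustration; the
# paper's rubrics are proposed and evolved by the model itself.

def rubric_reward(cot: str, rubrics) -> float:
    """Score a chain-of-thought as the fraction of rubric checks it passes."""
    if not rubrics:
        return 0.0
    passed = sum(1 for check in rubrics if check(cot))
    return passed / len(rubrics)

rubrics = [
    lambda cot: "therefore" in cot.lower(),       # states a conclusion
    lambda cot: any(ch.isdigit() for ch in cot),  # shows intermediate quantities
    lambda cot: len(cot.split(".")) >= 3,         # multi-step reasoning
]
```

A reward of this shape supervises the CoT directly, without any outcome label, which is the property the abstract emphasizes.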
cs.AI / 15 / 2602.10964

Can LLMs Cook Jamaican Couscous? A Study of Cultural Novelty in Recipe Generation

大型语言模型能否烹饪牙买加粗麦粉?关于食谱生成中的文化新颖性研究
Carichon, F., Rampa, R., Farnadi, G.
Abstract
Large Language Models (LLMs) are increasingly used to generate and shape cultural content, ranging from narrative writing to artistic production. While these models demonstrate impressive fluency and generative capacity, prior work has shown that they also exhibit systematic cultural biases, raising concerns about stereotyping, homogenization, and the erasure of culturally specific forms of expression. Understanding whether LLMs can meaningfully align with diverse cultures beyond the dominant ones remains a critical challenge. In this paper, we study cultural adaptation in LLMs through the lens of cooking recipes, a domain in which culture, tradition, and creativity are tightly intertwined. We build on the GlobalFusion dataset, which pairs human recipes from different countries according to established measures of cultural distance. Using the same country pairs, we generate culturally adapted recipes with multiple LLMs, enabling a direct comparison between human and LLM behavior in cross-cultural content creation. Our analysis shows that LLMs fail to produce culturally representative adaptations. Unlike humans, the divergence of their generated recipes does not correlate with cultural distance. We further provide explanations for this gap. We show that cultural information is weakly preserved in internal model representations, that models inflate novelty in their production by misunderstanding notions such as creativity and tradition, and that they fail to identify adaptation with its associated countries and to ground it in culturally salient elements such as ingredients. These findings highlight fundamental limitations of current LLMs for culturally oriented generation and have important implications for their use in culturally sensitive applications.
Chinese Translation
大型语言模型(LLMs)越来越多地用于生成和塑造文化内容,涵盖叙事写作到艺术创作等多个领域。尽管这些模型展现出令人印象深刻的流畅性和生成能力,但先前的研究表明,它们也表现出系统性的文化偏见,这引发了对刻板印象、同质化以及文化特定表达形式消失的担忧。理解大型语言模型是否能够与多样文化(超越主流文化)进行有意义的对接仍然是一个关键挑战。在本文中,我们通过烹饪食谱的视角研究大型语言模型中的文化适应性,这是一个文化、传统和创造力紧密交织的领域。我们基于GlobalFusion数据集,该数据集根据既定的文化距离度量将来自不同国家的人类食谱进行配对。利用相同的国家对,我们使用多个大型语言模型生成文化适应的食谱,从而能够直接比较人类与大型语言模型在跨文化内容创作中的行为。我们的分析表明,大型语言模型未能生成具有文化代表性的适应性作品。与人类不同,它们生成的食谱的差异与文化距离并不相关。我们进一步提供了这一差距的解释。我们发现文化信息在模型的内部表示中被弱化保留,模型在生成过程中对创造力和传统等概念的误解导致其新颖性被夸大,并且它们未能将适应性与相关国家联系起来,也未能将其扎根于如食材等文化显著元素中。这些发现突显了当前大型语言模型在文化导向生成方面的基本局限性,并对其在文化敏感应用中的使用具有重要影响。
cs.AI / 16 / 2602.10999

CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

CLI-Gym:通过代理环境反演实现可扩展的命令行任务生成
Lin, Yusong, Wang, Haiyang, Wu, Shuzhe, Fan, Lue, Pan, Feiyang, Zhao, Sanyuan, Tu, Dandan
Abstract
Agentic coding requires agents to effectively interact with runtime environments, e.g., command line interfaces (CLI), so as to complete tasks like resolving dependency issues, fixing system problems, etc. But it remains underexplored how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing histories of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, being the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves substantial absolute improvements of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
Chinese Translation
代理编码要求代理能够有效地与运行时环境(例如命令行接口,CLI)互动,以完成诸如解决依赖问题、修复系统故障等任务。然而,如何在大规模上获取这些环境密集型任务以增强代理的能力仍然未被充分探索。为了解决这个问题,我们基于Dockerfile与代理任务之间的类比,提出利用代理模拟和探索环境历史,并通过执行反馈进行引导。通过追踪健康环境的历史,可以将其状态反演到一个具有运行时故障的早期状态,从中可以通过打包故障状态及相应的错误信息来推导出任务。通过我们的方法CLI-Gym,共衍生出1,655个环境密集型任务,成为同类任务中最大的集合。此外,借助精心策划的成功轨迹,我们的微调模型LiberCoder在Terminal-Bench上实现了+21.1%(达到46.1%)的显著绝对提升,超越了多种强基线。据我们所知,这是第一个公开的可扩展环境密集型任务推导管道。
cs.AI / 17 / 2602.11103

GameDevBench: Evaluating Agentic Capabilities Through Game Development

GameDevBench:通过游戏开发评估智能体能力
Chi, Wayne, Fang, Yixiong, Yayavaram, Arnav, Yayavaram, Siddharth, Karten, Seth, Wei, Qiuhong Anna, Chen, Runkun, Wang, Alexander, Chen, Valerie, Talwalkar, Ameet, Donahue, Chris
Abstract
Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
Chinese Translation
尽管编码智能体的进展迅速,但其多模态对应物的发展却滞后。一个关键挑战是缺乏结合软件开发复杂性与深度多模态理解需求的评估测试平台。游戏开发提供了这样一个测试平台,因为智能体必须在处理大型、密集的代码库时,同时操作诸如着色器、精灵和动画等内在多模态资产。我们提出了GameDevBench,这是第一个用于评估智能体在游戏开发任务中的基准。GameDevBench包含132个任务,这些任务来源于网络和视频教程。这些任务需要显著的多模态理解,并且复杂性较高——平均解决方案所需的代码行数和文件更改量是之前软件开发基准的三倍以上。智能体在游戏开发中仍然面临挑战,表现最好的智能体仅解决了54.5%的任务。我们发现感知任务难度与多模态复杂性之间存在强相关性,成功率从面向游戏玩法的任务的46.9%下降到2D图形任务的31.6%。为了提高多模态能力,我们为智能体引入了两种简单的基于图像和视频的反馈机制。尽管这些方法简单,但它们始终能提高性能,最大的变化是Claude Sonnet 4.5的性能从33.3%提高到47.7%。我们公开发布GameDevBench,以支持对智能体游戏开发的进一步研究。
cs.AI / 18 / 2602.11136

FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

FormalJudge:一种用于代理监督的神经符号范式
Zhou, Jiayi, Sheng, Yang, Lou, Hantao, Yang, Yaodong, Fu, Jie
Abstract
As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 satisfiability-modulo-theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
Chinese Translation
随着基于大型语言模型(LLM)的代理在具有现实世界后果的高风险领域中日益活跃,确保其行为安全变得至关重要。当前主流的监督范式,即LLM作为裁判,面临一个根本性困境:概率系统如何能够可靠地监督其他概率系统而不继承其失败模式?我们认为形式验证提供了一种原则性的解决方案,但其采用受到一个关键瓶颈的制约:从自然语言需求到形式规范的转换。本文通过提出FormalJudge这一神经符号框架填补了这一空白,该框架采用双向的形式化思维(Formal-of-Thought)架构:LLM作为规范编译器,从上到下将高层次的人类意图分解为原子、可验证的约束,然后从下到上使用Dafny规范和Z3可满足性模理论(SMT)求解证明合规性,从而产生数学保证而非概率评分。我们在三个基准测试中验证了该框架,涵盖行为安全、多领域约束遵循和代理向上欺骗检测。对7个代理模型的实验表明,该框架在LLM作为裁判的基准上平均提高了16.6%的性能,实现了从弱到强的泛化,其中一个7B的裁判在检测72B代理的欺骗时达到了超过90%的准确率,并通过迭代优化提供了接近线性的安全性提升。
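The decompose-then-verify loop can be caricatured as below: a high-level intent is compiled into atomic, independently checkable constraints, and an agent's action log passes only if every constraint holds, yielding a binary verdict instead of a probabilistic judge score. The constraint names and predicates are invented for illustration; the real system discharges such checks with Dafny specifications and the Z3 SMT solver, not Python predicates.

```python
# Toy stand-in for atomic constraint verification. Each constraint is an
# independently checkable predicate over an agent's action log; the overall
# verdict is the conjunction of all of them (pass/fail, not a score).

ATOMIC_CONSTRAINTS = {
    "no_destructive_ops": lambda log: not any("delete" in a for a in log),
    "stays_in_budget":    lambda log: len(log) <= 10,
}

def verify(log):
    """Return (verdict, failed_constraints) for an action log."""
    failed = [name for name, check in ATOMIC_CONSTRAINTS.items()
              if not check(log)]
    return (len(failed) == 0, failed)
```

Returning the list of failed constraints is what makes the verdict actionable for iterative refinement, the property the abstract credits for its near-linear safety improvement.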
计算语言学 (Computation and Language)
53
cs.CL / 1 / 2602.10118

Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback

审视评审者:通过大型语言模型引导反馈提升同行评审质量
Purkayastha, Sukannya, Wan, Qile, Lauscher, Anne, Qu, Lizhen, Gurevych, Iryna
Abstract
Peer review is central to scientific quality, yet reliance on simple heuristics -- lazy thinking -- has lowered standards. Prior work treats lazy thinking detection as a single-label task, but review segments may exhibit multiple issues, including broader clarity or specificity problems. Turning detection into actionable improvements requires guideline-aware feedback, which is currently missing. We introduce an LLM-driven framework that decomposes reviews into argumentative segments, identifies issues via a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback using issue-specific templates refined by a genetic algorithm. Experiments show our method outperforms zero-shot LLM baselines and improves review quality by up to 92.4%. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.
Chinese Translation
同行评审是科学质量的核心,但对简单启发式的依赖——懒惰思维——降低了标准。之前的研究将懒惰思维检测视为单标签任务,但评审段落可能表现出多种问题,包括更广泛的清晰度问题或具体性问题。将检测转化为可操作的改进需要遵循指导方针的反馈,而这在目前是缺失的。我们提出了一种基于大型语言模型(LLM)的框架,该框架将评审分解为论证段落,通过结合LLM特征与传统分类器的神经符号模块识别问题,并使用经过遗传算法优化的特定问题模板生成针对性的反馈。实验表明,我们的方法优于零样本LLM基线,并将评审质量提高了多达92.4%。我们还发布了LazyReviewPlus,一个包含1,309个句子的懒惰思维和具体性标注数据集。
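A minimal sketch of the detect-then-template feedback step might look as follows, with a keyword heuristic standing in for the paper's neurosymbolic classifier and two hand-written templates standing in for its GA-refined, issue-specific ones; all names and rules here are hypothetical.

```python
# Hypothetical sketch of the issue -> template feedback step. A review
# segment may carry several issue labels at once (multi-label), and each
# label maps to a targeted feedback template.

FEEDBACK_TEMPLATES = {
    "lazy_thinking": "Avoid heuristic judgments like '{quote}'; ground the critique in the paper's method or results.",
    "specificity":   "The comment '{quote}' is too generic; point to a concrete section, claim, or experiment.",
}

def detect_issues(segment: str):
    """Multi-label detection: a review segment may exhibit several issues."""
    issues = []
    if any(k in segment.lower() for k in ("not novel", "incremental")):
        issues.append("lazy_thinking")
    if len(segment.split()) < 8:  # very short comments tend to lack detail
        issues.append("specificity")
    return issues

def generate_feedback(segment: str):
    return [FEEDBACK_TEMPLATES[i].format(quote=segment)
            for i in detect_issues(segment)]
```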
cs.CL / 2 / 2602.10229

Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens

潜在思维调优:通过融合信息在潜在标记中桥接上下文与推理
Liu, Weihao, Min, Dehai, Cheng, Lu
Abstract
While explicit Chain-of-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we propose Latent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces a Context-Prediction-Fusion mechanism that jointly leverages contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamically switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy.
Chinese Translation
虽然显式的思维链(Chain-of-Thought, CoT)为大型语言模型(Large Language Models, LLMs)提供了强大的推理能力,但它要求模型在文本标记中逐步表达每一个中间步骤,从而将模型的思维限制在离散词汇空间内。最近,在连续潜在空间中的推理作为一种有前景的替代方案出现,使得推理更加稳健,并且能够在超越离散标记约束的情况下进行灵活计算。然而,当前的潜在范式往往面临特征崩溃和不稳定性的问题,这源于在反复使用隐藏状态作为输入嵌入时的分布不匹配,或在依赖辅助模型时的对齐问题。为了解决这一问题,我们提出了潜在思维调优(Latent Thoughts Tuning, LT-Tuning)框架,重新定义了潜在思维的构建和部署方式。我们的方法不再仅仅依赖原始隐藏状态,而是引入了一种上下文预测融合机制,联合利用上下文隐藏状态和来自词汇嵌入空间的预测语义指导。结合渐进的三阶段课程学习流程,LT-Tuning 还能够动态切换潜在与显式思维模式。实验表明,我们的方法在现有的潜在推理基准上表现优越,有效缓解了特征崩溃问题,并实现了稳健的推理准确性。
cs.CL / 3 / 2602.10238

Learning to Evict from Key-Value Cache

从键值缓存中学习驱逐
Moschella, Luca, Manduchi, Laura, Sener, Ozan
Abstract
The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
Chinese Translation
大型语言模型(LLMs)规模的不断扩大使得高效推理面临挑战,主要是由于自回归键值(Key-Value, KV)缓存的内存需求。现有的驱逐或压缩方法虽然降低了成本,但依赖于启发式方法,如最近性或过去的注意力得分,这些方法仅作为词元未来效用的间接代理,并引入了计算开销。我们将KV缓存驱逐重新构建为一个强化学习(Reinforcement Learning, RL)问题:学习根据词元对未来解码的预测有用性进行排名。为此,我们提出了KV策略(KV Policy, KVP),这是一个轻量级的每头RL代理框架,基于预计算的生成轨迹,仅使用键和值向量进行训练。每个代理学习一种专门的驱逐策略,该策略由未来效用指导,评估所有缓存预算下排名的质量,无需对基础LLM进行修改或额外推理。在长上下文基准RULER和多轮对话基准OASST2-4k上评估,KVP显著优于基线。此外,在标准下游任务(例如LongBench、BOOLQ、ARC)上的零样本测试表明,KVP在其训练分布之外以及更长的上下文长度上具有良好的泛化能力。这些结果表明,学习预测未来词元效用是一种强大且可扩展的自适应KV缓存管理范式。
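The ranking-based eviction interface can be sketched as below: a per-entry scorer (a trained RL agent in KVP, a fixed toy linear function here) predicts each cached token's future utility from its key/value vectors, and the cache keeps only the top entries under a given budget. All data and weights are invented for illustration.

```python
# Sketch of score-based KV cache eviction. In KVP the per-head scorer is a
# lightweight RL-trained agent over key/value vectors; here a toy linear
# function stands in so the ranking-and-eviction interface is visible.

def utility_score(key, value, weights):
    """Toy linear scorer over the concatenated key/value vector."""
    feats = list(key) + list(value)
    return sum(w * f for w, f in zip(weights, feats))

def evict_to_budget(cache, weights, budget):
    """cache: list of (token_id, key, value). Keep the `budget` highest-utility
    entries, preserving the original token order of the survivors."""
    ranked = sorted(cache, key=lambda e: utility_score(e[1], e[2], weights),
                    reverse=True)
    keep = {e[0] for e in ranked[:budget]}
    return [e for e in cache if e[0] in keep]
```

Because the scorer reads only keys and values, no change to the underlying LLM is needed, matching the abstract's claim of training-free deployment on the base model.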
cs.CL / 4 / 2602.10298

On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

关于新兴社会世界模型的研究——语言模型中心智理论与语用推理的功能整合证据
Tsvilodub, Polina, Klumpp, Jan-Felix, Mohammadpour, Amir, Hu, Jennifer, Franke, Michael
Abstract
This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning, in order to address the broader question of whether LMs may be said to have emergent "social world models", i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs' performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior work of this kind. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected "social world models" rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.
Chinese Translation
本文探讨语言模型(LMs)是否招募共享的计算机制,以实现一般的心智理论(Theory of Mind, ToM)和特定于语言的语用推理,从而为语言模型是否可以被称为具有新兴的“社会世界模型”提供贡献,即跨任务重新利用的心理状态表征(功能整合假说)。通过行为评估和基于认知神经科学启发的功能定位方法的因果机制实验,我们分析了语言模型在七个子类别的心智理论能力(Beaudoin et al., 2020)上的表现,使用的数据集规模明显大于先前类似研究中使用的数据集。严格的假设驱动统计测试结果为功能整合假说提供了暗示性证据,表明语言模型可能发展出相互关联的“社会世界模型”,而不是孤立的能力。该研究贡献了新的心智理论定位数据、功能定位技术的方法改进,以及对人工系统中社会认知出现的实证洞察。
cs.CL / 5 / 2602.10329

Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

更多的标记是否合理?语言模型中的推理时间扩展作为适应性资源理性
Hu, Zhimin, Roshan, Riya, Varma, Sashank
Abstract
Human reasoning is shaped by resource rationality -- optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource rationality can emerge from such scaling without an explicit reward tied to computational cost. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity even without an explicit cost-based reward, providing compelling evidence that resource rationality is an emergent property of inference-time scaling itself.
Chinese Translation
人类推理受到资源理性的影响——在约束条件下优化性能。最近,推理时间扩展作为一种强大的范式出现,以通过扩展测试时间计算来提高大型语言模型的推理性能。具体而言,指令调优(IT)模型在推理过程中明确生成长推理步骤,而大型推理模型(LRMs)则通过强化学习训练,以发现最大化准确性的推理路径。然而,目前尚不清楚这种扩展是否能够在没有与计算成本相关的显性奖励的情况下产生资源理性。我们引入了一种可变归因任务,其中模型推断哪些变量决定结果,给定候选变量、输入-输出试验和预定义的逻辑函数。通过改变候选变量和试验的数量,我们系统地操控任务复杂性。随着复杂性的增加,两种模型都表现出从暴力搜索到分析策略的转变。IT模型在XOR和XNOR函数上表现不佳,而LRMs则保持稳健。这些发现表明,模型可以根据任务复杂性调整其推理行为,即使没有显性的基于成本的奖励。这为资源理性是推理时间扩展本身的一个涌现特性提供了有力证据。
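The Variable Attribution Task itself is easy to state in code: given candidate variables, input-output trials, and a known logical function, find which variables determine the output. The brute-force reference solver below (a baseline oracle, not the models' strategy) illustrates the setup; the trial data and function arity handling are our own invented example.

```python
# Reference solver for a Variable-Attribution-style task: try every subset
# of candidate variables of the function's arity and return the first one
# consistent with all input-output trials.
from itertools import combinations

def attribute_variables(trials, n_vars, func, arity):
    """trials: list of (inputs, output), inputs a length-n_vars tuple.
    Returns the variable index tuple S such that func applied to the values
    at S reproduces every trial's output, or None if no subset fits."""
    for subset in combinations(range(n_vars), arity):
        if all(func(*(inp[i] for i in subset)) == out
               for inp, out in trials):
            return subset
    return None
```

Varying `n_vars` and the number of trials is exactly how the paper manipulates task complexity; note the solver's cost grows combinatorially, which is why a brute-force strategy becomes untenable as complexity rises.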
cs.CL / 6 / 2602.10339

The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

警察交通拦截中尊重的主观性:在佩戴式摄像头录像中建模社区视角
Golazizian, Preni, Rahmati, Elnaz, Trager, Jackson, Sourati, Zhivar, Ghazizadeh, Nona, Chochlakis, Georgios, Alcocer, Jose, Bennett, Kerby, Devnani, Aarya Vijay, Hejabi, Parsa, Muttram, Harry G., Padte, Akshay Kiran, Saadatinia, Mehrshad, Wu, Chenhao, Zaibari, Alireza S., Sierra-Arévalo, Michael, Weller, Nick, Narayanan, Shrikanth, Graham, Benjamin A. T., Dehghani, Morteza
Abstract
Traffic stops are among the most frequent police-civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-consistent alignment; and (iii) propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.
Chinese Translation
交通拦截是警察与平民互动中最频繁的形式之一,佩戴式摄像头(BWC)提供了这些遭遇展开的独特记录。尊重是这些互动的核心维度,影响公众信任和感知合法性,然而其解释本质上是主观的,并受到生活经历的影响,因此社区特定的视角成为一个关键考量。借助对洛杉矶警察局BWC录像的前所未有的访问,我们首次引入了一个大规模的交通拦截数据集,该数据集包含来自多个视角的尊重评分和自由文本理由的注释。通过从与警方相关、受司法系统影响以及与警方无关的洛杉矶居民中抽样注释者,我们能够系统地研究不同社区之间的感知差异。为此,我们(i)开发了一个基于程序公正理论、洛杉矶警察局培训材料和广泛实地调研的领域特定评估标准;(ii)引入一个基于评估标准的数据构建框架,以实现视角一致的对齐;(iii)提出一个视角感知建模框架,能够预测个性化的尊重评分,并为警官和民间驾驶员生成基于交通拦截记录的注释者特定理由。在所有三个注释者组中,我们的方法提高了评分预测性能和理由对齐的效果。我们的视角感知框架使执法部门能够更好地理解不同社区的期望,为建立公众信任和程序合法性提供了重要工具。
cs.CL / 7 / 2602.10346

Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

基于几何感知的解码方法:使用Wasserstein正则化截断和质量惩罚的大型语言模型
Davoodi, Arash Gholami, Rezazadeh, Navid, Davoudi, Seyed Pouyan Mousavi, Pezeshkpour, Pouya
Abstract
Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.
Chinese Translation
大型语言模型(LLMs)在开放式生成中必须平衡多样性和创造力与逻辑一致性之间的关系。现有的基于截断的采样器虽然有效,但主要依赖于概率质量和熵,忽视了标记空间的语义几何。我们提出了Top-W,这是一种几何感知的截断规则,利用在标记嵌入几何上定义的Wasserstein距离来保持裁剪后的分布接近原始分布,同时明确平衡保留的概率质量与保留集合的熵。我们的理论为固定潜力子集更新提供了简单的封闭形式结构:根据质量与熵的权衡,最优裁剪要么收敛为单个标记,要么呈现为一种可以通过线性扫描高效找到的一维前缀。我们使用高效的基于几何的潜力(最近集合或k-NN)实现Top-W,并将其与交替解码例程配对,以保持标准的截断和采样接口不变。在三个经过指令调优的模型上,我们在四个基准(GSM8K、GPQA、AlpacaEval和MT-Bench)上进行了广泛实验,结果表明Top-W始终优于之前的最先进解码方法,提升幅度可达33.7%。此外,我们发现Top-W不仅提高了以准确性为重点的性能,还在基于评审的开放式评估中增强了创造力。
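The one-dimensional prefix scan can be sketched as follows, under an assumed objective of retained mass minus a weighted entropy of the renormalized kept set; the paper's actual rule uses Wasserstein/geometry-based potentials over token embeddings, so this is only a structural illustration of the linear scan and its collapse-to-one-token regime.

```python
# Sketch of a probability-sorted prefix scan in the spirit of Top-W's
# closed-form result: the optimal crop is a prefix of the probability-sorted
# tokens, found by a single linear pass. The objective (mass - lam * entropy)
# is an assumed stand-in for the paper's mass-entropy trade-off.
import math

def best_prefix(probs, lam=0.1):
    """Return token indices of the prefix maximizing mass - lam * entropy."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    best_k, best_score, mass = 1, -math.inf, 0.0
    for k in range(1, len(order) + 1):
        mass += probs[order[k - 1]]
        kept = [probs[i] / mass for i in order[:k]]   # renormalized crop
        entropy = -sum(p * math.log(p) for p in kept if p > 0)
        score = mass - lam * entropy
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k]
```

With a large trade-off weight the optimum collapses to a single token, and with a moderate one it becomes an interior prefix, mirroring the two regimes the closed-form analysis predicts.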
cs.CL / 8 / 2602.10350

When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

何为少即是多?通过逐层解码诊断撒丁语的自动语音识别预测
De Cristofaro, Domenico, Vietti, Alessandro, Pouplier, Marianne, Block, Aleese
Abstract
Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.
Chinese Translation
近期研究表明,多语言语音模型中的中间层往往比最终输出层编码更为准确的音位表示。在本研究中,我们对预训练的 Wav2Vec2 模型应用逐层解码策略,以探讨音位级预测在编码器层中的演变,重点关注低资源语言坎比达内撒丁语。我们展示了截断上层变换器层可以改善音位错误率(Phoneme Error Rates, PER),最佳性能并非出现在最后一层,而是提前两层出现。通过细致的对齐分析,我们发现中间预测更好地保留了音段身份,避免了过度生成,并减少了某些类别的音韵错误。我们还引入了回归错误的概念,即中间层的正确预测被最终层的错误覆盖的情况。这些回归现象突显了表层错误度量的局限性,并揭示了更深层次的层如何可能从声学细节中进行概括或抽象。我们的发现支持在自动语音识别模型中使用早期层探测作为诊断工具,特别是在标准评估指标可能无法捕捉到语言上有意义行为的低资源环境中。
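The layer-truncation diagnostic can be sketched independently of Wav2Vec2 itself. Assuming each encoder layer's CTC logits have already been argmax-decoded into frame-level ID sequences (the toy inputs in the test are illustrative, not model outputs), selecting the best layer by Phoneme Error Rate looks like:

```python
def ctc_collapse(ids, blank=0):
    """Greedy CTC decoding: merge repeated frames, drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

def per(hyp, ref):
    """Phoneme Error Rate: Levenshtein distance / reference length."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(ref) + 1)]
         for i in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    return d[len(hyp)][len(ref)] / max(len(ref), 1)

def best_layer(per_layer_ids, ref):
    """Index of the encoder layer whose greedy decoding minimizes PER."""
    scored = [(per(ctc_collapse(ids), ref), k)
              for k, ids in enumerate(per_layer_ids)]
    return min(scored)[1]
```

In practice one would obtain `per_layer_ids` by running the model with all hidden states exposed and applying the CTC head to each layer, mirroring the paper's layer-wise probing.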
cs.CL / 9 / 2602.10352

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

从可解释性工件中学习自我解释:在向量-标签对上训练轻量适配器
Pepper, Keenan, McKenzie, Alex, Pop, Florin, Servaes, Stijn, Leitgab, Martin, Vaiana, Mike, Rosenblatt, Judd, Graziano, Michael S. A., de Lucena, Diogo
Abstract
Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
Chinese Translation
自我解释方法促使语言模型描述其内部状态,但由于超参数敏感性,仍然不可靠。我们展示了在可解释性工件上训练轻量适配器,同时保持语言模型完全冻结,可以在不同任务和模型系列中实现可靠的自我解释。仅需 $d_\text{model}+1$ 个参数的标量仿射适配器就足够:训练后的适配器生成的稀疏自编码器特征标签的表现优于训练标签本身(在70B规模下生成评分为71%对比63%),以94%的召回率@1识别主题,而未训练的基线仅为1%,并在多跳推理中解码既不出现在提示中也不出现在响应中的桥接实体,揭示隐含推理而无需链式思维。学习到的偏置向量单独贡献了85%的改进,且简单的适配器比更具表现力的替代方案更具泛化能力。在通过提示描述控制模型知识后,我们发现自我解释的提升速度超过了从7B到72B参数的能力提升。我们的结果表明,自我解释随着规模的扩大而改善,而无需修改被解释的模型。
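A scalar affine adapter with $d_\text{model}+1$ parameters is small enough to write out in full. A minimal sketch, assuming the adapter is one shared scalar plus a per-dimension bias applied to a frozen model's hidden-state vector (the training loop on vector-label pairs is omitted):

```python
import numpy as np

def scalar_affine_adapter(x, scale, bias):
    """Map a hidden-state vector x of shape (d_model,) with just
    d_model + 1 trainable parameters: a single shared scalar `scale`
    plus a bias vector `bias` of shape (d_model,). The interpreted
    LM itself stays entirely frozen; only these parameters train."""
    return scale * x + bias

def n_params(d_model):
    """Parameter count: one scalar + d_model biases."""
    return d_model + 1
```

The abstract's finding that the learned bias alone accounts for most of the improvement corresponds to fixing `scale` and training only `bias`.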
cs.CL / 10 / 2602.10354

Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence

物理可解释的 AlphaEarth 基础模型嵌入促进基于 LLM 的地表智能
Rahman, Mashrekur
Abstract
Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. Using 12.1 million samples across the Continental United States (2017--2023), we first present a comprehensive interpretability analysis of Google AlphaEarth's 64-dimensional embeddings against 26 environmental variables spanning climate, vegetation, hydrology, temperature, and terrain. Combining linear, nonlinear, and attention-based methods, we show that individual embedding dimensions map onto specific land surface properties, while the full embedding space reconstructs most environmental variables with high fidelity (12 of 26 variables exceed $R^2 > 0.90$; temperature and elevation approach $R^2 = 0.97$). The strongest dimension-variable relationships converge across all three analytical methods and remain robust under spatial block cross-validation (mean $\Delta R^2 = 0.017$) and temporally stable across all seven study years (mean inter-year correlation $r = 0.963$). Building on these validated interpretations, we then developed a Land Surface Intelligence system that implements retrieval-augmented generation over a FAISS-indexed embedding database of 12.1 million vectors, translating natural language environmental queries into satellite-grounded assessments. An LLM-as-Judge evaluation across 360 query--response cycles, using four LLMs in rotating generator, system, and judge roles, achieved weighted scores of $\mu = 3.74 \pm 0.77$ (scale 1--5), with grounding ($\mu = 3.93$) and coherence ($\mu = 4.25$) as the strongest criteria. Our results demonstrate that satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence.
Chinese Translation
卫星基础模型生成的稠密嵌入在物理可解释性方面仍然理解不足,限制了它们在环境决策系统中的整合。利用2017年至2023年间覆盖美国大陆的1210万个样本,我们首先对谷歌 AlphaEarth 的64维嵌入与涵盖气候、植被、水文、温度和地形等26个环境变量进行了全面的可解释性分析。通过结合线性、非线性和基于注意力的方法,我们展示了各个嵌入维度与特定地表属性之间的映射关系,同时完整的嵌入空间以高保真度重建了大多数环境变量(26个变量中有12个超过 $R^2 > 0.90$;温度和海拔接近 $R^2 = 0.97$)。最强的维度-变量关系在所有三种分析方法中趋于一致,并在空间块交叉验证下保持稳健(平均 $\Delta R^2 = 0.017$),且在七个研究年份中保持时间稳定(平均年际相关性 $r = 0.963$)。基于这些经过验证的解释,我们开发了一个地表智能系统,该系统在一个包含1210万个向量的 FAISS 索引嵌入数据库上实现了检索增强生成,将自然语言环境查询转换为卫星基础的评估。在360个查询-响应周期中进行的 LLM 作为评判者的评估,使用四个 LLM 在生成器、系统和评判者角色中轮换,达到了加权分数 $\mu = 3.74 \pm 0.77$(评分范围1-5),其中基础性($\mu = 3.93$)和连贯性($\mu = 4.25$)是最强的标准。我们的结果表明,卫星基础模型嵌入是物理结构化的表示,可以用于环境和地理空间智能的实际应用。
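The retrieval step of such a system reduces to nearest-neighbor search over the embedding database. A minimal exact-search stand-in for the FAISS index described in the abstract (brute-force cosine similarity in NumPy; fine for small databases, not for 12.1 million vectors):

```python
import numpy as np

def top_k_neighbors(db, query, k=3):
    """Exact cosine-similarity retrieval over an embedding database
    (rows of `db`). A stand-in for the FAISS index the paper uses at
    12.1M-vector scale; the interface is the same: query vector in,
    nearest row indices out, ready to ground an LLM prompt."""
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = db_n @ q_n                 # cosine similarity per row
    return np.argsort(-sims)[:k]      # indices of the k best matches
```

In a retrieval-augmented pipeline, the returned indices would look up the matching samples' metadata (location, year, environmental variables) to ground the generated assessment.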
cs.CL / 11 / 2602.10356

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

用于环境适应的计算机使用代理的自主持续学习
Xue, Tianci, Liao, Zeyi, Shi, Tianneng, Wang, Zilu, Zhang, Kai, Song, Dawn, Su, Yu, Sun, Huan
Abstract
Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU-NLP-Group/ACuRL.
Chinese Translation
现实世界的数字环境高度多样且动态。这些特性使得代理经常遇到未见过的场景和分布变化,因此在特定环境中进行持续学习对于计算机使用代理(CUAs)至关重要。然而,一个关键挑战在于如何在不依赖昂贵的人类标注的情况下获取高质量且基于环境的代理数据。在本研究中,我们提出了ACuRL,一个自主课程强化学习框架,能够在零人类数据的情况下不断适应特定环境。代理首先探索目标环境以获取初始经验。在随后的迭代训练中,课程任务生成器利用这些经验以及来自前一迭代的反馈,合成出适合代理当前能力的新任务。为了提供可靠的奖励信号,我们引入了CUAJudge,一个针对CUAs的强大自动评估器,其与人类判断的一致率达到93%。实证结果表明,我们的方法有效地实现了环境内和跨环境的持续学习,在现有环境中实现了4-22%的性能提升,且没有发生灾难性遗忘。进一步分析表明,更新非常稀疏(例如,仅20%的参数),这有助于解释有效且稳健的适应性。我们的数据和代码可在https://github.com/OSU-NLP-Group/ACuRL获取。
cs.CL / 12 / 2602.10380

The Alignment Bottleneck in Decomposition-Based Claim Verification

基于分解的声明验证中的对齐瓶颈
Akhter, Mahmud Elahi, Ruggeri, Federico, Bilal, Iman Munire, Procter, Rob, Liakata, Maria
Abstract
Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative "abstention" significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.
Chinese Translation
结构化声明分解常被提出作为验证复杂多面声明的解决方案,但实证结果却不一致。我们认为,这些不一致源于两个被忽视的瓶颈:证据对齐和子声明错误特征。为了更好地理解这些因素,我们引入了一个新的真实世界复杂声明数据集,该数据集具有时间限制的证据和人工标注的子声明证据范围。我们在两种证据对齐设置下评估分解:子声明对齐证据(Sub-claim Aligned Evidence, SAE)和重复声明级证据(Repeated Claim-level Evidence, SRE)。我们的结果显示,只有在证据细粒度且严格对齐时,分解才会带来显著的性能提升。相比之下,依赖于重复声明级证据(SRE)的标准设置未能改善性能,且在不同数据集和领域(PHEMEPlus、MMM-Fact、COVID-Fact)中往往导致性能下降。此外,我们还证明,在存在噪声子声明标签的情况下,错误的性质最终决定了下游的鲁棒性。我们发现,与激进但错误的预测相比,保守的“弃权”显著减少了错误传播。这些发现表明,未来的声明分解框架必须优先考虑精确的证据综合,并校准子声明验证模型的标签偏差。
cs.CL / 13 / 2602.10382

Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

触发器劫持语言电路:大型语言模型后门行为的机制分析
Lasnier, Théo, Antoun, Wissam, Kulumba, Francis, Seddah, Djamé
Abstract
Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
Chinese Translation
后门攻击对大型语言模型(LLMs)构成了重大安全风险,但触发器的内部机制仍然不甚了解。我们首次对语言切换后门进行了机制分析,研究了GAPperon模型系列(1B、8B、24B参数),该系列在预训练期间注入了触发器,导致输出语言切换。通过激活补丁技术,我们将触发器的形成局限于早期层(模型深度的7.5%-25%),并识别出处理触发器信息的注意力头。我们的主要发现是,触发器激活的头部与自然编码输出语言的头部在各个模型规模下均有显著重叠,识别出的顶级头部的Jaccard指数在0.18到0.66之间。这表明,后门触发器并不形成孤立的电路,而是共同利用模型现有的语言组件。这些发现对后门防御具有重要意义:检测方法或可通过监测已知功能组件而非搜索隐藏电路而获益,而缓解策略则有望利用注入行为与自然行为之间的这种纠缠。
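The reported head overlaps (Jaccard indices of 0.18-0.66) presumably follow the standard set definition; a minimal sketch with hypothetical (layer, head) identifiers:

```python
def jaccard(a, b):
    """Jaccard index between two sets of attention heads, each head
    identified here by an illustrative (layer, head) pair:
    |intersection| / |union|, with the empty-sets case defined as 0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Applied to the top trigger-activated heads and the top language-encoding heads, values near 1 would indicate the backdoor fully co-opting the natural language circuit, values near 0 an isolated circuit.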
cs.CL / 14 / 2602.10384

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

当表格失控:评估多模态模型在法语金融文件上的表现
Mouilleron, Virginie, Lasnier, Théo, Seddah, Djamé
Abstract
Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
Chinese Translation
视觉语言模型(VLMs)在许多文档理解任务中表现良好,但其在专业的非英语领域的可靠性仍然未得到充分探索。这一差距在金融领域尤为重要,因为金融文件通常混合了密集的监管文本、数字表格和视觉图表,而提取错误可能会带来现实世界的后果。我们推出了多模态金融评估(Multimodal Finance Eval),这是第一个用于评估法语金融文档理解的多模态基准数据集。该数据集包含1,204个经过专家验证的问题,涵盖文本提取、表格理解、图表解释和多轮对话推理,问题来源于真实的投资说明书、KIDs和PRIIPs。我们使用LLM-as-judge协议评估六个开放权重的VLM(参数范围为8B-124B)。尽管模型在文本和表格任务上取得了强劲的表现(85-90%的准确率),但在图表解释方面却表现不佳(34-62%)。尤其值得注意的是,多轮对话揭示了一种明显的失败模式:早期错误在对话轮次中传播,导致准确率降至约50%,无论模型大小如何。这些结果表明,当前的VLM在定义明确的提取任务中是有效的,但在互动的多步骤金融分析中仍然脆弱。多模态金融评估为在这一高风险环境中衡量和推动进展提供了一个具有挑战性的基准。
cs.CL / 15 / 2602.10388

Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

少即是多:在大语言模型的特征空间中合成多样化数据
Li, Zhongzhi, Wu, Xuansheng, Li, Yijiang, Hu, Lijie, Liu, Ninghao
Abstract
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
Chinese Translation
后训练数据的多样性对于大语言模型(LLMs)在下游任务中的有效表现至关重要。许多现有的构建后训练数据的方法使用基于文本的指标来量化多样性,这些指标捕捉了语言变异,但此类指标仅提供了对决定下游表现的任务相关特征的微弱信号。在本研究中,我们引入了特征激活覆盖(Feature Activation Coverage, FAC),该指标在可解释的特征空间中衡量数据多样性。在此指标的基础上,我们进一步提出了一种以多样性为驱动的数据合成框架,称为FAC合成(FAC Synthesis),该框架首先使用稀疏自编码器识别种子数据集中缺失的特征,然后生成明确反映这些特征的合成样本。实验表明,我们的方法在多个任务上(包括指令跟随、毒性检测、奖励建模和行为引导)始终提高了数据多样性和下游表现。有趣的是,我们识别出跨模型家族(即LLaMA、Mistral和Qwen)共享的可解释特征空间,从而实现跨模型知识转移。我们的工作为探索大语言模型的数据中心优化提供了坚实且实用的方法论。
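A coverage-style metric over SAE features can be sketched directly; the exact FAC definition is the paper's, so treat the following union-based variant as illustrative:

```python
def feature_activation_coverage(activated_per_sample, n_features):
    """Fraction of the SAE's feature dictionary activated by at least
    one sample in the dataset. (Illustrative coverage metric; the
    paper's exact FAC formulation may differ.)"""
    covered = set()
    for feats in activated_per_sample:
        covered.update(feats)
    return len(covered) / n_features

def missing_features(activated_per_sample, n_features):
    """Features never activated by the seed set -- the candidates the
    FAC Synthesis framework would target with new synthetic samples."""
    covered = set()
    for feats in activated_per_sample:
        covered.update(feats)
    return sorted(set(range(n_features)) - covered)
```

The second function mirrors the framework's first stage: identify which interpretable features the seed dataset leaves uncovered, then synthesize samples that explicitly reflect them.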
cs.CL / 16 / 2602.10400

When are We Worried? Temporal Trends of Anxiety and What They Reveal about Us

我们何时感到担忧?焦虑的时间趋势及其对我们的启示
Mohammad, Saif M.
Abstract
In this short paper, we make use of a recently created lexicon of word-anxiety associations to analyze large amounts of US and Canadian social media data (tweets) to explore *when* we are anxious and what insights that reveals about us. We show that our levels of anxiety on social media exhibit systematic patterns of rise and fall during the day -- highest at 8am (in line with when cortisol levels in the body are high) and lowest around noon. Anxiety is lowest on weekends and highest mid-week. We also examine anxiety in past, present, and future tense sentences to show that anxiety is highest in the past tense and lowest in the future tense. Finally, we examine the use of anxiety and calmness words in posts that contain pronouns to show: posts with 3rd person pronouns (he, they) carry more anxiety than posts with 1st and 2nd person pronouns, and posts with subject pronouns (I, he, she, they) carry more anxiety than posts with object pronouns (me, him, her, them). Overall, these trends provide valuable insights on not just when we are anxious, but also how different types of focus (future, past, self, outward, etc.) are related to anxiety.
Chinese Translation
在这篇短文中,我们利用最近创建的词汇-焦虑关联词典,分析大量来自美国和加拿大的社交媒体数据(推文),探讨我们何时感到焦虑以及这揭示了关于我们的哪些洞察。我们展示了社交媒体上的焦虑水平在一天中呈现出系统性的波动模式——早上8点时焦虑水平最高(与体内皮质醇水平高峰相一致),而中午时最低。周末的焦虑水平最低,而周中的焦虑水平最高。我们还考察了过去、现在和将来时态句子中的焦虑情况,表明过去时的焦虑水平最高,而将来时的焦虑水平最低。最后,我们分析了包含代词的帖子中焦虑和冷静词汇的使用情况,显示含第三人称代词(he、they)的帖子的焦虑水平高于含第一和第二人称代词的帖子,而含主格代词(I、he、she、they)的帖子的焦虑水平高于含宾格代词(me、him、her、them)的帖子。总体而言,这些趋势不仅为我们何时感到焦虑提供了宝贵的洞察,也揭示了不同类型的关注(未来、过去、自我、外部等)与焦虑之间的关系。
cs.CL / 17 / 2602.10414

EVOKE: Emotion Vocabulary Of Korean and English

EVOKE:韩英情感词汇
Jung, Yoonwon, Shin, Hagyeong, Bergen, Benjamin K.
Abstract
This paper introduces EVOKE, a parallel dataset of emotion vocabulary in English and Korean. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,427 Korean words and 1,399 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most comprehensive, systematic, and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at https://github.com/yoonwonj/EVOKE.
Chinese Translation
本文介绍了EVOKE,一个包含韩语和英语情感词汇的平行数据集。该数据集全面覆盖了两种语言中的情感词汇,并提供了两种语言之间的多对多翻译以及特定于语言的情感词汇的识别。数据集中包含1,427个韩语单词和1,399个英语单词,我们系统性地对819个韩语和924个英语形容词及动词进行了注释。我们还注释了每个单词的多重含义及其关系,识别了多义情感词和与情感相关的隐喻。根据我们的了解,该数据集是迄今为止最全面、系统且不依赖于特定理论的韩英情感词汇数据集。它可以作为情感科学、心理语言学、计算语言学和自然语言处理的实用工具,使研究人员能够根据他们的需求和理论视角采用不同的资源视角。该数据集已公开发布,网址为 https://github.com/yoonwonj/EVOKE。
cs.CL / 18 / 2602.10454

LATA: A Tool for LLM-Assisted Translation Annotation

LATA:一种用于大语言模型辅助翻译注释的工具
Huang, Baorong, Asiri, Ali
Abstract
The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic--English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to reduce the gap between scalable automation and the rigorous precision required for expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing integrates into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.
Chinese Translation
高质量平行语料库的构建在翻译研究中逐渐从简单的句子对齐演变为复杂的多层次注释任务。这一方法论的转变对结构差异显著的语言对(如阿拉伯语-英语)提出了重大挑战,因为标准的自动化工具常常无法捕捉深层次的语言转变或语义细微差别。本文介绍了一种新颖的、基于大语言模型(LLM)辅助的交互式工具,旨在缩小可扩展自动化与专家人类判断所需的严格精确性之间的差距。与传统的统计对齐工具不同,我们的系统采用基于模板的提示管理器,利用大语言模型(LLMs)在严格的JSON输出约束下进行句子分割和对齐。在该工具中,自动预处理集成到人机协作的工作流程中,使研究人员能够通过游离式(stand-off)标注架构精细调整对齐并应用自定义翻译技术注释。通过利用LLM辅助处理,该工具在注释效率与分析专业领域复杂翻译现象所需的语言精确性之间取得了平衡。
cs.CL / 19 / 2602.10480

Neuro-Symbolic Synergy for Interactive World Modeling

神经符号协同用于交互式世界建模
Zhao, Hongyu, Zhou, Siyu, Yang, Haolin, Qin, Zengyi, Zhou, Tianyi
Abstract
Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules--particularly in corner cases--is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS's consistent advantages over baselines in both WM prediction accuracy and data efficiency.
Chinese Translation
大型语言模型(LLMs)展现出强大的通用推理能力,但在作为世界模型(WMs)使用时常常出现幻觉,而世界模型必须严格遵循确定性转移规则,尤其是在边缘案例中。相比之下,符号世界模型提供了逻辑一致性,但缺乏语义表达能力。为了解决这一问题,我们提出了神经符号协同(Neuro-Symbolic Synergy, NeSyS)框架,该框架将LLMs的概率语义先验与可执行的符号规则相结合,以实现表达能力和鲁棒性的统一。NeSyS通过在两个模型之间交替训练,利用另一模型无法充分解释的轨迹。与基于规则的提示不同,符号世界模型通过修改LLM的输出概率分布直接约束LLM。神经世界模型仅在符号规则未覆盖的轨迹上进行微调,从而在不损失准确性的情况下减少了50%的训练数据。针对三个不同的交互环境,即ScienceWorld、Webshop和Plancraft的大规模实验表明,NeSyS在世界模型预测准确性和数据效率方面均显著优于基线模型。
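The key mechanism, a symbolic WM constraining the LLM by modifying its output distribution rather than merely prompting it, can be illustrated as logit masking plus renormalization. The token ids and the `allowed` rule set below are hypothetical:

```python
import math

def constrain_distribution(logits, allowed):
    """Apply a symbolic rule (the set of `allowed` next-token ids) to
    an LLM's next-token logits: zero out the probability of every
    rule-violating token, then renormalize over the survivors. This
    sketches how a symbolic WM can directly edit the neural WM's
    output distribution instead of relying on prompting."""
    exp = [math.exp(l) if i in allowed else 0.0
           for i, l in enumerate(logits)]
    z = sum(exp)
    return [e / z for e in exp]
```

With this hard constraint in place, the LLM can never emit a transition the symbolic rules forbid, which is exactly the corner-case robustness the abstract targets.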
cs.CL / 20 / 2602.10494

Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

思维画布:通过可变结构状态实现推理基础
Sun, Lingzhuang, Zhu, Yuxia, Liu, Ruitong, Liang, Hao, Sun, Zheng, Jia, Caijun, He, Honghao, Wu, Yuchen, Li, Siyuan, Wei, Jingxuan, Zhang, Xiangxiang, Yu, Bihui, Zhang, Wentao
Abstract
While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce \textbf{Canvas-of-Thought (Canvas-CoT)}. By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
Chinese Translation
尽管链式思维(Chain-of-Thought, CoT)提示显著提升了多模态大型语言模型(Multimodal Large Language Models, MLLMs)的推理能力,但仅依赖线性文本序列仍然是复杂任务的瓶颈。我们观察到,即使在辅助视觉元素交错的情况下,它们通常被视为一维、非结构化推理链中的静态快照。我们认为,这种方法将推理历史视为不可变的流:纠正局部错误需要生成冗长的下游修正或重新生成整个上下文。这迫使模型隐式地维护和跟踪状态更新,显著增加了令牌消耗和认知负担。这一限制在高维领域尤为明显,例如几何和SVG设计,其中CoT的文本表达缺乏明确的视觉指导,进一步限制了模型的推理精度。为了弥补这一差距,我们引入了\textbf{思维画布(Canvas-of-Thought, Canvas-CoT)}。通过利用HTML画布作为外部推理基底,Canvas-CoT使模型能够执行原子级的基于DOM的增删改查(CRUD)操作。这种架构允许在不干扰周围上下文的情况下进行就地状态修订,使模型能够明确维护“真实情况”。此外,我们集成了一个基于渲染的批判循环,作为严格的约束验证器,提供明确的视觉反馈,以解决那些仅通过文本难以表达的复杂任务。在VCode、RBench-V和MathVista上的大量实验表明,Canvas-CoT显著优于现有基线,建立了一种新的上下文高效多模态推理范式。
cs.CL / 21 / 2602.10504

On the Robustness of Knowledge Editing for Detoxification

知识编辑在去毒化中的鲁棒性研究
Dong, Ming, Tang, Shiyi, Peng, Ziyan, Chen, Guanyi, He, Tingting
Abstract
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
Chinese Translation
基于知识编辑(KE-based)的去毒化已成为缓解大型语言模型中有害行为的有前景的方法。然而,现有评估在很大程度上依赖于自动毒性分类器,隐含假设降低的毒性分数反映了真实的行为抑制。在本研究中,我们提出了一个以鲁棒性为导向的KE-based去毒化评估框架,该框架在三个维度上考察其可靠性,超越标准分类器指标:优化鲁棒性、组合鲁棒性和跨语言鲁棒性。我们识别出伪去毒化作为一种常见的失败模式,其中表面上的毒性减少源于退化生成行为,而非对不安全内容的有意义抑制。我们进一步表明,当多个不安全行为共同被编辑时,去毒化的有效性会下降,并且单语和跨语言去毒化仅在特定的模型-方法组合下仍然有效。总体而言,我们的结果表明,基于知识编辑的去毒化仅对某些模型、有限数量的去毒化目标和部分语言具有鲁棒性。
cs.CL / 22 / 2602.10525

LHAW: Controllable Underspecification for Long-Horizon Tasks

LHAW:用于长时间任务的可控欠规范化
Pu, George, Lee, Michael S., Sehwag, Udari Madhushani, Lee, David J., Zhu, Bryan, Maurya, Yash, Raghavendra, Mohit, Xue, Yuan, Denton, Samuel Marc
Abstract
Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
Chinese Translation
能够在较长时间跨度内有效运行的长程工作流代理对于真正的自主系统至关重要。它们的可靠执行在很大程度上依赖于在模糊情况下进行推理的能力,此时需要寻求澄清以确保任务的正确执行。然而,进展受到限制,因为缺乏可扩展的、与任务无关的框架来系统地策划和测量模糊性在自定义工作流中的影响。我们通过引入LHAW(长时间增强工作流)来填补这一空白,LHAW是一个模块化的、与数据集无关的合成管道,它通过在可配置的严重程度下系统地移除四个维度的信息——目标、约束、输入和上下文——将任何明确定义的任务转变为可控的欠规范化变体。与依赖于大语言模型(LLM)对模糊性的预测的方法不同,LHAW通过经验代理试验验证变体,根据观察到的终态差异将其分类为结果关键型、分歧型或良性型。我们根据我们的分类法发布了来自TheAgentCompany、SWE-Bench Pro和MCP-Atlas的285个任务变体,并进行正式分析,测量当前代理如何检测、推理和解决模糊环境中的欠规范化。LHAW提供了第一个系统化的框架,用于在长时间环境中对代理澄清行为进行成本敏感的评估,从而促进可靠自主系统的发展。
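The variant-generation step can be sketched as controlled information removal. The field names and the keep-a-prefix truncation rule below are illustrative assumptions, not LHAW's actual pipeline:

```python
def underspecify(task, dims, severity):
    """Produce a controllably underspecified variant of a
    well-specified task: for each targeted dimension (e.g. "Goals",
    "Constraints", "Inputs", "Context"), drop a `severity` fraction
    of its information items; untouched dimensions are copied as-is.
    Field names and the truncation rule are illustrative."""
    variant = {}
    for dim, items in task.items():
        if dim in dims:
            keep = int(round(len(items) * (1 - severity)))
            variant[dim] = items[:keep]     # remove the tail items
        else:
            variant[dim] = list(items)      # dimension left intact
    return variant
```

Each generated variant would then be run through empirical agent trials and classified as outcome-critical, divergent, or benign based on terminal-state divergence, per the abstract.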
cs.CL / 23 / 2602.10560

When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

何时记忆,何时停止:用于长上下文推理的门控递归记忆
Sheng, Leheng, Zhang, Yongtao, Ma, Wenchang, Shi, Yaorui, Huang, Ting, Wang, Xiang, Zhang, An, Shen, Ke, Chua, Tat-Seng
Abstract
While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence has been collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals $r^{\text{update}}$ and $r^{\text{exit}}$ within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400\% inference speed acceleration.
Chinese Translation
在长上下文推理中,尽管对于各种现实世界应用至关重要,但大型语言模型(LLMs)在上下文长度增加时仍面临性能下降的挑战。最近的研究 MemAgent 尝试通过在类似 RNN 的循环中逐块处理上下文并更新文本记忆以进行最终回答来解决此问题。然而,这种简单的递归记忆更新面临两个关键缺陷:(i)记忆可能迅速膨胀,因为它可以在没有证据的块上不加区分地更新;(ii)循环缺乏退出机制,导致在收集到足够证据后仍进行不必要的计算。为了解决这些问题,我们提出了 GRU-Mem,它结合了两个文本控制门,以实现更稳定和高效的长上下文推理。具体而言,在 GRU-Mem 中,只有在更新门打开时记忆才会更新,而一旦退出门打开,递归循环将立即退出。为了赋予模型这种能力,我们在端到端强化学习中引入了两个奖励信号 $r^{\text{update}}$ 和 $r^{\text{exit}}$,分别奖励正确的更新和退出行为。在各种长上下文推理任务上的实验表明,GRU-Mem 的有效性和效率,其推理速度相较于原始的 MemAgent 提升高达 400%。
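The control flow of the two gates can be sketched as a plain loop; the gate predicates below are hand-written stand-ins for the text-controlled gates the model learns via the $r^{\text{update}}$ and $r^{\text{exit}}$ rewards:

```python
def gated_memory_loop(chunks, has_evidence, summarize, is_sufficient):
    """Sketch of the GRU-Mem control flow: process context chunk by
    chunk; update the textual memory only when the update gate opens
    (evidence present in the chunk), and exit the recurrence as soon
    as the exit gate opens (memory deemed sufficient). The predicate
    functions stand in for the learned text-controlled gates."""
    memory, steps = "", 0
    for chunk in chunks:
        steps += 1
        if has_evidence(chunk):          # update gate
            memory = summarize(memory, chunk)
        if is_sufficient(memory):        # exit gate
            break
    return memory, steps
```

Skipping evidence-free chunks keeps the memory from exploding, and the early exit is where the inference-speed acceleration over the vanilla chunk-by-chunk loop comes from.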
cs.CL / 24 / 2602.10604

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Step 3.5 Flash:具备 11B 活跃参数的前沿智能模型
Huang, Ailin, Li, Ang, Kong, Aobo, Wang, Bin, Jiao, Binxing, Dong, Bo, Wang, Bojun, Chen, Boyu, Li, Brian, Ma, Buyun, Su, Chang, Miao, Changxin, Wan, Changyi, Lou, Chao, Hu, Chen, Xu, Chen, Yu, Chenfeng, Feng, Chengting, Yao, Chengyuan, Han, Chunrui, Ma, Dan, Shi, Dapeng, Jiang, Daxin, Ma, Dehua, Sun, Deshan, Qi, Di, Liu, Enle, Zhang, Fajie, Wan, Fanqi, Huang, Guanzhe, Yan, Gulin, Cao, Guoliang, Li, Guopeng, Cheng, Han, Guo, Hangyu, Zhang, Hanshan, Nie, Hao, Jia, Haonan, Lv, Haoran, Zhou, Hebin, Lv, Hekun, Wang, Heng, Shum, Heung-Yeung, Huang, Hongbo, Peng, Hongbo, Zhou, Hongyu, Wang, Hongyuan, Chen, Houyong, Zhu, Huangxi, Wu, Huimin, Guo, Huiyong, Wang, Jia, Zhou, Jian, Sun, Jianjian, Wu, Jiaoren, Zhang, Jiaran, Lv, Jiashu, Liu, Jiashuo, Fu, Jiayi, Liu, Jiayu, Cheng, Jie, Luo, Jie, Yang, Jie, Zhou, Jie, Hou, Jieyi, Bai, Jing, Hu, Jingcheng, Xie, Jingjing, Wu, Jingwei, Zhang, Jingyang, Zhou, Jishi, Liu, Junfeng, Lin, Junzhe, Lo, Ka Man, Liang, Kai, Liu, Kaibo, Tan, Kaijun, Yan, Kaiwen, Li, Kaixiang, An, Kang, Lin, Kangheng, Yang, Lei, Lv, Liang, Zhao, Liang, Chen, Liangyu, Shi, Lieyu, Tan, Liguo, Lin, Lin, Chen, Lina, Ma, Luck, Ren, Mengqiang, Li, Michael, Li, Ming, Li, Mingliang, Zhang, Mingming, Chen, Mingrui, Huang, Mitt, Wang, Na, Liu, Peng, Han, Qi, Zhao, Qian, He, Qinglin, Du, Qinxin, Wu, Qiuping, Sun, Quan, Yang, Rongqiu, Miao, Ruihang, Han, Ruixin, Wan, Ruosi, Guo, Ruyan, Wang, Shan, Pang, Shaoliang, Yang, Shaowen, Fan, Shengjie, Shang, Shijie, Yang, Shiliang, Li, Shiwei, Tian, Shuangshuang, Liu, Siqi, Wu, Siye, Chen, Siyu, Yuan, Song, Cao, Tiancheng, Yue, Tianchi, Cheng, Tianhao, Li, Tianning, Luo, Tingdan, You, Wang, Ji, Wei, Yuan, Wei, Zhang, Wei, Wu, Weibo, Xie, Weihao, Sun, Wen, Deng, Wenjin, Zheng, Wenzhen, Xie, Wuxun, Wang, Xiangfeng, Kong, Xiangwen, Liu, Xiangyu, Zhang, Xiangyu, Yang, Xiaobo, Liu, Xiaojia, Yuan, Xiaolan, Jiao, Xiaoran, Ren, Xiaoxiao, Zhang, Xiaoyun, Li, Xin, Liu, Xin, Wu, Xin, Chen, Xing, Yang, Xingping, Wang, Xinran, Zhao, Xu, He, 
Xuan, Feng, Xuanti, Cai, Xuedan, Zhou, Xuqiang, Yu, Yanbo, Li, Yang, Xu, Yang, Lai, Yanlin, Xu, Yanming, Wang, Yaoyu, Shen, Yeqing, Zhu, Yibo, Lv, Yichen, Cao, Yicheng, Gong, Yifeng, Yang, Yijing, Yang, Yikun, Zhao, Yin, Zhao, Yingxiu, Zhang, Yinmin, Zhang, Yitong, Zhang, Yixuan, Chen, Yiyang, Zhao, Yongchi, Long, Yongshen, Wang, Yongyao, Guan, Yousong, Zhou, Yu, Peng, Yuang, Ding, Yuanhao, Fan, Yuantao, Yang, Yuanzhen, Luo, Yuchu, Zhao, Yudi, Peng, Yue, Lin, Yueqiang, Lu, Yufan, Zhao, Yuling, Ju, Yunzhou, Zhang, Yurong, Li, Yusheng, Yang, Yuxiang, Chen, Yuyang, Cai, Yuzhu, Weng, Zejia, Hong, Zetao, Li, Zexi, Xie, Zhe, Ge, Zheng, Gong, Zheng, Zeng, Zheng, Lu, Zhenyi, Huang, Zhewei, Chang, Zhichao, Huang, Zhiguo, Hu, Zhiheng, Yang, Zidong, Wang, Zili, Ren, Ziqi, Zhang, Zixin, Wang, Zixuan
Abstract
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
Chinese Translation
我们介绍了 Step 3.5 Flash,这是一种稀疏的专家混合模型(Mixture-of-Experts, MoE),旨在连接前沿级别的自主智能与计算效率。我们关注构建智能体时最重要的因素:敏锐的推理能力和快速、可靠的执行。Step 3.5 Flash 将一个 196B 参数的基础模型与 11B 活跃参数相结合,以实现高效推理。它采用交错的 3:1 滑动窗口/全注意力机制和多标记预测(Multi-Token Prediction, MTP-3)进行优化,以减少多轮智能体交互的延迟和成本。为了达到前沿级别的智能,我们设计了一个可扩展的强化学习框架,该框架结合了可验证信号与偏好反馈,同时在大规模离策略(off-policy)训练下保持稳定,从而实现数学、代码和工具使用方面的一致自我改进。Step 3.5 Flash 在智能体、编码和数学任务上表现出色,在 IMO-AnswerBench 上达到 85.4%,在 LiveCodeBench-v6(2024.08-2025.05)上达到 86.4%,在 tau2-Bench 上达到 88.2%,在 BrowseComp(具备上下文管理)上达到 69.0%,在 Terminal-Bench 2.0 上达到 51.0%,与 GPT-5.2 xHigh 和 Gemini 3.0 Pro 等前沿模型相当。通过重新定义效率边界,Step 3.5 Flash 为在现实工业环境中部署复杂智能体提供了高密度的基础。
cs.CL / 25 / 2602.10609

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

在线因果卡尔曼滤波用于稳定有效的策略优化
He, Shuo, Feng, Lang, Cheng, Xin, Feng, Lei, An, Bo
Abstract
Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, independently of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
Chinese Translation
大语言模型的强化学习面临高方差的令牌级重要性采样(IS)比率,这会在大规模下使策略优化不稳定。为了提高稳定性,近期的方法通常为序列中的所有令牌使用固定的序列级 IS 比率,或分别调整每个令牌的 IS 比率,从而忽视了序列中令牌之间的时间性离策略偏差。在本文中,我们首先通过实证研究发现局部离策略偏差在令牌级别上结构不一致,这可能扭曲相邻令牌之间的策略梯度更新并导致训练崩溃。为了解决这个问题,我们提出了在线因果卡尔曼滤波用于稳定有效的策略优化(KPO)。具体而言,我们将期望的 IS 比率建模为一个在令牌间演变的潜在状态,并应用卡尔曼滤波器根据过去令牌的状态在线和自回归地更新该状态,而不考虑未来的令牌。最终得到的滤波 IS 比率在保留令牌级局部结构感知变化的同时,有效平滑噪声尖峰,从而产生更稳定和有效的策略更新。在实验中,与最先进的方法相比,KPO 在具有挑战性的数学推理数据集上取得了优越的结果。
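KPO's filtering step is, at its core, a scalar Kalman update applied causally along the token sequence. The sketch below illustrates that mechanism on a toy ratio sequence; the initialization and the process/observation noise variances (`q`, `r`) are illustrative assumptions, not values from the paper:

```python
def kalman_smooth_ratios(ratios, q=1e-3, r=0.05):
    """Causally smooth token-level importance-sampling ratios.

    ratios: observed per-token IS ratios (noisy).
    q: assumed process-noise variance (how fast the latent ratio drifts).
    r: assumed observation-noise variance (spikiness of raw ratios).
    Returns the filtered sequence, using only past observations.
    """
    x, p = ratios[0], 1.0          # initial state estimate and variance
    filtered = [x]
    for z in ratios[1:]:
        p = p + q                  # predict: variance grows by process noise
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # update toward the new observation
        p = (1.0 - k) * p          # shrink posterior variance
        filtered.append(x)
    return filtered

# A transient noise spike at position 3 is strongly damped.
smoothed = kalman_smooth_ratios([1.0, 1.02, 0.98, 3.0, 1.01, 0.99])
```

The spike (3.0) is pulled sharply toward the running estimate while slow drifts in the latent ratio are still tracked, which is the behavior the abstract describes as smoothing noise spikes while preserving local variation.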
cs.CL / 26 / 2602.10622

How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

仅解码器的大型语言模型如何感知用户?重新思考用户表征学习中的注意力掩码
Yuan, Jiahao, Xu, Yike, Wen, Jinyong, Wang, Baokun, Chen, Yang, Lin, Xiaotong, Huang, Wuliang, Gao, Ziyi, Fu, Xing, Cheng, Yu, Wang, Weiqiang
Abstract
Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.
Chinese Translation
仅解码器的大型语言模型越来越多地被用作用户表征学习的行为编码器,但注意力掩码对用户嵌入质量的影响仍未得到充分探讨。在本研究中,我们在一个统一的对比学习框架内,对因果、混合和双向注意力掩码进行了系统研究,该框架基于大规模真实世界的支付宝数据,整合了长时间跨度的异构用户行为。为了改善从因果注意力向双向注意力过渡时的训练动态,我们提出了梯度引导软掩码(Gradient-Guided Soft Masking),这是一种在线性调度器之前应用的基于梯度的预热方法,该方法在优化过程中逐渐开放未来的注意力。在涵盖预测、偏好和营销敏感性任务的9个工业用户认知基准上进行评估时,我们的方法与因果、混合和仅调度器的基线相比,始终提供了更稳定的训练和更高质量的双向表征,同时仍然与解码器预训练兼容。总体而言,我们的研究结果强调了掩码设计和训练过渡在调整仅解码器的大型语言模型以实现有效用户表征学习中的重要性。我们的代码可在 https://github.com/JhCircle/Deepfind-GGSM 获取。
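The linear scheduler that "gradually opens future attention" can be sketched as an additive attention mask interpolated between strictly causal and fully bidirectional. The finite penalty scale (-1e4) and the schedule shape are illustrative assumptions; the gradient-guided pre-warmup the paper adds before this scheduler is not shown:

```python
import numpy as np

def soft_attention_mask(seq_len, openness):
    """Additive attention mask. openness=0 -> strictly causal
    (future positions get -inf); openness=1 -> fully bidirectional.
    Intermediate values scale down the future penalty, softly
    admitting attention to future tokens."""
    mask = np.zeros((seq_len, seq_len))
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    if openness <= 0.0:
        mask[future] = -np.inf
    else:
        # finite penalty that shrinks linearly to 0 as openness -> 1
        mask[future] = -1e4 * (1.0 - openness)
    return mask

def linear_schedule(step, warmup_steps, total_steps):
    """openness stays 0 during warmup, then rises linearly to 1."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))

m0 = soft_attention_mask(4, linear_schedule(0, 100, 1000))      # causal
m1 = soft_attention_mask(4, linear_schedule(1000, 100, 1000))   # bidirectional
```

The mask is added to attention scores before the softmax, so the training transition never changes the model architecture, only how much of the future each position may see.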
cs.CL / 27 / 2602.10652

UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

UMEM:面向可泛化记忆的统一记忆提取与管理框架
Ye, Yongshi, Jiang, Hui, Jiang, Feihu, Lan, Tian, Du, Yichao, Fu, Biao, Shi, Xiaodong, Jia, Qianghuai, Wang, Longyue, Luo, Weihua
Abstract
Self-evolving memory serves as the trainable parameters for Large Language Model (LLM)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominantly optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneously extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Furthermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.
Chinese Translation
自我演化的记忆作为基于大型语言模型(LLMs)代理的可训练参数,其中提取(从经验中提炼洞察)和管理(更新记忆库)必须紧密协调。现有方法主要优化记忆管理,而将记忆提取视为静态过程,导致泛化能力差,代理累积特定实例的噪声,而不是稳健的记忆。为了解决这个问题,我们提出了统一记忆提取与管理(UMEM),一个自我演化的代理框架,联合优化大型语言模型以同时提取和管理记忆。为了减轻对特定实例的过拟合,我们引入了语义邻域建模,并通过GRPO优化模型,使用邻域级边际效用奖励。这种方法通过评估语义相关查询群体的记忆效用,确保记忆的泛化能力。在五个基准测试中的大量实验表明,UMEM显著优于高度竞争的基线,在多轮交互任务中实现了高达10.67%的提升。此外,UMEM在持续演化过程中保持单调增长曲线。代码和模型将公开发布。
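UMEM's reward is optimized with GRPO, which scores each rollout relative to its group; the paper's contribution is to form those groups from clusters of semantically related queries. The standard group-relative advantage normalization at the heart of GRPO can be sketched as follows (the reward values are illustrative):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: center each reward by the
    group mean and scale by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for rollouts evaluated across one neighborhood of related queries.
adv = grpo_advantages([0.2, 0.8, 0.5, 0.5])
```

Because advantages sum to zero within each group, only rollouts that beat their neighborhood average receive positive updates, which is what lets a neighborhood-level utility signal penalize instance-specific memories.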
cs.CL / 28 / 2602.10657

Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

基准测试并非完全超出分布:词重叠预测性能
Chung, Woojin, Kim, Jeonghoon
Abstract
Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Thus, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.
Chinese Translation
理解高质量预训练数据的构成仍然是语言模型训练中的一个核心问题。在本研究中,我们探讨基准测试性能是否主要由预训练语料库与评估数据集之间的统计模式重叠程度驱动。我们使用词级一元组交叉熵和词频统计来衡量这种重叠,并在$10$个零样本基准、$4$个预训练数据集(涵盖$8.5\mathrm{B}$到$60\mathrm{B}$个标记)以及模型规模从$400\mathrm{M}$到$3\mathrm{B}$参数的范围内进行控制实验。我们的结果表明,词级一元组交叉熵与基准测试性能之间存在稳健的反向关系,这表明广泛使用的基准测试受到训练数据与评估数据之间词重叠的强烈影响。因此,具有相似词级一元组交叉熵的更大预训练子集会产生更好的下游结果,表明词频统计在塑造基准分数方面发挥了额外作用。综合来看,这些结果表明,许多标准基准相对于预训练语料库仅弱地超出分布,因此简单的词重叠统计能够预测基准性能。
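The paper's overlap statistic is easy to reproduce in a few lines. Below is a minimal sketch of word-level unigram cross-entropy between a training corpus and an evaluation text; the toy strings and the add-one smoothing are illustrative choices, not the paper's exact setup:

```python
import math
from collections import Counter

def unigram_cross_entropy(train_text, eval_text):
    """H(eval, train) = -sum_w p_eval(w) * log2 p_train(w), with add-one
    smoothing over the joint vocabulary so unseen words get finite mass."""
    train, ev = train_text.split(), eval_text.split()
    vocab = set(train) | set(ev)
    tc, ec = Counter(train), Counter(ev)
    denom = len(train) + len(vocab)          # add-one smoothing denominator
    h = 0.0
    for w, c in ec.items():
        p_train = (tc[w] + 1) / denom        # Counter returns 0 for unseen w
        p_eval = c / len(ev)
        h -= p_eval * math.log2(p_train)
    return h

# Higher word overlap with the training text yields lower cross-entropy.
h_close = unigram_cross_entropy("the cat sat on the mat", "the cat sat")
h_far = unigram_cross_entropy("the cat sat on the mat", "quantum flux capacitor")
```

The inverse relationship the paper reports corresponds to `h_close < h_far`: benchmarks whose words are well covered by the pre-training corpus score lower cross-entropy and, empirically, higher accuracy.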
cs.CL / 29 / 2602.10661

Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment

针对格鲁吉亚语格对齐的语言模型定向句法评估
Gallagher, Daniel, Heyer, Gerhard
Abstract
This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.
Chinese Translation
本文评估了基于 Transformer 的语言模型在格鲁吉亚语分裂作格(split-ergative)格对齐上的表现,这是一种用语法格标记论元角色的极为罕见的系统。我们关注通过主格、作格和与格名词形式的各种排列来确定的主语和宾语标记。我们实现了一种基于树库的方法,利用 Grew 查询语言生成最小对。我们创建了一个包含370个句法测试的数据集,由七个任务组成,每个任务包含50-70个样本,每个样本测试三种名词形式。我们使用词级和/或句子级准确率指标评估了五个仅编码器模型和两个仅解码器模型。无论具体句法构成如何,模型在正确分配作格方面表现最差,而在正确分配主格方面表现最好。性能与三种形式的整体频率分布相关(NOM > DAT > ERG)。尽管数据稀缺是低资源语言的已知问题,但我们表明,作格高度特定的语法角色以及可用训练数据的缺乏可能共同导致了模型在该格上的较差表现。该数据集已公开发布,所采用的方法为在基准有限的语言上进行未来句法评估提供了一条有趣的途径。
cs.CL / 30 / 2602.10715

Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Locomo-Plus:面向大型语言模型代理的超越事实的认知记忆评估框架
Li, Yifei, Guo, Weidong, Zhang, Lingling, Xu, Rongman, Huang, Muye, Liu, Hui, Xu, Lijiao, Xu, Yu, Liu, Jun
Abstract
Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce \textbf{LoCoMo-Plus}, a benchmark for assessing cognitive memory under cue--trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.
Chinese Translation
长期对话记忆是基于大型语言模型(LLM)的对话系统的一项核心能力,然而现有的基准和评估协议主要集中在表层的事实回忆上。在现实的互动中,适当的回应往往依赖于隐含的约束条件,例如用户状态、目标或价值观,这些在后续并未明确询问。为了评估这一情境,我们引入了LoCoMo-Plus,这是一个用于评估在提示-触发语义断裂下的认知记忆的基准,其中模型必须在长对话上下文中保留并应用潜在约束。我们进一步展示了传统的字符串匹配指标和显式任务类型提示与此类场景不一致,并提出了一种基于约束一致性的统一评估框架。在多种基础模型、基于检索的方法和记忆系统上的实验表明,认知记忆仍然具有挑战性,并揭示了现有基准未能捕捉到的失败情况。我们的代码和评估框架已公开发布,网址为:https://github.com/xjtuleeyf/Locomo-Plus。
cs.CL / 31 / 2602.10732

Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Macaron:通过模板填充实现的多语言和多文化推理的可控人类书写基准
Elsetohy, Alaa, Hadhoud, Sama, Wibowo, Haryo Akbarianto, Whitehouse, Chenxi, Winata, Genta Indra, Koto, Fajri, Aji, Alham Fikri
Abstract
Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed here: https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
Chinese Translation
多语言基准很少测试基于文化的推理:翻译数据集保持以英语为中心的场景,而以文化为先的数据集通常缺乏对所需推理的控制。我们提出了Macaron,一个以模板为先的基准,它在问题语言中将推理类型和文化方面进行分解。使用100个语言无关的模板,涵盖7种推理类型和22种文化方面,母语注释者创建了与场景对齐的英语和当地语言的多项选择题以及系统推导的真/假问题。Macaron包含11,862个实例,涵盖20个国家/文化背景、10种书写系统和20种语言(包括阿姆哈拉语、约鲁巴语、祖鲁语、吉尔吉斯语以及一些阿拉伯方言等低资源语言)。在对21个多语言大语言模型的零样本评估中,推理模式模型表现最佳,英语和当地语言之间几乎达到平衡,而开放权重模型在当地语言中的表现显著下降,且在真/假任务中常常接近随机猜测。基于文化的数学和计数模板始终是最具挑战性的。数据可在此访问:https://huggingface.co/datasets/AlaaAhmed2444/Macaron。
cs.CL / 32 / 2602.10740

Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

面向领域自适应视觉-语言模型的强化课程预对齐
Yan, Yuming, Yang, Shuo, Tang, Kai, Chen, Sihong, Zhang, Yang, Xu, Ke, Hu, Dan, Yu, Qun, Hu, Pengfei, Ngai, Edith C. H.
Abstract
Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.
Chinese Translation
视觉-语言模型(VLMs)展现了显著的通用能力,但在医疗影像或几何问题解决等专业领域往往表现不佳。监督微调(SFT)可以提升目标领域的性能,但通常会导致灾难性遗忘,从而限制了其泛化能力。因此,核心挑战在于如何在保持通用能力的同时,将VLMs适应新的领域。持续预训练在大型语言模型(LLMs)中有效扩展知识,但由于计算成本高昂以及大多数开源模型缺乏预训练数据,这在VLMs中可行性较低。这就需要高效的后训练适应方法。基于强化学习(RL)的方法,如群体相对策略优化(GRPO),在保持通用能力方面展现出潜力,但在模型最初缺乏足够领域知识的领域适应场景中,它们往往会失败,导致优化崩溃。为了解决这一问题,我们提出了强化课程预对齐(RCPA),这是一种新颖的后训练范式,引入了关注课程的渐进调制机制。在早期阶段,RCPA对输出施加部分约束,以安全地将模型暴露于新的领域概念中。随着模型对领域的熟悉度增加,训练逐渐过渡到完全生成优化,精炼响应并使其与领域特定偏好对齐。这种分阶段的适应方法在获取领域知识与保持通用多模态能力之间取得了平衡。针对专业领域和通用基准的广泛实验验证了RCPA的有效性,为构建高性能和领域适应的VLMs建立了实用路径。
cs.CL / 33 / 2602.10801

Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM

基于深度学习的黑箱大型语言模型知识边界表达方法
Sheng, Haotian, Wang, Heyong, Hong, Ming, He, Hongman, Liu, Junqiu
Abstract
Large Language Models (LLMs) have achieved remarkable success; however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs' lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs, which offer only API access without revealing internal parameters, largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Based on the knowledge distillation framework, this method designs a deep learning model. Taking the input question, output answer, and token probability from a black-box LLM as inputs, it constructs a mapping between the inputs and the model's internal knowledge state, enabling the quantification and expression of the black-box LLM's knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries. It significantly outperforms existing baseline models on metrics such as accuracy and recall rate. Furthermore, considering scenarios where some black-box LLMs do not support access to token probability, an adaptive alternative method is proposed. The performance of this alternative approach is close to that of LSCL and surpasses baseline models.
Chinese Translation
大型语言模型(LLMs)取得了显著的成功,但内容生成失真(幻觉)的出现限制了它们的实际应用。幻觉的核心原因在于LLMs缺乏对其存储内部知识的意识,无法像人类一样在超出其内部知识边界的问题上表达其知识状态。然而,现有关于知识边界表达的研究主要集中在白箱LLMs上,对于仅提供API访问而不揭示内部参数的黑箱LLMs的适用方法尚未得到充分探索。在此背景下,本文提出了一种名为LSCL(LLM监督置信学习)的基于深度学习的黑箱LLMs知识边界表达方法。该方法基于知识蒸馏框架设计了一个深度学习模型。它以黑箱LLM的输入问题、输出答案和标记概率作为输入,构建输入与模型内部知识状态之间的映射,从而实现对黑箱LLM知识边界的量化和表达。在多样的公共数据集和多个知名黑箱LLM上进行的实验表明,LSCL有效地帮助黑箱LLM准确表达其知识边界,并在准确率和召回率等指标上显著优于现有基线模型。此外,考虑到某些黑箱LLM不支持访问标记概率的场景,本文提出了一种自适应替代方法。该替代方法的性能接近LSCL,并超越了基线模型。
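To make the setup concrete, the sketch below shows the kind of summary features one could derive from a black-box LLM's per-token probabilities before feeding them, together with the question and answer text, into a confidence model. The specific features here are hypothetical illustrations, not the paper's architecture:

```python
import math

def probability_features(token_probs):
    """Summary features of a black-box LLM's per-token output
    probabilities, usable as numeric inputs to a confidence model.
    Feature choice (mean, minimum, log-perplexity) is illustrative."""
    n = len(token_probs)
    mean = sum(token_probs) / n
    minimum = min(token_probs)          # the least confident token
    # average negative log-likelihood, i.e. log-perplexity per token
    nll = -sum(math.log(p) for p in token_probs) / n
    return [mean, minimum, nll]

feats = probability_features([0.9, 0.8, 0.4, 0.95])
```

The adaptive alternative the abstract mentions, for APIs that expose no token probabilities at all, would drop these features and rely on the question/answer text alone.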
cs.CL / 34 / 2602.10816

Beyond Confidence: The Rhythms of Reasoning in Generative Models

超越信心:生成模型中的推理节奏
Liu, Deyuan, Wang, Zecheng, Qin, Zhanyue, Tu, Zhiying, Chu, Dianhui, Sui, Dianbo
Abstract
Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model's internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
Chinese Translation
大型语言模型(LLMs)展现出令人印象深刻的能力,但对输入上下文微小变化的敏感性却影响了其可靠性。传统的评估指标如准确率和困惑度无法有效评估局部预测的鲁棒性,因为归一化的输出概率可能掩盖LLM内部状态对扰动的固有韧性。我们提出了令牌约束界限(Token Constraint Bound, $\delta_{\mathrm{TCB}}$),这是一种新颖的指标,用于量化LLM在其主导的下一个令牌预测显著变化之前能够承受的最大内部状态扰动。$\delta_{\mathrm{TCB}}$与输出嵌入空间几何形状密切相关,提供了对模型内部预测承诺稳定性的洞察。我们的实验表明,$\delta_{\mathrm{TCB}}$与有效的提示工程相关,并揭示了在上下文学习和文本生成过程中被困惑度遗漏的关键预测不稳定性。$\delta_{\mathrm{TCB}}$提供了一种原则性、互补的方法来分析和潜在改善LLM预测的上下文稳定性。
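The geometric intuition behind $\delta_{\mathrm{TCB}}$ can be sketched for the simplest case: with logits $z = Eh$ (a bias-free readout), the smallest perturbation of the hidden state $h$ that flips the argmax token is the margin to the nearest pairwise decision boundary. The toy random embeddings below are an assumption for illustration; the paper's definition is more general:

```python
import numpy as np

def token_constraint_bound(h, E):
    """Smallest L2 perturbation of hidden state h that can change the
    argmax of logits z = E @ h (E: vocab x hidden output embeddings).
    For each rival token j, the distance from h to the hyperplane where
    the top token and j tie is (z_top - z_j) / ||e_top - e_j||."""
    z = E @ h
    top = int(np.argmax(z))
    bounds = []
    for j in range(E.shape[0]):
        if j == top:
            continue
        diff = E[top] - E[j]
        bounds.append((z[top] - z[j]) / np.linalg.norm(diff))
    return min(bounds)

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))       # toy vocabulary of 50 token embeddings
h = rng.normal(size=16)
delta = token_constraint_bound(h, E)
```

A small bound means the prediction sits near a decision boundary, precisely the local instability that normalized probabilities and perplexity can mask.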
cs.CL / 35 / 2602.10832

I can tell whether you are a Native Hawlêri Speaker! How ANN, CNN, and RNN perform in NLI-Native Language Identification

我能判断你是否是Hawlêri方言的母语者!人工神经网络、卷积神经网络和递归神经网络在母语识别中的表现
Garari, Hardi, Hassani, Hossein
Abstract
Native Language Identification (NLI) is a task in Natural Language Processing (NLP) that typically determines the native language of an author through their writing or a speaker through their speaking. It has various applications in different areas, such as forensic linguistics and general linguistics studies. Although considerable research has been conducted on NLI regarding two different languages, such as English and German, the literature indicates a significant gap regarding NLI for dialects and subdialects. The gap becomes wider in less-resourced languages such as Kurdish. This research focuses on NLI within the context of a subdialect of Sorani (Central) Kurdish. It aims to investigate the NLI for Hewlêri, a subdialect spoken in Hewlêr (Erbil), the Capital of the Kurdistan Region of Iraq. We collected about 24 hours of speech by recording interviews with 40 native or non-native Hewlêri speakers, 17 female and 23 male. We created three Neural Network-based models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), which were evaluated through 66 experiments, covering various time-frames from 1 to 60 seconds, undersampling, oversampling, and cross-validation. The RNN model showed the highest accuracy of 95.92% for 5-second audio segmentation, using an 80:10:10 data splitting scheme. The created dataset is the first speech dataset for NLI on the Hewlêri subdialect in the Sorani Kurdish dialect, which can be of benefit to various research areas.
Chinese Translation
母语识别(NLI)是自然语言处理(NLP)中的一项任务,通常通过作者的写作或说话者的口语来确定其母语。它在法医语言学和一般语言学研究等不同领域有着广泛的应用。尽管在涉及两种不同语言(如英语和德语)的NLI研究中已有相当多的研究,但文献表明在方言和次方言的NLI研究方面存在显著的空白。在资源较少的语言中,例如库尔德语,这一空白更为明显。本研究聚焦于Sorani(中央)库尔德语的一个次方言的NLI,旨在研究Hewlêri的NLI,这是一种在伊拉克库尔德斯坦地区首府Hewlêr(埃尔比勒)使用的次方言。我们通过录制与40名母语或非母语Hewlêri说话者(17名女性和23名男性)的访谈,收集了约24小时的语音数据。我们创建了三种基于神经网络的模型:人工神经网络(ANN)、卷积神经网络(CNN)和递归神经网络(RNN),并通过66个实验进行了评估,涵盖了从1到60秒的不同时间段、欠采样、过采样和交叉验证。RNN模型在5秒音频分段中显示出最高的准确率为95.92%,采用80:10:10的数据分割方案。创建的数据集是针对Hewlêri次方言的首个NLI语音数据集,对各个研究领域都具有潜在的益处。
cs.CL / 36 / 2602.10874

C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution

C-MOP:集成动量和边界感知聚类以增强提示演化
Yan, Binwei, Fu, Yifei, Zhu, Mingjian, Chen, Hanting, Yuan, Mingxuan, Wang, Yunhe, Hu, Hailin
Abstract
Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C-MOP (Cluster-based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch-level information to mine tripartite features--Hard Negatives, Anchors, and Boundary Pairs--to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C-MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C-MOP enables a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at https://github.com/huawei-noah/noah-research/tree/master/C-MOP.
Chinese Translation
自动提示优化是提升大型语言模型(LLMs)性能的一个有前景的方向。然而,现有方法常常受到噪声和冲突更新信号的困扰。在本研究中,我们提出了C-MOP(基于聚类的动量优化提示),这是一个通过边界感知对比采样(BACS)和动量引导的语义聚类(MGSC)来稳定优化的框架。具体而言,BACS利用批量级信息挖掘三元特征——困难负样本、锚点和边界对——以精确表征正负提示样本的典型表示和决策边界。为了解决语义冲突,MGSC引入了一种具有时间衰减的文本动量机制,从迭代过程中的波动梯度中提炼持久共识。大量实验表明,C-MOP在性能上始终优于如PromptWizard和ProTeGi等最先进基线,平均提升分别为1.58%和3.35%。值得注意的是,C-MOP使得一个具有30亿(3B)激活参数的通用LLM超越了一个700亿(70B)参数的特定领域稠密LLM,突显了其在推动精确提示演化方面的有效性。代码可在https://github.com/huawei-noah/noah-research/tree/master/C-MOP获取。
cs.CL / 37 / 2602.10881

Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis

基于大型语言模型的证据提取在元分析中的结构性失败诊断
Tan, Zhiyin, D'Souza, Jennifer
Abstract
Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM-Meta-Analysis).
Chinese Translation
系统评价和元分析依赖于将叙述性文章转化为结构化、以数字为基础的研究记录。尽管大型语言模型(LLMs)迅速发展,但尚不清楚它们是否能够满足这一过程的结构性要求,该要求依赖于在文档之间保持角色、方法和效应大小的归属,而不是识别孤立的实体。我们提出了一种结构性诊断框架,该框架将基于LLM的证据提取评估为一系列具有递增关系和数值复杂性的模式约束查询,从而能够精确识别超越原子级提取的失败点。通过使用涵盖五个科学领域的手动策划语料库,以及统一的查询套件和评估协议,我们在每个文档和长上下文、多文档输入模式下评估了两种最先进的LLM。在各个领域和模型中,单属性查询的性能保持在中等水平,但一旦任务需要在变量、角色、统计方法和效应大小之间保持稳定的绑定,性能则急剧下降。完整的元分析关联元组几乎以零可靠性被提取,而长上下文输入进一步加剧了这些失败。下游聚合放大了即使是微小的上游错误,使得语料库级统计数据不可靠。我们的分析表明,这些局限性并非源于实体识别错误,而是由于系统性的结构性崩溃,包括角色反转、交叉分析绑定漂移、密集结果部分的实例压缩和数值错误归属,表明当前的LLM缺乏进行自动化元分析所需的结构保真度、关系绑定和数值基础。代码和数据已在GitHub上公开(https://github.com/zhiyintan/LLM-Meta-Analysis)。
cs.CL / 38 / 2602.10886

The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

CLEF-2026 FinMMEval 实验室:金融人工智能系统的多语言和多模态评估
Xie, Zhuohan, Elbadry, Rania, Zhang, Fan, Georgiev, Georgi, Peng, Xueqing, Qian, Lingfei, Huang, Jimin, Dimitrov, Dimitar, Jani, Vanshikaa, Dai, Yuyang, Geng, Jiahui, Wang, Yuxia, Koychev, Ivan, Stoyanov, Veselin, Nakov, Preslav
Abstract
We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models' ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.
Chinese Translation
我们介绍了 CLEF 2026 FinMMEval 实验室的设置和任务,该实验室引入了首个针对金融大型语言模型(LLMs)的多语言和多模态评估框架。尽管近年来金融自然语言处理的进展使得市场报告、监管文件和投资者沟通的自动化分析成为可能,但现有基准仍然主要是单语的、仅限文本的,并且局限于狭窄的子任务。FinMMEval 2026 通过提供三个相互关联的任务来填补这一空白,这些任务涵盖了金融理解、推理和决策:金融考试问答、跨语言金融问答(PolyFiQA)和金融决策。综合来看,这些任务提供了一个全面的评估套件,用于衡量模型在不同语言和模态下的推理、概括和行动能力。该实验室旨在促进强大、透明和全球包容的金融人工智能系统的发展,并公开发布数据集和评估资源,以支持可重复的研究。
cs.CL / 39 / 2602.10908

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

SoftMatcha 2:一种快速且柔性模式匹配器,用于万亿规模语料库
Yoneda, Masataka, Matsushita, Yusuke, Kamoda, Go, Suenaga, Kohei, Akiba, Takuya, Waga, Masaki, Yokoi, Sho
Abstract
We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
Chinese Translation
我们提出了一种超快速且灵活的搜索算法,能够在不到0.3秒的时间内对万亿规模的自然语言语料库进行搜索,同时处理语义变体(替换、插入和删除)。我们的方法基于后缀数组的字符串匹配,能够很好地随着语料库规模的扩大而扩展。为了减轻由于查询的语义放宽所引起的组合爆炸,我们的方法建立在两个关键算法思想之上:通过磁盘感知设计实现的快速精确查找,以及动态语料库感知剪枝。我们理论上证明,所提出的方法通过利用自然语言的统计特性,抑制了搜索空间相对于查询长度的指数增长。在对FineWeb-Edu (Lozhkov et al., 2024)(1.4万亿个标记)的实验中,我们展示了我们的方法在搜索延迟方面显著低于现有方法:infini-gram (Liu et al., 2024)、infini-gram mini (Xu et al., 2025) 和 SoftMatcha (Deguchi et al., 2025)。作为实际应用,我们展示了我们的方法能够识别训练语料库中的基准污染,而这一点是现有方法未能发现的。我们还提供了一个在线演示,展示了在七种语言中快速、柔性的跨语料库搜索。
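The exact-lookup core of suffix-array search can be illustrated with a toy sketch (pure Python over a tiny in-memory corpus; the paper's disk-aware layout, pruning, and soft matching are not modeled here):

```python
import bisect

corpus = "the cat sat on the mat"
# Suffix array: starting positions of all suffixes, sorted lexicographically.
sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
suffixes = [corpus[i:] for i in sa]  # materialized for clarity; real indexes avoid this

def count_occurrences(pattern: str) -> int:
    """Count exact matches of `pattern` with two binary searches over the suffix array."""
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return hi - lo

assert count_occurrences("the") == 2
assert count_occurrences("at") == 3   # "cat", "sat", "mat"
```

Both lookups touch only O(|pattern| log n) characters, which is why the approach stays fast as the corpus grows.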
cs.CL / 40 / 2602.10947

Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

自闭症时间体验的计算现象学:量化生活中不可预测性的情感和叙事特征
Dudzic, Kacper, Drożdż, Karolina, Wodziński, Maciej, Szuła, Anastazja, Moskalewicz, Marcin
Abstract
Disturbances in temporality, such as desynchronization with the social environment and its unpredictability, are considered core features of autism with a deep impact on relationships. However, limitations regarding research on this issue include: 1) the dominance of deficit-based medical models of autism, 2) sample size in qualitative research, and 3) the lack of phenomenological anchoring in computational research. To bridge the gap between phenomenological and computational approaches and overcome sample-size limitations, our research integrated three methodologies. Study A: structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience. Study B: computational analysis of an autobiographical corpus of autistic narratives built for this purpose. Study C: a replication of a computational study using narrative flow measures to assess the perceived phenomenological authenticity of autistic autobiographies. Interviews revealed that the most significant differences between the autistic and control groups concerned unpredictability of experience. Computational results mirrored these findings: the temporal lexicon in autistic narratives was significantly more negatively valenced - particularly the "Immediacy & Suddenness" category. Outlier analysis identified terms associated with perceived discontinuity (unpredictably, precipitously, and abruptly) as highly negative. The computational analysis of narrative flow found that the autistic narratives contained within the corpus quantifiably resemble autobiographical stories more than imaginary ones. Overall, the temporal challenges experienced by autistic individuals were shown to primarily concern lived unpredictability and stem from the contents of lived experience, and not from autistic narrative construction.
Chinese Translation
时间性障碍,如与社会环境的不同步及其不可预测性,被认为是自闭症的核心特征,对人际关系产生深远影响。然而,关于这一问题的研究存在以下局限性:1)以缺陷为基础的自闭症医学模型的主导地位,2)定性研究中的样本规模,3)计算研究中缺乏现象学的锚定。为了弥合现象学与计算方法之间的差距,并克服样本规模的限制,我们的研究整合了三种方法。研究A:使用跨诊断时间体验评估工具对自闭症个体进行结构化现象学访谈。研究B:对为此目的构建的自闭症叙事的自传体语料库进行计算分析。研究C:使用叙事流测量来评估自闭症自传的感知现象学真实性的计算研究复制。访谈揭示,自闭症组与对照组之间最显著的差异在于体验的不可预测性。计算结果反映了这些发现:自闭症叙事中的时间词汇显著更具负面情感,特别是在“即时性与突发性”类别中。离群值分析识别出与感知不连续性相关的术语(不可预测地、突然地和急剧地)为高度负面。叙事流的计算分析发现,语料库中的自闭症叙事在量化上更像自传故事而非虚构故事。总体而言,自闭症个体所经历的时间挑战主要涉及生活中的不可预测性,源于生活体验的内容,而非自闭症叙事的构建。
cs.CL / 41 / 2602.10953

Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

搜索或加速:用于扩散语言模型的置信度切换位置束搜索
Cao, Mingyu, Correia, Alvaro, Louizos, Christos, Liu, Shiwei, Yin, Lu
Abstract
Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule, unmasking the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
Chinese Translation
扩散语言模型(DLMs)通过迭代去噪一个被掩蔽的序列来生成文本,反复决定在每一步中要承诺哪些位置。标准解码遵循贪婪规则:解掩最有信心的位置,但这种局部选择可能会使模型锁定在一个次优的解掩顺序上,特别是在推理密集的提示中。我们提出了SOAR,这是一种无训练的解码算法,它根据模型的不确定性调整其行为。当置信度较低时,SOAR会短暂扩大对替代解掩决策的搜索,以避免过早承诺;当置信度较高时,它会收缩搜索并并行解码多个位置,以减少去噪迭代的次数。在Dream-7B和LLaDA-8B上的数学推理和代码生成基准(GSM8K、MBPP、HumanEval)中,SOAR提高了生成质量,同时保持了竞争力的推理速度,为在DLM解码中平衡质量和效率提供了一种实用的方法。
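The confidence-switched commit rule described above can be sketched in a few lines (the threshold value is illustrative, and the widened beam search over unmasking orders is collapsed to a single-position fallback):

```python
HIGH = 0.9  # confidence threshold for the parallel-acceleration mode (illustrative)

def decode_step(confidences, masked):
    """Choose which masked positions to commit this step, SOAR-style:
    accelerate when certain, search narrowly when not."""
    cands = sorted(masked, key=lambda p: confidences[p], reverse=True)
    if confidences[cands[0]] >= HIGH:
        # High confidence: commit every position above the threshold in parallel,
        # cutting the number of denoising iterations.
        return [p for p in cands if confidences[p] >= HIGH]
    # Low confidence: commit only the single best position; the real algorithm
    # instead widens a beam over alternative unmasking orders here.
    return [cands[0]]

conf = {0: 0.95, 1: 0.92, 2: 0.40, 3: 0.30}
assert decode_step(conf, [0, 1, 2, 3]) == [0, 1]  # two confident positions at once
assert decode_step(conf, [2, 3]) == [2]           # uncertain: commit one, keep searching
```

The switch costs nothing at training time, which is what makes the method drop-in for existing DLMs.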
cs.CL / 42 / 2602.10993

LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

LoRA-Squeeze:简单有效的LoRA模块后调优和内调优压缩
Vulić, Ivan, Grycner, Adam, de Laroussilhe, Quentin, Pfeiffer, Jonas
Abstract
Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.
Chinese Translation
尽管标准低秩适应(Low-Rank Adaptation, LoRA)有众多变体,但仍然是参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的主流技术。然而,它面临着持续的挑战,包括最优秩和特定秩超参数的预选,以及异构秩模块和更复杂的LoRA衍生物的部署复杂性。在本研究中,我们提出了LoRA-Squeeze,一种简单而高效的方法,旨在通过在训练过程中动态或事后改变LoRA模块的秩来改善标准LoRA学习。我们的方法认为,首先学习一个表现力强的高秩解决方案,然后再进行压缩,优于直接学习一个受限的低秩解决方案。该方法涉及使用故意较高的源秩进行微调,重构或高效近似全权重更新矩阵的重构,然后使用随机奇异值分解(Randomized Singular Value Decomposition, RSVD)在较低目标秩下创建一个新的压缩LoRA模块。在13个文本和10个视觉语言任务上的大量实验表明,事后压缩通常产生的低秩适配器优于那些直接在目标秩下训练的适配器,特别是在允许在目标秩下进行少量微调步骤的情况下。此外,LoRA-Squeeze的逐步内调秩退火变体始终实现了最佳的LoRA规模与性能的权衡。
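The post-hoc squeeze step can be sketched with a toy update matrix (exact SVD stands in for the paper's randomized SVD, and the dimensions are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, src_rank, tgt_rank = 64, 16, 4   # toy dimensions, not the paper's settings

# A higher-rank LoRA module learned during fine-tuning: delta_W = B @ A.
B = rng.normal(size=(d, src_rank))
A = rng.normal(size=(src_rank, d))
delta_W = B @ A

# Post-hoc squeeze: decompose the reconstructed update, truncate to the target rank.
U, s, Vt = np.linalg.svd(delta_W, full_matrices=False)
B_new = U[:, :tgt_rank] * s[:tgt_rank]   # (d, tgt_rank), singular values folded in
A_new = Vt[:tgt_rank, :]                 # (tgt_rank, d)

approx = B_new @ A_new
# Relative Frobenius error of the best rank-4 approximation of delta_W.
err = np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)
assert np.linalg.matrix_rank(approx) == tgt_rank
```

By Eckart-Young, the truncated SVD gives the best rank-`tgt_rank` approximation of the update in Frobenius norm, which is why compressing a richer solution can beat training at the low rank directly.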
cs.CL / 43 / 2602.11028

Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

DementiaBank Pitt 语料库中早期认知衰退的语言指标:一项统计与机器学习研究
Avetisyan, Artsvik, Kumar, Sachin
Abstract
Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches. Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff's delta effect sizes. Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings. Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
Chinese Translation
背景:自发语言产生中的微妙变化是认知衰退最早的指示之一。识别具有语言可解释性的痴呆标记可以支持透明且以临床为基础的筛查方法。方法:本研究分析了来自 DementiaBank Pitt 语料库的自发言语转录,使用三种语言表示方式:原始清理文本、结合词汇和语法信息的词性(POS)增强表示,以及仅包含词性的句法表示。采用逻辑回归和随机森林模型在两种协议下进行评估:转录级别的训练-测试分割和受试者级别的五折交叉验证,以防止说话者重叠。使用全局特征重要性检查模型可解释性,并通过Mann-Whitney U检验和Cliff's delta效应量进行统计验证。结果:在各种表示中,模型表现稳定,句法和语法特征即使在缺乏词汇内容的情况下也保持强大的区分能力。受试者级别的评估结果更加保守但一致,特别是对于词性增强和仅词性表示。统计分析显示功能词使用、词汇多样性、句子结构和话语连贯性在组间存在显著差异,这与机器学习特征重要性发现密切相关。结论:结果表明,抽象语言特征在临床现实评估下捕捉到了早期认知衰退的稳健标记。通过将可解释的机器学习与非参数统计验证相结合,本研究支持使用基于语言的特征进行透明和可靠的语言基础认知筛查。
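One of the interpretable features the study reports, lexical diversity, together with a rank-based group comparison, can be sketched in a few lines (the transcripts below are invented toy examples, not corpus data, and the U statistic is a simplified toy version):

```python
def lexical_diversity(transcript: str) -> float:
    """Type-token ratio: unique tokens over total tokens."""
    toks = transcript.lower().split()
    return len(set(toks)) / len(toks)

def mann_whitney_u(xs, ys):
    """U statistic counting how often an x outranks a y (toy version; ties count 0.5)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    eq = sum(1 for x in xs for y in ys if x == y)
    return gt + 0.5 * eq

control = ["the boy reaches for the cookie jar while the stool tips over"]
cases = ["the boy the boy is is up there and the the jar"]
d_control = [lexical_diversity(t) for t in control]
d_case = [lexical_diversity(t) for t in cases]
u = mann_whitney_u(d_control, d_case)
assert d_control[0] > d_case[0]  # reduced lexical diversity in the toy case transcript
```

In practice a library implementation (e.g. `scipy.stats.mannwhitneyu`) with tie correction and effect sizes such as Cliff's delta would replace the toy statistic.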
cs.CL / 44 / 2602.11044

Language Model Inversion through End-to-End Differentiation

通过端到端微分实现语言模型反演
Denamganaï, Kevin Yandoka, Subr, Kartic
Abstract
Despite emerging research on Language Models (LM), few approaches analyse the invertibility of LMs. That is, given a LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
Chinese Translation
尽管关于语言模型(Language Models, LM)的研究不断涌现,但对语言模型可逆性的分析方法仍然较少。具体而言,给定一个语言模型和一个期望的目标输出序列,确定哪些输入提示能够产生该目标输出仍然是一个未解决的问题。我们将此问题表述为经典的基于梯度的优化问题。首先,我们提出了一种简单的算法,以实现给定(冻结)语言模型的端到端微分,然后通过梯度下降找到优化的提示。我们的核心见解是将语言模型视为在令牌分布序列上操作的函数(而不是传统上视为在令牌序列上操作的函数)。我们的实验和消融研究表明,我们基于DLM的反演可以可靠且高效地优化长度为10和80的提示,以达到长度为20的目标,适用于多种白盒语言模型(开箱即用)。
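The central relaxation, treating the input as a distribution over tokens rather than a discrete token, can be sketched with a one-position prompt and a single frozen linear layer standing in for the LM (all sizes and the learning rate are toy assumptions; the paper works with real white-box LMs):

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 8, 4                      # toy vocabulary and embedding sizes (illustrative)
E = rng.normal(size=(V, D))      # frozen embedding table
W = rng.normal(size=(D, D))      # stand-in for the frozen LM: a single linear layer

target_tok = 3
target_out = E[target_tok] @ W   # output the optimized prompt should reproduce

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Relax the discrete prompt to a distribution over tokens and descend on its logits.
theta = np.zeros(V)
lr = 0.01
for _ in range(500):
    p = softmax(theta)
    out = (p @ E) @ W                       # "LM" applied to a soft (one-position) prompt
    grad_out = 2.0 * (out - target_out)     # gradient of the squared error
    grad_p = (grad_out @ W.T) @ E.T         # backpropagate through the linear layers
    grad_theta = p * (grad_p - p @ grad_p)  # softmax Jacobian-vector product
    theta -= lr * grad_theta

recovered = softmax(theta)  # final distribution over candidate prompt tokens
```

Because every step is differentiable, standard gradient descent applies end-to-end; whether the optimum concentrates on a single token depends on the embedding geometry.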
cs.CL / 45 / 2602.11047

Embedding Inversion via Conditional Masked Diffusion Language Models

通过条件掩蔽扩散语言模型进行嵌入反演
Xiao, Han
Abstract
We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves 81.3% token accuracy and 0.87 cosine similarity.
Chinese Translation
我们将嵌入反演框架化为条件掩蔽扩散,通过迭代去噪以并行方式恢复所有标记,而不是采用顺序自回归生成。掩蔽扩散语言模型通过自适应层归一化以目标嵌入为条件,仅需在一个78M参数的模型中进行8次前向传播,而无需访问目标编码器。在三个嵌入模型的32个标记序列上,该方法实现了81.3%的标记准确率和0.87的余弦相似度。
cs.CL / 46 / 2602.11065

Conversational Behavior Modeling Foundation Model With Multi-Level Perception

具有多层感知的对话行为建模基础模型
Zhou, Dingkun, Pan, Shuchang, Lian, Jiachen, Banerjee, Siddharth, Pasumarthy, Sarika, Hebbar, Dhruv, Patel, Siddhant, Li, Zeyi Austin, Cheng, Kan Jen, Bordia, Sanay, Patel, Krish, Gupta, Akshaj, Li, Tingle, Anumanchipalli, Gopala
Abstract
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
Chinese Translation
人类对话是由隐含的思维链组织而成,这种思维链表现为定时的言语行为。捕捉这一感知路径是构建自然全双工交互系统的关键。我们提出了一个将这一过程建模为多层感知的框架,并通过思想图(Graph-of-Thoughts, GoT)推理对话行为。我们的方法通过分层标注方案形式化意图到行动的路径,预测高层次的交际意图和低层次的言语行为,以学习它们的因果和时间依赖关系。为了训练该系统,我们开发了一个高质量的语料库,将可控的、事件丰富的对话数据与人工标注的标签配对。GoT框架将流式预测结构化为一个不断演变的图,使变换器能够预测下一个言语行为,为其决策生成简明的理由,并动态地完善其推理。在合成和真实的双工对话上的实验表明,该框架提供了稳健的行为检测,产生可解释的推理链,并为全双工口语对话系统中的对话推理基准建立了基础。
cs.CL / 47 / 2602.11072

Simultaneous Speech-to-Speech Translation Without Aligned Data

无对齐数据的同步语音到语音翻译
Labiausse, Tom, Fabre, Romain, Estève, Yannick, Défossez, Alexandre, Zeghidour, Neil
Abstract
Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.
Chinese Translation
同步语音翻译要求在实时情况下将源语音翻译成目标语言,同时处理非单调的词依赖关系。传统方法依赖于带有词级对齐数据的监督训练,这种数据难以大规模收集,因此依赖于使用特定语言启发式的合成对齐,这些方法并不理想。我们提出了Hibiki-Zero,完全消除了对词级对齐的需求。这从根本上简化了训练流程,并使得能够无缝扩展到具有不同语法结构的多种语言,消除了设计特定语言对齐启发式的瓶颈。我们首先在句子级对齐数据上进行训练,以学习高延迟下的语音翻译,然后应用一种新颖的强化学习策略,使用GRPO优化延迟,同时保持翻译质量。Hibiki-Zero在翻译准确性、延迟、语音转移和自然性方面,在五个X到英语的任务中达到了最先进的性能。此外,我们展示了我们的模型可以在少于1000小时的语音数据下适应支持新的输入语言。我们提供示例、模型权重、推理代码,并发布了一个包含45小时多语言数据的基准,用于语音翻译评估。
cs.CL / 48 / 2602.11081

SteuerLLM: Local specialized large language model for German tax law analysis

SteuerLLM:用于德国税法分析的本地专业大型语言模型
Wind, Sebastian, Sopa, Jeta, Schmid, Laurin, Jackl, Quirin, Kiefer, Sebastian, Wu, Fei, Mayr, Martin, Köstler, Harald, Wellein, Gerhard, Maier, Andreas, Arasteh, Soroosh Tayebi
Abstract
Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.
Chinese Translation
大型语言模型(LLMs)展现出强大的通用推理和语言理解能力,但在受严格形式规则、精确术语和法律约束结构支配的领域,其表现却有所下降。税法正是这些挑战的典型例子,因为正确的答案需要准确的法条引用、结构化的法律论证以及在严格评分标准下的数字准确性。我们通过算法生成了 SteuerEx,这是第一个基于真实德国大学税法考试的开放基准。SteuerEx包含115个经过专家验证的考试问题,涵盖六个核心税法领域和多个学术层次,并采用一种与真实考试实践紧密相似的陈述级部分评分评估框架。我们进一步提出了SteuerLLM,这是一种针对德国税法的领域适应型LLM,基于从真实考试材料生成的大规模合成数据集进行训练,采用了受控的检索增强管道。SteuerLLM(28B参数)在性能上始终优于同类规模的通用指令调优模型,并且在多个案例中显著超越更大规模的系统,证明了在现实法律推理任务上,领域特定数据和架构适应比参数规模更具决定性。所有基准数据、训练数据集、模型权重和评估代码均已开放发布,以支持领域特定法律人工智能的可重复研究。SteuerLLM的基于网络的演示可在 https://steuerllm.i5.ai.fau.de 获得。
cs.CL / 49 / 2602.11089

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

DataChef:通过强化学习为大语言模型适应烹饪最佳数据配方
Chen, Yicheng, Ma, Zerun, Xie, Xinchen, Li, Yining, Chen, Kai
Abstract
In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
Chinese Translation
在当前的大语言模型(LLMs)环境中,大规模高质量训练数据的策划是模型性能的主要驱动因素。一个关键的杠杆是“数据配方”,它包括一个数据处理管道,将原始数据源转化为训练语料库。尽管越来越多地使用LLMs来自动化单个数据处理步骤,如数据合成和过滤,但数据配方的整体设计仍然主要是手动和劳动密集型的,需耗费大量人力专业知识和迭代。为了解决这一问题,我们提出了面向LLM适应的“端到端数据配方生成”任务。给定一个目标基准和一组可用数据源,模型需要输出一个完整的数据配方,以将基础LLM适应于目标任务。我们展示了DataChef-32B,它使用代理奖励进行在线强化学习,以预测候选配方的下游性能。在六个保留任务中,DataChef-32B生成的实际配方在下游性能上与人类专家策划的配方相当。值得注意的是,DataChef-32B的配方将Qwen3-1.7B-Base适应于数学领域,在AIME'25上取得了66.7的成绩,超越了Qwen3-1.7B。这项工作为自动化LLM训练和开发自我进化的人工智能系统提供了新的视角。
cs.CL / 50 / 2602.11091

Can Large Language Models Make Everyone Happy?

大型语言模型能让每个人都快乐吗?
Naseem, Usman, Kashyap, Gautam Siddharth, Shabbir, Ebad, Ray, Sushant Kumar, Mohammad, Abdullah, Ali, Rafiq
Abstract
Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK, which build on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset spanning a taxonomy of 112 normative domains, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified into one of three orthogonal semantic types (object, attribute, or relation misalignment) using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507, with SimHash-based fingerprinting to avoid duplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing 12%-34% misalignment trade-offs across dimensions.
Chinese Translation
大型语言模型(LLMs)中的不一致性指的是未能同时满足安全、价值和文化维度,导致在这些维度必须共存的现实环境中,行为偏离人类期望。现有的基准测试,如 SAFETUNEBED(以安全为中心)、VALUEBENCH(以价值为中心)和 WORLDVIEW-BENCH(以文化为中心),主要孤立地评估这些维度,因此对它们之间的相互作用和权衡提供的见解有限。最近的努力,包括基于机制可解释性的 MIB 和 INTERPRETABILITY BENCHMARK,提供了关于模型失败的有价值视角;然而,它们仍不足以系统性地表征跨维度的权衡。为了解决这些问题,我们引入了 MisAlign-Profile,一个统一的基准,用于测量受机制剖析启发的不一致性权衡。首先,我们构建了 MISALIGNTRADE,一个涵盖 112 个规范领域分类的英语不一致-一致数据集,包括 14 个安全领域、56 个价值领域和 42 个文化领域。除了领域标签外,每个提示还通过 Gemma-2-9B-it 被分类为三种正交语义类型之一——对象、属性或关系不一致——并通过 Qwen3-30B-A3B-Instruct-2507 进行扩展,使用基于 SimHash 的指纹技术以避免重复。每个提示通过两阶段拒绝采样与不一致和一致的响应配对,以确保质量。其次,我们在 MISALIGNTRADE 上对通用、微调和开放权重的 LLM 进行了基准测试,揭示了跨维度的 12%-34% 不一致性权衡。
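The SimHash fingerprinting used for near-duplicate control can be sketched as follows (MD5 token hashing and whitespace tokenization are assumptions of this toy version, not details from the paper):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Near-duplicate fingerprint: hash each token, sum signed bit votes,
    then keep the sign of each bit position."""
    votes = [0] * bits
    for tok in text.lower().split():
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

near = hamming(simhash("the quick brown fox jumps"),
               simhash("the quick brown fox leaps"))
far = hamming(simhash("the quick brown fox jumps"),
              simhash("completely different sentence entirely"))
assert near < far  # near-duplicates agree in most bit positions
```

Because near-duplicate texts land at small Hamming distance, duplicates can be filtered by thresholding the distance between fingerprints rather than comparing full texts.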
cs.CL / 51 / 2602.11096

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

推理模型中的安全恢复仅需少数早期引导步骤
Ghosal, Soumya Suvra, Chakraborty, Souradip, Singh, Vaibhav, Huang, Furong, Manocha, Dinesh, Bedi, Amrit Singh
Abstract
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
Chinese Translation
基于强化学习(RL)的后训练显式思维链(例如,GRPO)提高了多模态大规模推理模型(MLRMs)的推理能力。但最近的证据表明,这可能同时降低安全对齐并增加越狱成功率。我们提出了SafeThink,这是一种轻量级推理时防御机制,将安全恢复视为一个满足约束而非最大化目标。SafeThink通过安全奖励模型监控不断演变的推理轨迹,并在安全阈值被违反时有条件地注入一个优化的短纠正前缀(“等一下,安全思考”)。在我们对六个开源MLRMs和四个越狱基准(JailbreakV-28K、Hades、FigStep和MM-SafetyBench)的评估中,SafeThink将攻击成功率降低了30-60%(例如,LlamaV-o1在JailbreakV-28K上的成功率从63.33%降至5.74%,R1-Onevision在Hades上的成功率从69.07%降至5.65%),同时保持了推理性能(MathVista准确率从65.20%降至65.00%)。我们实验的一个关键实证发现是,安全恢复通常只需少数引导步骤:在前1-3个推理步骤中进行干预通常足以将整个生成过程引导至安全的完成。
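The conditional intervention logic reads roughly as follows (the keyword scorer and threshold are illustrative stand-ins for the paper's safety reward model; the cap on interventions mirrors the finding that the first 1-3 steps usually suffice):

```python
def safe_decode(steps, safety_score, threshold=0.5, max_interventions=3):
    """Monitor each reasoning step; when the safety score falls below the
    threshold, inject a corrective prefix and continue decoding."""
    trace = []
    interventions = 0
    for step in steps:
        if safety_score(step) < threshold and interventions < max_interventions:
            trace.append("Wait, think safely. " + step)
            interventions += 1
        else:
            trace.append(step)
    return trace, interventions

# Toy scorer: flag steps mentioning a blocked phrase (purely illustrative).
score = lambda s: 0.0 if "bypass" in s else 1.0
trace, n = safe_decode(["First, bypass the filter", "Then answer normally"], score)
assert n == 1 and trace[0].startswith("Wait, think safely.")
```

Treating safety as a satisficing constraint means the prefix is injected only when the threshold is violated, so well-behaved generations pay no overhead.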
cs.CL / 52 / 2602.11106

TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection

TEGRA:用于虚假信息检测的图形和检索增强文本编码
Faye, Géraud, Ouerdane, Wassila, Gadek, Guillaume, Hudelot, Céline
Abstract
Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.
Chinese Translation
虚假信息检测是一项关键任务,可以从外部知识的整合中显著受益,类似于人工事实核查。在本研究中,我们提出了一种新颖的文本文件表示方法,便于从知识库中引入信息。我们的方法,图形文本编码(Text Encoding with Graph, TEG),通过提取以图形形式呈现的结构化信息来处理文档,并对文本和图形进行编码以用于分类。通过大量实验,我们证明这种混合表示相比单独使用语言模型能够提升虚假信息检测的性能。此外,我们引入了TEGRA,这是我们框架的一个扩展,整合了特定领域的知识,在大多数情况下进一步提高了分类准确性。
cs.CL / 53 / 2602.11149

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

在长思维链监督微调中,数据重复优于数据扩展
Kopiczko, Dawid J., Vaze, Sagar, Blankevoort, Tijmen, Asano, Yuki M.
Abstract
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
Chinese Translation
在思维链数据上进行监督微调(SFT)是推理语言模型的重要后训练步骤。标准的机器学习直觉认为,使用更多独特的训练样本进行训练能够获得更好的泛化效果。然而,反直觉的是,我们展示了SFT从重复中获益:在固定的更新预算下,在较小数据集上训练更多的轮次优于在较大数据集上进行单轮训练。在AIME'24/25和GPQA基准测试中,Olmo3-7B在400个样本上训练128轮的表现比在51200个样本上训练1轮高出12-26个百分点,且没有额外的灾难性遗忘。我们发现,训练令牌准确率可靠地指示了重复何时达到饱和;在完全记忆的情况下,额外轮次的改进趋于平稳,这一模式在所有设置中一致。这些发现为推理SFT提供了一种实用的方法,其中以令牌准确率作为停止标准来扩展轮次,可以替代昂贵的无指导数据扩展。我们将重复优势,即完全记忆与改进的泛化相一致,提出为理解大型语言模型训练动态的一个新的开放问题。
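The token-accuracy stopping criterion can be sketched as a training loop (the `train_epoch` and `token_accuracy` hooks and the saturation threshold are hypothetical stand-ins; only the stopping logic follows the abstract):

```python
def train_with_repetition(dataset, train_epoch, token_accuracy, budget, saturation=0.999):
    """Repeat epochs over a small dataset until the update budget is spent or
    training token accuracy saturates (full memorization)."""
    epochs = 0
    while epochs * len(dataset) < budget:
        train_epoch(dataset)
        epochs += 1
        if token_accuracy(dataset) >= saturation:
            break   # further repetition plateaus once the data is memorized
    return epochs

# Toy model: accuracy rises with each epoch and saturates at 1.0.
state = {"acc": 0.0}
epochs = train_with_repetition(
    dataset=list(range(400)),
    train_epoch=lambda d: state.update(acc=min(1.0, state["acc"] + 0.25)),
    token_accuracy=lambda d: state["acc"],
    budget=400 * 128,
)
assert epochs == 4  # stops at memorization, well before the 128-epoch budget
```

The point of the criterion is that epoch count never has to be tuned in advance: training token accuracy tells you when repetition has stopped paying off.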