Daily Research Digest

arXiv Papers

2026-03-17
632 Papers · 4 Categories · 632 Translated
Robotics (111 papers)
cs.RO / 1 / 2603.13296

Rationale Behind Human-Led Autonomous Truck Platooning

Lu, Yukun, Li, Chenzhao, Jiang, Xintong, Zhang, Qiaoxuan
Abstract
Autonomous trucking has progressed rapidly in recent years, transitioning from early demonstrations to OEM-integrated commercial deployments. However, fully driverless freight operations across heterogeneous climates, infrastructure conditions, and regulatory environments remain technically and socially challenging. This paper presents a systematic rationale for human-led autonomous truck platooning as a pragmatic intermediate pathway. First, we analyze 53 major truck accidents across North America (2021-2026) and show that human-related factors remain the dominant contributors to severe crashes, highlighting both the need for advanced assistance/automated driving systems and the complexity of real-world driving environments. Second, we review recent industry developments and identify persistent limitations in long-tail edge cases, winter operations, remote-region logistics, and large-scale safety validation. Based on these findings, we argue that a human-in-the-loop (HiL) platooning architecture offers layered redundancy, adaptive judgment in uncertain conditions, and a scalable validation framework. Furthermore, the dual-use capability of follower vehicles enables an evolutionary transition from coordinated platooning to independent autonomous operation. Rather than representing a compromise, human-led platooning provides a technically grounded and societally aligned bridge toward large-scale autonomous freight deployment.
cs.RO / 2 / 2603.13313

MRPoS: Mixed Reality-Based Robot Navigation Interface Using Spatial Pointing and Speech with Large Language Model

Iglesius, Eduardo, Kobayashi, Masato, Uranishi, Yuki
Abstract
Recent advancements have made robot navigation more intuitive by transitioning from traditional 2D displays to spatially aware Mixed Reality (MR) systems. However, current MR interfaces often rely on manual "air tap" gestures for goal placement, which can be repetitive and physically demanding, especially for beginners. This paper proposes the Mixed Reality-Based Robot Navigation Interface using Spatial Pointing and Speech (MRPoS). This novel framework replaces complex hand gestures with a natural, multimodal interface combining spatial pointing with Large Language Model (LLM)-based speech interaction. By leveraging both sources of information, the system translates verbal intent into navigation goals visualized by MR technology. Comprehensive experiments comparing MRPoS against conventional gesture-based systems demonstrate that our approach significantly reduces task completion time and workload, providing a more accessible and efficient interface. For additional material, please check: https://mertcookimg.github.io/mrpos
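
As an illustration of the pointing-plus-speech idea (not the authors' implementation), the sketch below intersects a pointing ray from the MR headset with the floor plane and nudges the result by an offset parsed from speech; the function names and the intent dictionary are hypothetical.

```python
import numpy as np

def pointing_goal(origin, direction, floor_z=0.0):
    """Intersect the MR headset's pointing ray with the floor plane."""
    o, d = np.asarray(origin, float), np.asarray(direction, float)
    if abs(d[2]) < 1e-6:
        return None            # ray parallel to the floor
    t = (floor_z - o[2]) / d[2]
    return o + t * d if t > 0 else None   # None if pointing behind/up

def fuse_goal(point, intent):
    """Combine the pointed location with LLM-parsed verbal intent.

    intent: a structured dict such as {"action": "go", "offset": [dx, dy]},
    assumed to come from prompting an LLM with the speech transcript
    (the parser itself is not shown).
    """
    goal = np.array(point)
    goal[:2] += np.asarray(intent.get("offset", [0.0, 0.0]))
    return goal

# User points at the floor and says something like "a little to the left":
hit = pointing_goal(origin=[0, 0, 1.6], direction=[0.75, 0.4, -0.6])
print(fuse_goal(hit, {"action": "go", "offset": [0.0, 0.5]}))
```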
cs.RO / 3 / 2603.13315

Bi-HIL: Bilateral Control-Based Multimodal Hierarchical Imitation Learning via Subtask-Level Progress Rate and Keyframe Memory for Long-Horizon Contact-Rich Robotic Manipulation

Buamanee, Thanpimon, Kobayashi, Masato, Uranishi, Yuki
Abstract
Long-horizon contact-rich robotic manipulation remains challenging due to partial observability and unstable subtask transitions under contact uncertainty. While hierarchical architectures improve temporal reasoning and bilateral imitation learning enables force-aware control, existing approaches often rely on flat policies that struggle with long-horizon coordination. We propose Bi-HIL, a bilateral control-based multimodal hierarchical imitation learning framework for long-horizon manipulation. Bi-HIL stabilizes hierarchical coordination by integrating keyframe memory with a subtask-level progress rate that models phase progression within the active subtask and conditions both the high- and low-level policies. We evaluate Bi-HIL on unimanual and bimanual real-robot tasks, demonstrating consistent improvements over flat and ablated variants. The results highlight the importance of explicitly modeling subtask progression together with force-aware control for robust long-horizon manipulation. For additional material, please check: https://mertcookimg.github.io/bi-hil
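
A minimal sketch of the progress-rate conditioning described above, assuming a simple MLP policy rather than the paper's architecture; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ProgressConditionedPolicy(nn.Module):
    """Sketch of conditioning a policy on the active subtask and its scalar
    progress rate, so the network knows where it is within the current
    subtask phase (not the paper's actual architecture)."""

    def __init__(self, obs_dim, n_subtasks, act_dim, hidden=128):
        super().__init__()
        self.n_subtasks = n_subtasks
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_subtasks + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, subtask, progress):
        # obs: (B, obs_dim); subtask: (B,) int ids; progress: (B,) in [0, 1]
        onehot = nn.functional.one_hot(subtask, self.n_subtasks).float()
        x = torch.cat([obs, onehot, progress.unsqueeze(-1)], dim=-1)
        return self.net(x)

policy = ProgressConditionedPolicy(obs_dim=32, n_subtasks=4, act_dim=7)
action = policy(torch.randn(2, 32), torch.tensor([1, 3]),
                torch.tensor([0.25, 0.9]))
```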
cs.RO / 4 / 2603.13333

STL-SVPIO: Signal Temporal Logic guided Stein Variational Path Integral Optimization

Zheng, Hongrui, Zang, Zirui, Amine, Ahmad, Vasile, Cristian Ioan, Mangharam, Rahul
Abstract
Signal Temporal Logic (STL) enables formal specification of complex spatiotemporal constraints for robotic task planning. However, synthesizing long-horizon continuous control trajectories from complex STL specifications is fundamentally challenging due to the nested structure of STL robustness objectives. Existing solver-based methods, such as Mixed-Integer Linear Programming (MILP), suffer from exponential scaling, whereas sampling methods, such as Model-Predictive Path Integral control (MPPI), struggle with sparse, long-horizon costs. We introduce Signal Temporal Logic guided Stein Variational Path Integral Optimization (STL-SVPIO), which reframes STL as a globally informative, differentiable reward-shaping mechanism. By leveraging Stein Variational Gradient Descent and differentiable physics engines, STL-SVPIO transports a mutually repulsive swarm of control particles toward high-robustness regions. Our method transforms sparse logical satisfaction into tractable variational inference, mitigating the severe local-minima traps of standard gradient-based methods. We demonstrate that STL-SVPIO significantly outperforms existing methods in both robustness and efficiency on traditional STL tasks. Moreover, it solves complex long-horizon tasks, including multi-agent coordination with synchronization and queuing, where baselines either fail to discover feasible solutions or become computationally intractable. Finally, we use STL-SVPIO in agile robotic motion planning tasks with nonlinear dynamics, such as 7-DoF manipulation and half-cheetah backflips, to show the generalizability of our algorithm.
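
The transport step can be pictured with a plain Stein Variational Gradient Descent update in which a scaled robustness gradient stands in for the score function; this is a generic SVGD sketch under that assumption, not the paper's solver.

```python
import numpy as np

def svgd_step(X, grad_logp, h=0.5, eps=0.1):
    """One Stein Variational Gradient Descent update on a particle set.

    Particles follow a smoothed robustness gradient (standing in for
    grad log p) while the kernel term keeps the swarm mutually repulsive.
    """
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]           # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)    # RBF kernel matrix
    drive = K.T @ grad_logp(X)                     # pull toward high robustness
    repulse = (-2.0 / h) * np.einsum("ji,jid->id", K, diff)
    return X + eps * (drive + repulse) / n

# Toy robustness for "eventually reach g": rho(x) = -||x - g||^2, with
# beta * grad(rho) playing the role of grad log p.
g, beta = np.array([2.0, 1.0]), 4.0
grad = lambda X: -2.0 * beta * (X - g)

X = np.random.randn(64, 2)             # swarm of 64 control particles
for _ in range(500):
    X = svgd_step(X, grad)
print(X.mean(axis=0))                  # particle mean drifts toward g
```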
cs.RO / 5 / 2603.13433

Spatially Grounded Long-Horizon Task Planning in the Wild

Jung, Sehun, Song, HyunJee, Kim, Dong-Hee, Tan, Reuben, Gao, Jianfeng, Lee, Yong Jae, Kim, Donghyun
Abstract
Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the plan, limiting evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We further introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that leverages real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our evaluations reveal that spatially grounded long-horizon planning remains a major bottleneck for current VLMs. Our results demonstrate that V2GP provides a promising approach for improving both action planning and spatial grounding performance, validated on our benchmark as well as through real-world robot manipulation experiments, advancing progress toward spatially actionable planning.
cs.RO / 6 / 2603.13502

Safety-guaranteed and Goal-oriented Semantic Sensing, Communication, and Control for Robotics

Wu, Wenchao, Chen, Shutong, Liu, Wenjie, Pang, Zhibo, Deng, Yansha
Abstract
Wirelessly connected robotic systems empower robots with real-time intelligence by leveraging remote computing resources for decision-making. However, the data exchange between robots and base stations often overwhelms communication links, introducing latency that undermines real-time response. To tackle this, goal-oriented semantic communication (GSC) has been introduced into wirelessly connected robotic systems to extract and transmit only goal-relevant semantic representations, enhancing communication efficiency and task effectiveness. However, existing GSC approaches have focused primarily on optimizing effectiveness metrics while overlooking safety requirements, which should be treated as the top priority in real-world robotic systems. To bridge this gap, we propose safety-guaranteed and goal-oriented semantic communication for wirelessly connected robotic systems, aiming to maximize robotic task effectiveness subject to practical operational safety requirements. We first summarize the general safety requirements and effectiveness metrics across typical robotic tasks, including robot arm grasping, unmanned aerial vehicle (UAV)-assisted tasks, and multi-robot exploration. We then systematically analyze the unique safety and effectiveness challenges faced by wirelessly connected robotic systems in sensing, communication, and control. Based on these, we further present potential safety-guaranteed and goal-oriented sensing, communication, and control solutions. Finally, a UAV target tracking case study validates that our proposed GSC solutions can significantly improve the safety rate and tracking success rate, by more than 2 times and 4.5 times, respectively.
cs.RO / 7 / 2603.13528

Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis

Li, Dayou, Lei, Jiuzhou, Wang, Hao, Liu, Lulin, Yang, Yunhao, Wang, Zihan, Liu, Bangya, Zheng, Minghui, Fan, Zhiwen
Abstract
While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
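
The verification stage reads as a conjunction of filters over generated rollouts. The sketch below is a schematic of that idea with stub checks; the names and check logic are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    frames: list    # generated video frames
    actions: list   # perturbed action sequence
    task: str

def make_verifier(checks: List[Callable[[Rollout], bool]]):
    """A synthesized failure rollout survives only if every filter passes;
    the stub checks stand in for task validity, visual coherence, and
    kinematic safety (the real checks are model- and robot-specific)."""
    def verify(rollout: Rollout) -> bool:
        return all(check(rollout) for check in checks)
    return verify

verify = make_verifier([
    lambda r: r.task != "",                          # task validity stub
    lambda r: len(r.frames) > 1,                     # visual coherence stub
    lambda r: all(abs(a) < 1.0 for a in r.actions),  # kinematic safety stub
])
candidates = [Rollout(frames=[0, 1], actions=[0.2, -0.3], task="pick")]
dataset = [r for r in candidates if verify(r)]
```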
cs.RO / 8 / 2603.13531

Fabric Pneumatic Artificial Muscle-Based Head-Neck Exosuit: Design, Modeling, and Evaluation

Schäffer, Katalin, Bales, Ian, Zhang, Haohan, McGuinness, Margaret
Abstract
Wearable exosuits assist human movement in tasks ranging from rehabilitation to daily activities; specifically, head-neck support is necessary for patients with certain neurological disorders. Rigid-link exoskeletons have been shown to enable head-neck mobility compared to static braces, but their bulkiness and restrictive structure inspire designs using "soft" actuation methods. In this paper, we propose a fabric pneumatic artificial muscle-based exosuit design for head-neck support. We describe the design of our prototype and physics-based model, enabling us to derive the actuator pressures required to compensate for gravitational load. Our modeled range of motion and workspace analysis indicate that the limited actuator lengths impose slight limitations (83% workspace coverage), and gravity compensation imposes a more significant limitation (43% workspace coverage). We introduce compression force along the neck as a novel, potentially comfort-related metric. We further apply our model to compare the torque output of various actuator placement configurations, allowing us to select a design with stability in lateral deviation and high axial rotation torques. The model correctly predicts trends in measured data where wrapping the actuators around the neck is not a significant factor. Our test dummy and human user demonstrations confirm that the exosuit can provide functional head support and trajectory tracking, underscoring the potential of artificial muscle-based soft actuation for head-neck mobility assistance.
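
To make the gravity-compensation idea concrete, a back-of-the-envelope version: balance the gravitational torque about the neck joint with an actuator force and map it to pressure through an assumed linear calibration. All numbers are illustrative, not the paper's identified parameters.

```python
import numpy as np

def gravity_comp_pressure(theta_rad, m=4.5, d=0.02, r=0.05, k=8.0, f0=2.0):
    """Balance the gravitational torque about the neck joint and map the
    required actuator force to a pressure.

    theta_rad : head flexion angle from upright
    m, d      : head mass [kg] and CoM offset from the joint [m]
    r         : actuator moment arm about the joint [m]
    k, f0     : assumed linear force-pressure fit F = k * P - f0
                (F in N, P in kPa); the paper derives this relation from
                its physics-based model instead.
    """
    g = 9.81
    tau = m * g * d * np.sin(theta_rad)   # gravitational torque [N m]
    force = tau / r                        # balancing actuator force [N]
    return (force + f0) / k                # required pressure [kPa]

print(gravity_comp_pressure(np.deg2rad(30.0)))
```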
cs.RO / 9 / 2603.13567

End-to-End O-RAN Testbed for Edge-AI-Enabled 5G/6G Connected Industrial Robotics

Talosi, Sasa, Vincan, Vladimir, Sobot, Srdjan, Martic, Goran, Morosev, Vladimir, Ninkovic, Vukan, Miskovic, Dragisa, Vukobratovic, Dejan
Abstract
Connected robotics is one of the principal use cases driving the transition towards more intelligent and capable 6G mobile cellular networks. Replacing wired connections with highly reliable, high-throughput, and low-latency 5G/6G radio interfaces enables robotic system mobility and the offloading of compute-intensive artificial intelligence (AI) models for robotic perception and control to servers located at the network edge. The transition towards Edge AI as a Service (E-AIaaS) simplifies on-site maintenance of robotic systems and reduces operational costs in industrial environments, while supporting flexible AI model life-cycle management and seamless upgrades of robotic functionalities over time. In this paper, we present a 5G/6G O-RAN-based end-to-end testbed that integrates E-AIaaS for connected industrial robotic applications. The objective is to design and deploy a generic experimental platform based on open technologies and interfaces, demonstrated through an E-AIaaS-enabled autonomous welding scenario. Within this scenario, the testbed is used to investigate trade-offs among different data acquisition, edge processing, and real-time streaming approaches for robotic perception, while supporting emerging paradigms such as semantic and goal-oriented communications.
cs.RO / 10 / 2603.13582

Creating manufacturable blueprints for coarse-grained virtual robots

Guo, Zihan, Li, Muhan, Zhang, Shuzhe, Kriegman, Sam
Abstract
Over the past three decades, countless embodied yet virtual agents have freely evolved inside computer simulations, but vanishingly few were realized as physical robots. This is because evolution was conducted at a level of abstraction that was convenient for freeform body generation (creation, mutation, recombination) but swept away almost all of the physical details of functional body parts. The resulting designs were crude and underdetermined, requiring considerable effort and expertise to convert into a manufacturable format. Here, we automate this mapping from simplified design spaces that are readily evolvable to complete blueprints that can be directly followed by a builder. The pipeline incrementally resolves manufacturing constraints by embedding the structural and functional semantics of motors, electronics, batteries, and wiring into the abstract virtual design. In lieu of evolution, a user-defined or AI-generated "sketch" of a body plan can also be fed as input to the pipeline, providing a versatile framework for accelerating the design of novel robots.
cs.RO / 11 / 2603.13585

Sonar-MASt3R: Real-Time Opti-Acoustic Fusion in Turbid, Unstructured Environments

Phung, Amy, Camilli, Richard
Abstract
Underwater intervention is an important capability in several marine domains, with numerous industrial, scientific, and defense applications. However, existing perception systems used during intervention operations rely on data from optical cameras, which limits capabilities in poor visibility or lighting conditions. Prior work has examined opti-acoustic fusion methods, which use sonar data to resolve the depth ambiguity of the camera data while using camera data to resolve the elevation angle ambiguity of the sonar data. However, existing methods cannot achieve dense 3D reconstructions in real-time, and few studies have reported results from applying these methods in a turbid environment. In this work, we propose the opti-acoustic fusion method Sonar-MASt3R, which uses MASt3R to extract dense correspondences from optical camera data in real-time and pairs it with geometric cues from an acoustic 3D reconstruction to ensure robustness in turbid conditions. Experimental results using data recorded from an opti-acoustic eye-in-hand configuration across turbidity values ranging from <0.5 to >12 NTU highlight this method's improved robustness to turbidity relative to baseline methods.
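
The complementary-ambiguity idea can be sketched for an idealized co-located camera-sonar pair: the camera pixel fixes the ray direction, and the sonar fixes the range along it. The intrinsics and the co-location assumption below are illustrative.

```python
import numpy as np

def fuse_camera_sonar(pixel, K, sonar_range):
    """Idealized co-located opti-acoustic fusion: the camera pixel fixes
    the ray direction (resolving the sonar's elevation-angle ambiguity),
    and the sonar range fixes the scale along that ray (resolving the
    camera's depth ambiguity)."""
    u, v = pixel
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-project the pixel
    ray /= np.linalg.norm(ray)                       # unit ray, camera frame
    return sonar_range * ray                         # 3D point on the ray

K = np.array([[500.0, 0.0, 320.0],   # hypothetical pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
print(fuse_camera_sonar((400, 260), K, sonar_range=2.5))
```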
cs.RO / 12 / 2603.13616

Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

Snyder, David, Badithela, Apurva, Matni, Nikolai, Pappas, George, Majumdar, Anirudha, Itkina, Masha, Nishimura, Haruki
Abstract
Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.
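
A minimal anytime-valid comparison in the test-by-betting style, which is a standard SAVI construction and not necessarily the paper's exact procedure: accumulate wealth on paired score differences and stop once it crosses 1/alpha.

```python
import numpy as np

def sequential_compare(diffs, alpha=0.05, lam=0.2):
    """Anytime-valid policy comparison by betting.

    diffs : stream of per-episode score differences d_t = score_A - score_B,
            assumed bounded in [-1, 1] (e.g. differences of task progress).
    H0    : E[d_t | past] <= 0, i.e. policy A is no better than B.
    The wealth prod_t (1 + lam * d_t) is a nonnegative supermartingale
    under H0, so stopping when it reaches 1/alpha keeps the type-I error
    below alpha at any stopping time (Ville's inequality).
    """
    wealth = 1.0
    for t, d in enumerate(diffs, start=1):
        wealth *= 1.0 + lam * d
        if wealth >= 1.0 / alpha:
            return t, wealth       # stop early: evidence that A beats B
    return None, wealth            # undecided on this evaluation budget

rng = np.random.default_rng(0)
# Paired evaluations where A's task progress beats B's by 0.15 on average:
diffs = np.clip(rng.uniform(0, 1, 500) + 0.15 - rng.uniform(0, 1, 500), -1, 1)
print(sequential_compare(diffs))
```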
cs.RO / 13 / 2603.13698

SAATT Nav: a Socially Aware Autonomous Transparent Transportation Navigation Framework for Wheelchairs

Zhang, Yutong, Mehra, Shaiv Y., Duerstock, Bradley S., Wachs, Juan P.
Abstract
While powered wheelchairs reduce physical fatigue compared to manual wheelchairs for individuals with mobility impairment, they demand a high cognitive workload due to information processing, decision making, and motor coordination. Current autonomous systems lack social awareness in navigation and transparency in decision-making, leading to decreased perceived safety and trust from the user and others in context. This work proposes the Socially Aware Autonomous Transparent Transportation (SAATT) Navigation framework for wheelchairs as a potential solution. By implementing a Large Language Model (LLM) informed of user intent and capable of predicting other people's intent as a decision-maker for its local controller, the system is able to detect and navigate social situations, such as passing pedestrians or a pair conversing. Furthermore, the LLM textually communicates its reasoning at each waypoint for transparency. In this experiment, it is compared against a standard global planner, a representative competing social navigation model, and an ablated variant in three simulated environments of varying social complexity, using eight metrics categorized under Safety, Social Compliance, Efficiency, and Comfort. Overall, SAATT Nav outperforms in most social situations and performs equivalently or only slightly worse on the remaining metrics, demonstrating the potential of a socially aware and transparent autonomous navigation system to assist wheelchair users.
cs.RO / 14 / 2603.13707

REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning

Gu, Zhaoyuan, Chen, Yipu, Chai, Zimeng, Cueva, Alfred, Nguyen, Thong, Wu, Yifan, Xue, Huishu, Kim, Minji, Legene, Isaac, Liu, Fukang, Kim, Matthew, Barula, Ayan, Chen, Yongxin, Zhao, Ye
Abstract
Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high-dimensional humanoid systems. To address this challenge, we present REFINE-DP (REinforcement learning FINE-tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high-level planner and an RL-based low-level loco-manipulation controller. The DP is fine-tuned via a PPO-based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE-DP on a humanoid robot performing loco-manipulation tasks, including door traversal and long-horizon object transport. REFINE-DP achieves a success rate of over 90% in simulation, even in out-of-distribution cases not seen in the pre-training data, and enables smooth autonomous task execution in real-world dynamic environments. Our proposed method substantially outperforms pre-trained DP baselines and demonstrates that RL fine-tuning is key to reliable humanoid loco-manipulation. https://refine-dp.github.io/REFINE-DP/
cs.RO / 15 / 2603.13732

LPV-MPC for Lateral Control in Full-Scale Autonomous Racing

Jardali, Hassan, Mohamed, Ihab S., Pushp, Durgakant, Liu, Lantao
Abstract
Autonomous racing has attracted significant attention recently, presenting challenges in selecting an optimal controller that operates within the onboard system's computational limits and meets operational constraints such as limited track time and high costs. This paper introduces a Linear Parameter-Varying Model Predictive Controller (LPV-MPC) for lateral control. Implemented on an IAC AV-24, the controller achieved stable performance at speeds exceeding 160 mph (71.5 m/s). We detail the controller design, the methodology for extracting model parameters, and key system-level and implementation considerations. Additionally, we report results from our final race run, providing a comprehensive analysis of both vehicle dynamics and controller performance. A Python implementation of the framework is available at: https://tinyurl.com/LPV-MPC-acados
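
A common way to build the LPV prediction model is to re-evaluate linear bicycle lane-keeping dynamics at the current speed each control step. The sketch below uses the standard textbook matrices with illustrative parameters, not the AV-24's identified values.

```python
import numpy as np

def lpv_lateral_model(vx, m=720.0, Iz=1000.0, lf=1.7, lr=1.2,
                      Cf=1.2e5, Cr=1.4e5):
    """Speed-scheduled lateral error dynamics x_dot = A(vx) x + B u.

    Standard linear bicycle lane-keeping model with states
    [lateral error, its rate, heading error, its rate] and steering input.
    """
    A = np.array([
        [0.0, 1.0, 0.0, 0.0],
        [0.0, -(Cf + Cr) / (m * vx), (Cf + Cr) / m,
         (Cr * lr - Cf * lf) / (m * vx)],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, (Cr * lr - Cf * lf) / (Iz * vx), (Cf * lf - Cr * lr) / Iz,
         -(Cf * lf ** 2 + Cr * lr ** 2) / (Iz * vx)],
    ])
    B = np.array([[0.0], [Cf / m], [0.0], [Cf * lf / Iz]])
    return A, B

# The LPV-MPC re-evaluates the model at the measured speed every step,
# e.g. near the reported top speed:
A, B = lpv_lateral_model(vx=71.5)
```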
cs.RO / 16 / 2603.13733

Implicit Maximum Likelihood Estimation for Real-time Generative Model Predictive Control

Lee, Grayson, Bui, Minh, Zhou, Shuzi, Li, Yankai, Chen, Mo, Li, Ke
Abstract
Diffusion-based models have recently shown strong performance in trajectory planning, as they are capable of capturing diverse, multimodal distributions of complex behaviors. A key limitation of these models is their slow inference speed, which results from the iterative denoising process. This makes them less suitable for real-time applications such as closed-loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well suited for real-time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion-based planner, while substantially improving planning speed in both open-loop and closed-loop settings. We further validate IMLE in a closed-loop human navigation scenario, operating in real-time, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.
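
The core IMLE objective is simple enough to state in a few lines: each data point is matched to its nearest generated sample, which encourages full mode coverage, while inference stays a single generator pass. A toy pooled-sample sketch (the original formulation samples per-data-point latent pools):

```python
import torch
import torch.nn as nn

def imle_loss(generator, data, n_samples=64, latent_dim=8):
    """Toy IMLE objective: each *data* point pulls its nearest *generated*
    sample toward it, so no mode of the data can be ignored."""
    z = torch.randn(n_samples, latent_dim)
    fake = generator(z)                       # (n_samples, D)
    d = torch.cdist(data, fake)               # (N, n_samples) distances
    nearest = d.min(dim=1).values             # nearest generation per datum
    return (nearest ** 2).mean()

gen = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
data = torch.randn(128, 2) * 0.3 + torch.tensor([2.0, -1.0])  # toy "plans"
for _ in range(200):
    opt.zero_grad()
    imle_loss(gen, data).backward()
    opt.step()
```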
cs.RO / 17 / 2603.13748

Multi-Robot Coordination for Planning under Context Uncertainty

Rustagi, Pulkit, Wray, Kyle Hollins, Saisubramanian, Sandhya
Abstract
Real-world robots often operate in settings where objective priorities depend on the underlying context of operation. When the underlying context is unknown a priori, multiple robots may have to coordinate to gather informative observations to infer the context, since acting based on an incorrect context can lead to misaligned and unsafe behavior. Once the underlying true context is inferred, the robots optimize their task-specific objectives in the preference order induced by the context. We formalize this problem as a Multi-Robot Context-Uncertain Stochastic Shortest Path (MR-CUSSP), which captures context-relevant information at landmark states through joint observations. Our two-stage solution approach is composed of: (1) CIMOP (Coordinated Inference for Multi-Objective Planning) to compute plans that guide robots toward informative landmarks to efficiently infer the true context, and (2) LCBS (Lexicographic Conflict-Based Search) for collision-free multi-robot path planning with lexicographic objective preferences induced by the context. We evaluate the algorithms using three simulated domains and demonstrate their practical applicability using five mobile robots in the salp domain setup.
cs.RO / 18 / 2603.13756

Exploration-assisted Bottleneck Transition Toward Robust and Data-efficient Deformable Object Manipulation

Onishi, Yujiro, Takizawa, Ryo, Ohmura, Yoshiyuki, Kuniyoshi, Yasuo
Abstract
Imitation learning has demonstrated impressive results in robotic manipulation but fails under out-of-distribution (OOD) states. This limitation is particularly critical in Deformable Object Manipulation (DOM), where the near-infinite possible configurations render comprehensive data collection infeasible. Although several methods address OOD states, they typically require exhaustive data or highly precise perception. Such requirements are often impractical for DOM owing to its inherent complexities, including self-occlusion. To address the OOD problem in DOM, we propose a novel framework, Exploration-assisted Bottleneck Transition for Deformable Object Manipulation (ExBot), which addresses the OOD challenge through two key advantages. First, we introduce bottleneck states, standardized configurations that serve as starting points for task execution. This enables the reconceptualization of OOD challenges as the problem of transitioning diverse initial states to these bottleneck states, significantly reducing demonstration requirements. Second, to account for imperfect perception, we partition the OOD state space based on recognizability and employ dual action primitives. This approach enables ExBot to manipulate even unrecognizable states without requiring accurate perception. By concentrating demonstrations around bottleneck states and leveraging exploration to alter perceptual conditions, ExBot achieves both data efficiency and robustness to severe OOD scenarios. Real-world experiments on rope and cloth manipulation demonstrate successful task completion from diverse OOD states, including severe self-occlusions.
cs.RO / 19 / 2603.13781

KoopmanFlow: Spectrally Decoupled Generative Control Policy via Koopman Structural Bias

Yao, Chengsi, Wang, Ge, Kang, Kai, Yan, Shenhao, Yang, Jiahao, Feng, Fan, Cai, Honghao, Zeng, Xianxian, Chen, Rongjun, Zhao, Yiming, Han, Yatong, Li, Xi
Abstract
Generative Control Policies (GCPs) show immense promise in robotic manipulation but struggle to simultaneously model stable global motions and high-frequency local corrections. While modern architectures extract multi-scale spatial features, their underlying Probability Flow ODEs apply a uniform temporal integration schedule. When compressed to a single step for real-time Receding Horizon Control (RHC), uniform ODE solvers mathematically smooth over sparse, high-frequency transients entangled within low-frequency steady states. To decouple these dynamics without accumulating pipelined errors, we introduce KoopmanFlow, a parameter-efficient generative policy guided by a Koopman-inspired structural inductive bias. Operating in a unified multimodal latent space with visual context, KoopmanFlow bifurcates generation at the terminal stage. Because visual conditioning occurs before spectral decomposition, both branches are visually guided yet temporally specialized. A macroscopic branch anchors slow-varying trajectories via single-step Consistency Training, while a transient branch uses Flow Matching to isolate high-frequency residuals stimulated by sudden visual cues (e.g., contacts or occlusions). Guided by an explicit spectral prior and optimized via a novel asymmetric consistency objective, KoopmanFlow establishes a fused co-training mechanism. This allows the variant branch to absorb localized dynamics without multi-stage error accumulation. Extensive experiments show KoopmanFlow significantly outperforms state-of-the-art baselines in contact-rich tasks requiring agile disturbance rejection. By trading a surplus latency buffer for a richer structural prior, KoopmanFlow achieves superior control fidelity and parameter efficiency within real-time deployment limits.
cs.RO / 20 / 2603.13782

Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection

Jeong, Jaehwan, Zhu, Evelyn, Lin, Jinying, Jaimes, Emmanuel, Vu, Tuan-Anh, Joo, Jungseock, Kim, Sangpil, Jawed, M. Khalid
Abstract
Vision-Language-Action (VLA) models have demonstrated strong potential for predicting semantic actions in navigation tasks, demonstrating the ability to reason over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuristics. In this work, we discover that monitoring a few attention heads within a frozen VLA model can accurately detect path deviations without incurring additional computational overhead. We refer to these heads, which inherently capture the spatiotemporal causality between historical visual sequences and linguistic instructions, as Navigation Heads. Using these heads, we propose an intuitive, training-free anomaly-detection framework that monitors their signals to detect hallucinations in real time. Surprisingly, among over a thousand attention heads, a combination of just three is sufficient to achieve a 44.6% deviation detection rate with a low false-positive rate of 11.7%. Furthermore, upon detecting a deviation, we bypass the heavy VLA model and trigger a lightweight Reinforcement Learning (RL) policy to safely execute a shortest-path rollback. By integrating this entire detection-to-recovery pipeline onto a physical robot, we demonstrate its practical robustness. All source code will be publicly available.
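
Conceptually, the monitor only needs pooled statistics from a handful of pre-selected heads of the frozen model. The sketch below shows one hypothetical way to score deviations; the pooling choice, head indices, calibration stats, and threshold are assumptions, not the paper's selection procedure.

```python
import numpy as np

def deviation_score(attn, head_ids, ref_mean, ref_std):
    """Training-free monitor sketch: pool a statistic from a few fixed
    attention heads and report the worst z-score drift against statistics
    calibrated on successful runs."""
    scores = []
    for layer, head in head_ids:
        a = attn[layer][head]                  # (query_len, key_len) weights
        scores.append(a.max(axis=-1).mean())   # pooled per-head statistic
    z = (np.array(scores) - ref_mean) / ref_std
    return np.abs(z).max()

rng = np.random.default_rng(1)
# Synthetic attention maps: 3 layers, 8 heads, 16 queries, 32 keys.
attn = {l: rng.dirichlet(np.ones(32), size=(8, 16)) for l in (10, 14, 21)}
score = deviation_score(attn, [(10, 3), (14, 7), (21, 1)],
                        ref_mean=0.12, ref_std=0.02)
# e.g. flag a deviation (and trigger the RL rollback) when score > 3.0
```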
cs.RO / 21 / 2603.13785

Robust Sim-to-Real Cloth Untangling through Reduced-Resolution Observations via Adaptive Force-Difference Quantization

Tsurumine, Yoshihisa, Kadokawa, Yuki, Hayashi, Kohei, Diehm, Christian, Matsubara, Takamitsu
Abstract
Robotic cloth untangling requires progressively disentangling fabric by adapting pulling actions to changing contact and tension conditions. Because large-scale real-world training is impractical due to cloth damage and hardware wear, sim-to-real policy transfer is a promising solution. However, cloth manipulation is highly sensitive to interaction dynamics, and policies that depend on precise force magnitudes often fail after transfer because similar force responses cannot be reproduced due to the reality gap. We observe that untangling is largely characterized by qualitative tension transitions rather than exact force values. This indicates that directly minimizing the sim-to-real gap in raw force measurements does not necessarily align with the task structure. We therefore hypothesize that emphasizing coarse force-change patterns while suppressing fine environment-dependent variations can improve robustness of sim-to-real transfer. Based on this insight, we propose Adaptive Force-Difference Quantization (ADQ), which reduces observation resolution by representing force inputs as discretized temporal differences and learning state-dependent quantization thresholds adaptively. This representation mitigates overfitting to environment-specific force characteristics and facilitates direct sim-to-real transfer. Experiments in both simulation and real-world cloth untangling demonstrate that ADQ achieves higher success rates and exhibits greater robustness in sim-to-real transfer than policies using raw force inputs. Supplementary video is available at https://youtu.be/ZeoBs-t0AWc
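
The representation can be pictured as a ternary code over temporal force differences; in ADQ the deadband thresholds are learned and state-dependent, whereas this sketch simply takes them as given.

```python
import numpy as np

def adq_encode(forces, thresholds):
    """Ternary coding of temporal force differences: -1 / 0 / +1 for
    falling, steady, and rising tension, so the policy sees qualitative
    transitions rather than exact magnitudes."""
    diffs = np.diff(forces, axis=0)
    codes = np.zeros_like(diffs, dtype=int)
    codes[diffs > thresholds] = 1
    codes[diffs < -thresholds] = -1
    return codes

forces = np.array([0.0, 0.1, 0.9, 1.0, 0.2])  # tension spike, then release
print(adq_encode(forces, thresholds=0.3))      # -> [0 1 0 -1]
```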
cs.RO / 22 / 2603.13788

ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

Wu, You, Chen, Zixuan, Ou, Cunxu, Wang, Wenxuan, Huang, Wenbo, Cao, Lin, Chen, Yangtao, Qiu, Weichao, Quan, Xingyue, Shi, Jieqi, Huo, Jing, Gao, Yang
Abstract
Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a semi-automated pipeline. Using ST-Human, we train ST-VLM, a spatio-temporal vision-language model that generates spatially grounded and temporally coherent 3D representations to guide policy execution. The smooth spatial masks focus on task-relevant geometry and stabilize latent representations, enabling online replanning and long-horizon reasoning. Experiments on RLBench and real-world manipulation tasks show that ST-VLA significantly outperforms state-of-the-art baselines, improving zero-shot success rates by 44.6% and 30.3%. These results demonstrate that offloading spatio-temporal reasoning to VLMs with unified 3D-4D representations substantially improves robustness and generalization for open-world robotic manipulation. Project website: https://oucx117.github.io/ST-VLA/.
cs.RO / 23 / 2603.13825

Building Explicit World Model for Zero-Shot Open-World Object Manipulation

Li, Xiaotong, Chen, Gang, Alonso-Mora, Javier
Abstract
Open-world object manipulation remains a fundamental challenge in robotics. While Vision-Language-Action (VLA) models have demonstrated promising results, they rely heavily on large-scale robot action demonstrations, which are costly to collect and can hinder out-of-distribution generalization. In this paper, we propose an explicit-world-model-based framework for open-world manipulation that achieves zero-shot generalization by constructing a physically grounded digital twin of the environment. The framework integrates open-set perception, digital-twin reconstruction, and sampling and evaluation of interaction strategies. By constructing a digital twin of the environment, our approach efficiently explores and evaluates manipulation strategies in a physics-enabled simulator and reliably deploys the chosen strategy to the real world. Experimentally, the proposed framework is able to perform multiple open-set manipulation tasks without any task-specific action demonstrations, proving strong zero-shot generalization at both the task and object levels. Project Page: https://bojack-bj.github.io/projects/thesis/
cs.RO / 24 / 2603.13829

ArrayTac: A tactile display for simultaneous rendering of shape, stiffness and friction

Liang, Tianhai, Guo, Shiyi, Cheng, Baiye, Xue, Zhengrong, Zhang, Han, Xu, Huazhe
Abstract
Human-computer interaction in the visual and auditory domains has achieved considerable maturity, yet machine-to-human tactile feedback remains underdeveloped. Existing tactile displays struggle to simultaneously render multiple tactile dimensions, such as shape, stiffness, and friction, which limits the realism of haptic simulation. Here, we present ArrayTac, a piezoelectric-driven tactile display capable of simultaneously rendering shape, stiffness, and friction to reproduce realistic haptic signals. The system comprises a 4x4 array of 16 actuator units, each employing a three-stage micro-lever mechanism to amplify the micrometer-scale displacement of the piezoelectric element, with Hall sensor-based closed-loop control at the end effector to enhance response speed and precision. We further implement two end-to-end pipelines: 1) a vision-to-touch framework that converts visual inputs into tactile signals using multimodal foundation models, and 2) a real-time tele-palpation system operating over distances of several thousand kilometers. In user studies, first-time participants accurately identified object shapes and physical properties with high success rates. In a tele-palpation experiment over 1,000 km, untrained volunteers correctly identified both the number and type of tumors in a breast phantom with 100% accuracy and precisely localized their positions. The system pioneers a new pathway for high-fidelity haptic feedback by introducing the unprecedented capability to simultaneously render an object's shape, stiffness, and friction, delivering a holistic tactile experience that was previously unattainable.
cs.RO / 25 / 2603.13832

GraspADMM: Improving Dexterous Grasp Synthesis via ADMM Optimization

Ruan, Liangwang, Chen, Jiayi, Wang, He, Chen, Baoquan
Abstract
Synthesizing high-quality dexterous grasps is a fundamental challenge in robot manipulation, requiring adherence to diversity, kinematic feasibility (valid hand-object contact without penetration), and dynamic stability (secure multi-contact forces). The recent framework Dexonomy successfully ensures broad grasp diversity through dense sampling and improves kinematic feasibility via a simulator-based refinement method that excels at resolving exact collisions. However, its reliance on fixed contact points restricts the hand's reachability and prevents the optimization of grasp metrics for dynamic stability. Conversely, purely gradient-based optimizers can maximize dynamic stability but rely on simplified contact approximations that inevitably cause physical penetrations. To bridge this gap, we propose GraspADMM, a novel grasp synthesis framework that preserves sampling-based diversity while improving kinematic feasibility and dynamic stability. By formulating the refinement stage using the Alternating Direction Method of Multipliers (ADMM), we decouple the target contact points on the object from the actual contact locations on the hand. This decomposition allows the pipeline to alternate between updating the target object points to directly maximize dynamic grasp metrics, and adjusting the hand pose to physically reach these targets while strictly respecting collision boundaries. Extensive experiments demonstrate that GraspADMM significantly outperforms state-of-the-art baselines, achieving a nearly 15% absolute improvement in grasp success rate for type-unaware synthesis and roughly a 100% relative improvement in type-aware synthesis. Furthermore, our approach maintains robust, physically plausible grasp generation even under extreme low-friction conditions.
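
The decoupling can be sketched as a generic scaled-dual ADMM alternation between object-side target contacts p and a hand pose q with realized contacts c(q); the callables below are toy stand-ins, not the paper's grasp metric or collision handling.

```python
import numpy as np

def admm_grasp_refine(p, q, forward_contacts, metric_grad, project_feasible,
                      rho=1.0, iters=50, lr=0.1):
    """Schematic scaled-dual ADMM alternation in the spirit of GraspADMM.

    Split: p = target contact points on the object, q = hand pose whose
    realized contacts are c(q); consensus constraint p = c(q) with dual u.
    """
    u = np.zeros_like(p)
    for _ in range(iters):
        c = forward_contacts(q)
        # p-step: ascend the grasp metric while staying close to c(q) - u
        p = p + lr * (metric_grad(p) - rho * (p - c + u))
        # q-step: move the hand so its contacts reach p + u, then project
        # onto the feasible (collision-free) set
        q = project_feasible(q, target=p + u)
        # dual update on the consensus residual
        u = u + (p - forward_contacts(q))
    return p, q

# Toy instantiation in 2D: "contacts" are the pose itself, the metric pulls
# toward g, and feasibility is a box constraint.
g = np.array([1.0, 0.5])
p, q = admm_grasp_refine(
    p=np.zeros(2), q=np.zeros(2),
    forward_contacts=lambda q: q,
    metric_grad=lambda p: -(p - g),
    project_feasible=lambda q, target: np.clip(target, -2.0, 2.0),
)
print(p, q)  # both converge near g inside the box
```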
cs.RO / 26 / 2603.13833

ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics

Chen, Jie, Cai, Yuxin, Wang, Yizhuo, Bai, Ruofei, Cao, Yuhong, Li, Jun, Yun, Yau Wei, Sartoretti, Guillaume
Abstract
Enabling robots to navigate open-world environments via natural language is critical for general-purpose autonomy. Yet, Vision-Language Navigation has relied on end-to-end policies trained on expensive, embodiment-specific robot data. While recent foundation models trained on vast simulation data show promise, the challenge of scaling and generalizing due to the limited scene diversity and visual fidelity in simulation persists. To address this gap, we propose ImagiNav, a novel modular paradigm that decouples visual planning from robot actuation, enabling the direct utilization of diverse in-the-wild navigation videos. Our framework operates as a hierarchy: a Vision-Language Model first decomposes instructions into textual subgoals; a finetuned generative video model then imagines the future video trajectory towards that subgoal; finally, an inverse dynamics model extracts the trajectory from the imagined video, which can then be tracked via a low-level controller. We additionally develop a scalable data pipeline of in-the-wild navigation videos auto-labeled via inverse dynamics and a pretrained Vision-Language Model. ImagiNav demonstrates strong zero-shot transfer to robot navigation without requiring robot demonstrations, paving the way for generalist robots that learn navigation directly from unlabeled, open-world data.
cs.RO / 27 / 2603.13842

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

Lian, Zhexi, Wang, Haoran, Yan, Xuerun, Lin, Weimeng, Zhang, Xianhong, Chen, Yongyu, Hu, Jia
Abstract
End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, such a paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often leads to a performance ceiling due to its dependence on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond the prior knowledge of IL. Furthermore, we introduce a tree-structured trajectory neural sampler for group relative policy optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on the NAVSIMv1 and v2 benchmarks demonstrates that PaIR-Drive achieves competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods, and can even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.
Chinese Translation
端到端自动驾驶通常基于模仿学习(Imitation Learning, IL),但其性能受到人类示范质量的限制。为了克服这一局限,最近的方法通过顺序微调(sequential fine-tuning)引入强化学习(Reinforcement Learning, RL)。然而,这种范式仍然不够理想:顺序的RL微调可能引入策略漂移,并且由于依赖于预训练的IL策略,往往导致性能上限。为了解决这些问题,我们提出了PaIR-Drive,一种用于端到端自动驾驶的协同模仿与强化学习的通用并行框架。在训练过程中,PaIR-Drive将IL和RL分为两个并行分支,具有无冲突的训练目标,从而实现完全的协同优化。这一设计消除了在应用新IL策略时重新训练RL的必要性。在推理阶段,RL利用IL策略进一步优化最终计划,使得性能超越IL的先验知识。此外,我们在RL分支中引入了一种树状轨迹神经采样器,以实现相对策略优化(Group Relative Policy Optimization, GRPO),增强了探索能力。在NAVSIMv1和v2基准上的广泛分析表明,PaIR-Drive在Transfuser和DiffusionDrive IL基准上实现了91.2的PDMS和87.9的EPDMS的竞争性能。PaIR-Drive始终优于现有的RL微调方法,甚至能够纠正人类专家的次优行为。定性结果进一步确认,PaIR-Drive能够有效探索并生成高质量轨迹。
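A minimal sketch of the parallel IL + RL structure described above, assuming a shared scene feature and two trajectory heads. The group-normalized advantage is a generic GRPO-style stand-in, not the paper's tree-structured sampler, and the driving score is a placeholder.

```python
# Sketch: parallel imitation and refinement branches with conflict-free
# objectives. Detaching the IL plan keeps RL gradients out of the IL head.
import torch
import torch.nn as nn

class ParallelPlanner(nn.Module):
    def __init__(self, feat_dim=64, horizon=8):
        super().__init__()
        self.il_head = nn.Linear(feat_dim, horizon * 2)  # imitation branch
        self.rl_head = nn.Linear(feat_dim, horizon * 2)  # refinement branch

    def forward(self, feats):
        return self.il_head(feats), self.rl_head(feats)

def grpo_advantage(scores: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: normalize scores within a sampled group."""
    return (scores - scores.mean()) / (scores.std() + 1e-6)

planner = ParallelPlanner()
feats = torch.randn(4, 64)
il_plan, rl_delta = planner(feats)

# IL branch: supervised loss against expert trajectories.
expert = torch.randn_like(il_plan)
il_loss = nn.functional.mse_loss(il_plan, expert)

# RL branch: refine the (detached) IL plan; score is a stand-in metric.
final_plan = il_plan.detach() + rl_delta
scores = -final_plan.pow(2).sum(dim=1)
rl_loss = -(grpo_advantage(scores).detach() * scores).mean()

(il_loss + rl_loss).backward()  # branches update without conflicting targets
```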
cs.RO / 28 / 2603.13844

LDHP: Library-Driven Hierarchical Planning for Non-prehensile Dexterous Manipulation

LDHP:基于库驱动的非抓握灵巧操作分层规划
He, Tierui, Zhao, Chao
Abstract
Non-prehensile manipulation is essential for handling thin, large, or otherwise ungraspable objects in unstructured settings. Prior planning and search-based methods often rely on ad-hoc manual designs or generate physically unrealizable motions by ignoring critical gripper properties, while training-based approaches are data-intensive and struggle to generalize to novel, out-of-distribution tasks. We propose a library-driven hierarchical planner (LDHP) that makes executability a first-class design goal: a top-tier contact-state planner proposes object-pose paths using MoveObject primitives, and a bottom-tier grasp planner synthesizes feasible grasp sequences with AdjustGrasp primitives; feasibility is certified by collision checks and quasi-static mechanics, and contact-sensitive segments are recovered via a bounded dichotomy refinement. This gripper-aware decomposition decouples object motion from grasp realizability, yields a task-agnostic pipeline that transfers across manipulation tasks and geometric variations without re-design, and exposes clean hooks for optional learned priors. Real-robot studies on zero-mobility lifting and slot insertion demonstrate consistent execution and robustness to shape and environment changes.
Chinese Translation
非抓握操作对于在非结构化环境中处理薄、巨大或其他无法抓握的物体至关重要。以往的规划和基于搜索的方法往往依赖于临时的手动设计,或者通过忽视关键的抓取器特性而生成物理上不可实现的动作,而基于训练的方法则数据密集且难以推广到新颖的、分布外的任务。我们提出了一种库驱动的分层规划器(LDHP),将可执行性作为首要设计目标:顶层接触状态规划器使用 MoveObject 原语提出物体姿态路径,底层抓取规划器使用 AdjustGrasp 原语合成可行的抓取序列;通过碰撞检测和准静态力学来认证可行性,并通过有界二分法细化恢复接触敏感段。这种考虑抓取器的分解将物体运动与抓取可实现性解耦,产生一个与任务无关的管道,能够在不同的操作任务和几何变换中无须重新设计地进行转移,并为可选的学习先验提供了清晰的接口。对零移动提升和插槽插入的真实机器人研究展示了一致的执行效果和对形状及环境变化的鲁棒性。
cs.RO / 29 / 2603.13869

TransDex: Pre-training Visuo-Tactile Policy with Point Cloud Reconstruction for Dexterous Manipulation of Transparent Objects

TransDex:基于点云重建的视觉-触觉策略预训练用于透明物体的灵巧操作
Li, Fengguan, Ma, Yifan, Qian, Chen, Rao, Wentao, Shang, Weiwei
Abstract
Dexterous manipulation enables complex tasks but suffers from self-occlusion, severe depth noise, and depth information loss when manipulating transparent objects. To solve this problem, this paper proposes TransDex, a 3D visuo-tactile fusion motor policy based on point cloud reconstruction pre-training. Specifically, we first propose a self-supervised point cloud reconstruction pre-training approach based on Transformer. This method accurately recovers the 3D structure of objects from interactive point clouds of dexterous hands, even when random noise and large-scale masking are added. Building on this, TransDex is constructed in which perceptual encoding adopts a fine-grained hierarchical scheme and multi-round attention mechanisms adaptively fuse features of the robotic arm and dexterous hand to enable differentiated motion prediction. Results from transparent object manipulation experiments conducted on a real robotic system demonstrate that TransDex outperforms existing baseline methods. Further analysis validates the generalization capabilities of TransDex and the effectiveness of its individual components.
Chinese Translation
灵巧操作能够完成复杂任务,但在操作透明物体时会遭遇自遮挡、严重的深度噪声和深度信息丢失等问题。为了解决这一问题,本文提出了TransDex,一种基于点云重建预训练的3D视觉-触觉融合运动策略。具体而言,我们首先提出了一种基于Transformer的自监督点云重建预训练方法。该方法能够准确恢复灵巧手的交互点云中的物体3D结构,即使在添加随机噪声和大规模遮挡的情况下也能有效工作。在此基础上,构建了TransDex,其中感知编码采用细粒度的层次方案,多轮注意机制自适应地融合机械臂和灵巧手的特征,以实现差异化的运动预测。在真实机器人系统上进行的透明物体操作实验结果表明,TransDex的性能优于现有的基线方法。进一步的分析验证了TransDex的泛化能力及其各个组成部分的有效性。
cs.RO / 30 / 2603.13888

Path-conditioned Reinforcement Learning-based Local Planning for Long-Range Navigation

基于路径条件强化学习的长距离导航局部规划
Haro, Mateo, Richter, Julia, Yang, Fan, Cadena, Cesar, Hutter, Marco
Abstract
Long-range navigation is commonly addressed through hierarchical pipelines in which a global planner generates a path, decomposed into waypoints, and followed sequentially by a local planner. These systems are sensitive to global path quality, as inaccurate remote sensing data can result in locally infeasible waypoints, which degrade local execution. At the same time, the limited global context available to the local planner hinders long-range efficiency. To address this issue, we propose a reinforcement learning-based local navigation policy that leverages path information as contextual guidance. The policy is conditioned on reference path observations and trained with a reward function mainly based on goal-reaching objectives, without any explicit path-following reward. Through this implicit conditioning, the policy learns to opportunistically exploit path information while remaining robust to misleading or degraded guidance. Experimental results show that the proposed approach significantly improves navigation efficiency when high-quality paths are available and maintains baseline-level performance when path observations are severely degraded or even non-existent. These properties make the method particularly well-suited for long-range navigation scenarios in which high-level plans are approximate and local execution must remain adaptive to uncertainty.
Chinese Translation
长距离导航通常通过分层管道来解决,其中全局规划器生成路径,并将其分解为航点,由局部规划器顺序跟随。这些系统对全局路径质量敏感,因为不准确的遥感数据可能导致局部不可行的航点,从而降低局部执行效率。同时,局部规划器可用的有限全局上下文也妨碍了长距离效率。为了解决这一问题,我们提出了一种基于强化学习的局部导航策略,该策略利用路径信息作为上下文指导。该策略以参考路径观测为条件,并通过主要基于目标到达目标的奖励函数进行训练,而没有任何明确的路径跟随奖励。通过这种隐式条件,策略学习到机会性地利用路径信息,同时保持对误导性或降级指导的鲁棒性。实验结果表明,当高质量路径可用时,所提出的方法显著提高了导航效率,并在路径观测严重降级或甚至不存在时保持基线水平的性能。这些特性使得该方法特别适合于高层计划近似且局部执行必须对不确定性保持适应性的长距离导航场景。
cs.RO / 31 / 2603.13907

LineMaster Pro: A Low-Cost Intelligent Line Following Robot with PID Control and Ultrasonic Obstacle Avoidance for Educational Robotics

LineMaster Pro:一种低成本智能循迹机器人,具有PID控制和超声波避障功能,适用于教育机器人
Shahi, Jeni, Shah, Abhishek, Akib, A. S. M. Ahsanul Sarkar
Abstract
Line following robots are fundamental platforms in robotics education, yet commercially available solutions remain prohibitively expensive ($150-$300) while lacking integrated obstacle detection capabilities essential for real-world applications. This paper presents LineMaster Pro, an intelligent low-cost line following robot implemented on an Arduino Nano platform that integrates dual TCRT5000 infrared sensors for precision line tracking, an HC-SR04 ultrasonic sensor for real-time obstacle detection, a digitally tuned PID controller with Ziegler-Nichols optimization, and a hierarchical finite state machine for robust obstacle avoidance. A systematic four-phase sensor calibration methodology ensures reliable operation across varying lighting and surface conditions. Experimental validation through 200 controlled trials and 72-hour continuous operation demonstrates mean tracking accuracy of 1.18 cm at 0.4 m/s (95% CI [1.06, 1.30]), obstacle detection reliability of 96.7% within the 10-40 cm range with a 0.7% false positive rate, and 94% successful recovery from path deviations. The PID implementation achieves a 43% improvement over conventional on-off control (p < 0.001). At a total hardware cost of $28.50 based on verified Bangladesh market prices, LineMaster Pro achieves a 94% cost reduction compared to commercial alternatives, establishing a practical benchmark for accessible robotics education in resource-constrained environments.
Chinese Translation
循迹机器人是机器人教育中的基本平台,但市售解决方案仍然价格高昂(150-300美元),且缺乏集成的障碍物检测能力,这对于实际应用至关重要。本文介绍了LineMaster Pro,这是一款在Arduino Nano平台上实现的低成本智能循迹机器人,集成了双TCRT5000红外传感器以实现精确的线路跟踪、HC-SR04超声波传感器以进行实时障碍物检测、经过Ziegler-Nichols优化的数字调谐PID控制器,以及用于稳健避障的分层有限状态机。系统的四阶段传感器校准方法确保在不同的光照和表面条件下可靠运行。通过200次控制试验和72小时的连续运行进行的实验验证表明,在0.4 m/s的速度下,平均跟踪精度为1.18 cm(95% CI [1.06, 1.30]),在10-40 cm范围内的障碍物检测可靠性为96.7%,假阳性率为0.7%,成功从路径偏离中恢复的比例为94%。PID实现相比传统的开关控制提高了43%(p<0.001)。基于经过验证的孟加拉国市场价格,总硬件成本为28.50美元,LineMaster Pro实现了与商业替代品相比94%的成本降低,为资源有限环境中的可及机器人教育建立了实用基准。
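For readers who want the control loop above made concrete, here is a minimal digital PID line follower with Ziegler-Nichols gains. The ultimate gain Ku and oscillation period Tu are assumed values from a hypothetical tuning run; sensor and motor I/O are stubs standing in for the TCRT5000 readings and motor PWM outputs.

```python
# Classic Ziegler-Nichols PID tuning and a fixed-rate control loop.
Ku, Tu = 2.0, 0.5                      # assumed values from a tuning run
Kp, Ki, Kd = 0.6 * Ku, 1.2 * Ku / Tu, 0.075 * Ku * Tu  # Z-N PID rules

def read_line_error():
    """Stub: signed offset from line center, from the two IR sensors."""
    return 0.0

def set_motors(base, correction):
    """Stub: differential drive; correction steers left/right."""
    left, right = base - correction, base + correction
    return left, right

integral, prev_err, dt, base_speed = 0.0, 0.0, 0.02, 0.5  # 50 Hz loop

for _ in range(10):                    # replace with while True on hardware
    err = read_line_error()
    integral += err * dt
    deriv = (err - prev_err) / dt
    correction = Kp * err + Ki * integral + Kd * deriv
    set_motors(base_speed, correction)
    prev_err = err
```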
cs.RO / 32 / 2603.13908

Data-Driven Autoregressive Power Prediction for GTernal Robots in the Robotarium

基于数据驱动的GTernal机器人自回归功率预测
Abdelmeguid, Yassin, Hasan, Ammar
Abstract
Energy-aware algorithms for multi-robot systems require accurate power consumption models, yet existing approaches rely on kinematic approximations that fail to capture the complex dynamics of real hardware. We present a lightweight autoregressive predictor for the GTernal mobile robot platform deployed in the Georgia Tech Robotarium. Through analysis of 48,000 samples collected across six motion trials, we discover that power consumption exhibits strong temporal autocorrelation ($\rho_1 = 0.95$) that dominates kinematic effects. A 7,041-parameter multi-layer perceptron (MLP) achieves $R^2 = 0.90$ on held-out motion patterns by conditioning on recent power history, reaching the theoretical prediction ceiling imposed by measurement noise. Physical validation across seven robots in a collision avoidance scenario yields mean $R^2 = 0.87$, demonstrating zero-shot transfer to unseen robots and behaviors. The predictor runs in 224 $\mu$s per inference, enabling real-time deployment at 150$\times$ the platform's 30 Hz control rate. We release the trained model and dataset to support energy-aware multi-robot algorithm development.
Chinese Translation
面向多机器人系统的能量感知算法需要准确的功耗模型,但现有方法依赖于运动学近似,无法捕捉真实硬件的复杂动态。我们提出了一种轻量级的自回归预测器,适用于在乔治亚理工学院机器人实验室部署的GTernal移动机器人平台。通过分析在六个运动试验中收集的48,000个样本,我们发现功耗表现出强烈的时间自相关性($\rho_1 = 0.95$),这一特性主导了运动学效应。一个具有7,041个参数的多层感知器(MLP)通过对近期功率历史进行条件化,在保留的运动模式上实现了$R^2 = 0.90$,达到了由测量噪声施加的理论预测上限。在七个机器人进行的碰撞避免场景中的物理验证显示平均$R^2 = 0.87$,证明了对未见过的机器人和行为的零样本迁移。该预测器每次推理运行时间为224微秒,使其能够以150倍于平台30 Hz控制频率的速度进行实时部署。我们发布了训练好的模型和数据集,以支持能量感知的多机器人算法开发。
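The core idea above, predicting the next power sample from recent power history plus current commands, fits in a few lines. The sketch below does not reproduce the paper's 7,041-parameter layout; window size, layer widths, and the command features are illustrative assumptions.

```python
# Sketch of an autoregressive power predictor conditioned on history.
import torch
import torch.nn as nn

HIST = 10  # recent power samples used as context

model = nn.Sequential(
    nn.Linear(HIST + 2, 32),  # history + (linear vel, angular vel)
    nn.ReLU(),
    nn.Linear(32, 1),         # next power sample
)

power_hist = torch.randn(1, HIST)   # last 10 power readings
cmd = torch.tensor([[0.1, 0.4]])    # current velocity command
pred = model(torch.cat([power_hist, cmd], dim=1))

# Autoregressive rollout: feed each prediction back into the history.
for _ in range(5):
    power_hist = torch.cat([power_hist[:, 1:], pred.detach()], dim=1)
    pred = model(torch.cat([power_hist, cmd], dim=1))
```

The autoregressive feedback is what lets the strong lag-1 autocorrelation reported above dominate purely kinematic features.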
cs.RO / 33 / 2603.13925

SmoothVLA: Aligning Vision-Language-Action Models with Physical Constraints via Intrinsic Smoothness Optimization

SmoothVLA:通过内在平滑性优化将视觉-语言-动作模型与物理约束对齐
Li, Jiashun, Shi, Xiaoyu, Xie, Hong, Shang, Mingsheng, Lu, Yun
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. However, existing post-training methods face a dilemma between stability and exploration: Supervised Fine-Tuning (SFT) is constrained by demonstration quality and lacks generalization, whereas Reinforcement Learning (RL) improves exploration but often induces erratic, jittery trajectories that violate physical constraints. To bridge this gap, we propose SmoothVLA, a novel reinforcement learning fine-tuning framework that synergistically optimizes task performance and motion smoothness. The technical core is a physics-informed hybrid reward function that integrates binary sparse task rewards with a continuous dense term derived from trajectory jerk. Crucially, this reward is intrinsic: it is computed directly from policy rollouts, without requiring extrinsic environment feedback or laborious reward engineering. Leveraging Group Relative Policy Optimization (GRPO), SmoothVLA establishes trajectory smoothness as an explicit optimization prior, guiding the model toward physically feasible and stable control. Extensive experiments on the LIBERO benchmark demonstrate that SmoothVLA outperforms standard RL by 13.8% in smoothness and significantly surpasses SFT in generalization across diverse tasks. Our work offers a scalable approach to aligning VLA models with physical-world constraints through intrinsic reward optimization.
Chinese Translation
视觉-语言-动作(VLA)模型已成为机器人操作的强大范式。然而,现有的后训练方法面临稳定性与探索之间的困境:监督微调(SFT)受到演示质量的限制,缺乏泛化能力,而强化学习(RL)虽然改善了探索,但常常导致不稳定、抖动的轨迹,违反物理约束。为了解决这一问题,我们提出了SmoothVLA,一种新颖的强化学习微调框架,协同优化任务性能和运动平滑性。其技术核心是一个物理信息混合奖励函数,该函数将二元稀疏任务奖励与基于轨迹抖动的连续密集项相结合。关键在于,这一奖励是内在的,直接从策略回放中计算,而无需外部环境反馈或繁琐的奖励工程。通过利用群体相对策略优化(GRPO),SmoothVLA将轨迹平滑性确立为明确的优化先验,引导模型朝向物理可行和稳定的控制。我们在LIBERO基准上的大量实验表明,SmoothVLA在平滑性方面比标准RL提高了13.8%,并在多样化任务的泛化能力上显著超越SFT。我们的工作提供了一种可扩展的方法,通过内在奖励优化将VLA模型与物理世界的约束对齐。
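A minimal sketch of the hybrid reward described above: a sparse binary task term plus a dense intrinsic term from trajectory jerk, computed directly from the policy's own rollout with no environment feedback. The weighting, timestep, and normalization are illustrative assumptions.

```python
# Jerk = third time derivative of position; penalize its mean square.
import numpy as np

def jerk_penalty(positions: np.ndarray, dt: float) -> float:
    """Mean squared jerk via third finite differences of the action path."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.mean(jerk ** 2))

def hybrid_reward(positions: np.ndarray, task_success: bool,
                  dt: float = 0.05, w_smooth: float = 1e-4) -> float:
    r_task = 1.0 if task_success else 0.0               # sparse binary term
    r_smooth = -w_smooth * jerk_penalty(positions, dt)  # dense intrinsic term
    return r_task + r_smooth

traj = np.cumsum(np.random.randn(50, 7) * 0.01, axis=0)  # fake 7-DoF rollout
print(hybrid_reward(traj, task_success=True))
```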
cs.RO / 34 / 2603.13944

ToMPC: Task-oriented Model Predictive Control via ADMM for Safe Robotic Manipulation

ToMPC:通过ADMM实现安全机器人操作的任务导向模型预测控制
Jia, Xinyu, Wang, Wenxin, Yang, Jun, Pan, Yongping, Yu, Haoyong
Abstract
This paper proposes a task-oriented model predictive control (ToMPC) framework for safe and efficient robotic manipulation in open workspaces. The framework unifies collision-free motion and robot-environment interaction to address diverse scenarios. Additionally, it introduces task-oriented obstacle avoidance that leverages kinematic redundancy to enhance manipulation efficiency in obstructed environments. This complex optimization problem is solved by the alternating direction method of multipliers (ADMM), which decomposes the problem into two subproblems tackled by differential dynamic programming (DDP) and quadratic programming (QP), respectively. The effectiveness of this approach is validated in simulation and hardware experiments on a Franka Panda robotic manipulator. Results demonstrate that the framework can plan motion and/or force trajectories in real time, maximize the manipulation range while avoiding obstacles, and strictly adhere to safety-related hard constraints.
Chinese Translation
本文提出了一种任务导向模型预测控制(ToMPC)框架,用于在开放工作空间中实现安全高效的机器人操作。该框架统一了无碰撞运动与机器人与环境的交互,以应对多样化的场景。此外,它引入了任务导向的障碍物规避,利用运动学冗余来提高在受阻环境中的操作效率。这个复杂的优化问题通过交替方向乘子法(ADMM)进行求解,该方法将问题分解为两个子问题,分别由微分动态规划(DDP)和二次规划(QP)解决。通过对Franka Panda机器人操纵器的仿真和硬件实验验证了该方法的有效性。结果表明,该框架能够实时规划运动和/或力轨迹,最大化操作范围,同时避免障碍物,并严格遵守与安全相关的硬约束。
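The ADMM decomposition in ToMPC alternates between a trajectory subproblem (solved by DDP in the paper) and a constraint subproblem (solved by QP), coordinated through a consensus variable and a scaled dual. The skeleton below shows that coordination pattern only; both solver stubs are toy closed-form stand-ins, not the paper's implementations.

```python
# Generic scaled-ADMM skeleton: x-update, z-update, dual update.
import numpy as np

def solve_ddp_subproblem(z, u):
    """Stub for the DDP step: pull the trajectory toward consensus."""
    return 0.5 * (z - u)             # placeholder closed-form update

def solve_qp_subproblem(x, u):
    """Stub for the QP step: project onto the constraint set."""
    return np.clip(x + u, -1.0, 1.0) # box constraint as a toy example

x, z, u = np.zeros(4), np.zeros(4), np.zeros(4)
for k in range(50):
    x = solve_ddp_subproblem(z, u)   # x-update (trajectory optimization)
    z = solve_qp_subproblem(x, u)    # z-update (hard constraints)
    u = u + x - z                    # scaled dual update
print(x, z)
```

The appeal of this split, as the abstract suggests, is that each subproblem stays in a form its specialized solver handles efficiently while the dual variable enforces agreement.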
cs.RO / 35 / 2603.13987

Vision-guided Autonomous Dual-arm Extraction Robot for Bell Pepper Harvesting

视觉引导的自主双臂采摘机器人用于甜椒收获
Bhat, Kshitij Madhav, Gao, Tom, Mathur, Abhishek, Satishkumar, Rohit, Yandun, Francisco, Bauer, Dominik, Pollard, Nancy
Abstract
Agricultural robotics has emerged as a critical solution to the labor shortages and rising costs associated with manual crop harvesting. Bell pepper harvesting, in particular, is a labor-intensive task, accounting for up to 50% of total production costs. While automated solutions have shown promise in controlled greenhouse environments, harvesting in unstructured outdoor farms remains an open challenge due to environmental variability and occlusion. This paper presents VADER (Vision-guided Autonomous Dual-arm Extraction Robot), a dual-arm mobile manipulation system designed specifically for the autonomous harvesting of bell peppers in outdoor environments. The system integrates a robust perception pipeline coupled with a dual-arm planning framework that coordinates a gripping arm and a cutting arm for extraction. We validate the system through trials in various realistic conditions, demonstrating a harvest success rate exceeding 60% with a cycle time of under 100 seconds per fruit, while also featuring a teleoperation fail-safe based on the GELLO teleoperation framework to ensure robustness. To support robust perception, we contribute a hierarchically structured dataset of over 3,200 images spanning indoor and outdoor domains, pairing wide-field scene images with close-up pepper images to enable a coarse-to-fine training strategy from fruit detection to high-precision pose estimation. The code and dataset will be made publicly available upon acceptance.
Chinese Translation
农业机器人已成为解决人工收割中劳动力短缺和成本上升的重要方案。甜椒收获尤其是一项劳动密集型任务,占总生产成本的比例高达50%。尽管自动化解决方案在受控的温室环境中显示出潜力,但在非结构化的户外农场进行收获仍然是一个开放的挑战,原因在于环境的变化和遮挡。本文提出了VADER(视觉引导的自主双臂采摘机器人),这是一个专门为户外环境中的甜椒自主收获设计的双臂移动操作系统。该系统集成了强大的感知管道,并结合了双臂规划框架,以协调抓取臂和切割臂进行采摘。我们通过在各种现实条件下的试验验证了该系统,展示了超过60%的收获成功率,单果循环时间低于100秒,同时还基于GELLO远程操作框架提供了远程操作的安全保障,以确保系统的稳健性。为了支持强大的感知能力,我们贡献了一个层次结构的数据集,包含超过3200张图像,涵盖室内和室外领域,将广角场景图像与特写甜椒图像配对,以实现从果实检测到高精度姿态估计的粗到细训练策略。代码和数据集将在接受后公开发布。
cs.RO / 36 / 2603.14010

URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical Simulation

URDF-Anything+: 自回归关节3D模型生成用于物理仿真
Wu, Zhuangzhe, Xin, Yue, Hou, Chengkai, Chen, Minghao, Lyu, Yaoxu, Zhang, Jieyu, Zhang, Shanghang
Abstract
Articulated objects are fundamental for robotics, physics simulation, and interactive virtual environments. However, reconstructing them from visual input remains challenging, as it requires jointly inferring both part geometry and kinematic structure. We present URDF-Anything+, an end-to-end autoregressive framework that directly generates executable articulated object models from visual observations. Given image and object-level 3D cues, our method sequentially produces part geometries and their associated joint parameters, resulting in complete URDF models without reliance on multi-stage pipelines. The generation proceeds until the model determines that all parts have been produced, automatically inferring complete geometry and kinematics. Building on this capability, we enable a new Real-Follow-Sim paradigm, where high-fidelity digital twins constructed from visual observations allow policies trained and tested purely in simulation to transfer to real robots without online adaptation. Experiments on large-scale articulated object benchmarks and real-world robotic tasks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter accuracy, and physical executability.
Chinese Translation
关节对象在机器人技术、物理仿真和交互式虚拟环境中具有基础性的重要性。然而,从视觉输入重建这些对象仍然具有挑战性,因为这需要同时推断部件几何形状和运动结构。我们提出了一种端到端的自回归框架,该框架直接从视觉观察中生成可执行的关节对象模型。给定图像和对象级的3D线索,我们的方法顺序生成部件几何形状及其相关的关节参数,从而生成完整的URDF模型,而无需依赖多阶段管道。生成过程持续进行,直到模型确定所有部件都已生成,自动推断完整的几何形状和运动学。基于这一能力,我们实现了一种新的真实跟随仿真(Real-Follow-Sim)范式,其中从视觉观察构建的高保真数字双胞胎使得在纯仿真中训练和测试的策略能够无缝转移到真实机器人上,而无需在线适应。在大规模关节对象基准测试和真实世界机器人任务中的实验表明,我们的方法在几何重建质量、关节参数准确性和物理可执行性方面优于先前的方法。
cs.RO / 37 / 2603.14056

Amortizing Trajectory Diffusion with Keyed Drift Fields

利用键控漂移场的轨迹扩散摊销
Puthumanaillam, Gokul, Ornik, Melkior
Abstract
Diffusion-based trajectory planners can synthesize rich, multimodal action sequences for offline reinforcement learning, but their iterative denoising incurs substantial inference-time cost, making closed-loop planning slow under tight compute budgets. We study the problem of achieving diffusion-like trajectory planning behavior with one-step inference, while retaining the ability to sample diverse candidate plans and condition on the current state in a receding-horizon control loop. Our key observation is that conditional trajectory generation fails under naïve distribution-matching objectives when the similarity measure used to align generated trajectories with the dataset is dominated by unconstrained future dimensions. In practice, this causes attraction toward average trajectories, collapses action diversity, and yields near-static behavior. Our key insight is that conditional generative planning requires a conditioning-aware notion of neighborhood: trajectory updates should be computed using distances in a compact key space that reflects the condition, while still applying updates in the full trajectory space. Building on this, we introduce Keyed Drifting Policies (KDP), a one-step trajectory generator trained with a drift-field objective that attracts generated trajectories toward condition-matched dataset windows and repels them from nearby generated samples, using a stop-gradient drifted target to amortize iterative refinement into training. At inference, the resulting policy produces a full trajectory window in a single forward pass. Across standard RL benchmarks and real-time hardware deployments, KDP achieves strong performance with one-step inference and substantially lower planning latency than diffusion sampling. Project website, code and videos: https://keyed-drifting.github.io/
Chinese Translation
基于扩散的轨迹规划器能够为离线强化学习合成丰富的多模态动作序列,但其迭代去噪过程在推理时会产生显著的成本,使得在紧张的计算预算下闭环规划变得缓慢。我们研究了在保持能够采样多样候选计划并在滚动时域控制循环中以当前状态为条件的同时,实现类似扩散的轨迹规划行为的问题。我们的关键观察是,当用于将生成的轨迹与数据集对齐的相似性度量被不受约束的未来维度主导时,条件轨迹生成在简单的分布匹配目标下会失败。在实践中,这导致了对平均轨迹的吸引,动作多样性的崩溃,并产生近乎静态的行为。我们的关键见解是,条件生成规划需要一种关注条件的邻域概念:轨迹更新应使用反映条件的紧凑键空间中的距离进行计算,同时仍在完整的轨迹空间中应用更新。在此基础上,我们引入了键控漂移策略(Keyed Drifting Policies, KDP),这是一种通过漂移场目标训练的一步轨迹生成器,该目标将生成的轨迹吸引到与条件匹配的数据集窗口,并将其从附近生成的样本中排斥,使用停止梯度漂移目标将迭代精炼摊销到训练中。在推理时,所得到的策略在一次前向传播中生成完整的轨迹窗口。在标准强化学习基准测试和实时硬件部署中,KDP在一步推理下实现了强大的性能,相较于扩散采样显著降低了规划延迟。项目网站、代码和视频:https://keyed-drifting.github.io/
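The key-space insight above can be illustrated with numpy: neighborhoods are measured only in the conditioned key dimensions (here, assumed to be the first few dimensions of each trajectory window), while the attract/repel drift is applied over the full window. Kernel bandwidth, weights, and the key choice are illustrative assumptions, not the paper's exact objective.

```python
# Sketch: compute drift with key-space distances, apply it in full space.
import numpy as np

def keyed_drift(gen, data, key_dims=4, bw=0.5, repel=0.1):
    """gen: (G, D) generated windows; data: (N, D) dataset windows."""
    k_gen, k_data = gen[:, :key_dims], data[:, :key_dims]
    # Attraction: soft nearest dataset windows, matched in key space only.
    w = np.exp(-np.linalg.norm(k_gen[:, None] - k_data[None], axis=-1) / bw)
    w /= w.sum(axis=1, keepdims=True)
    attract = w @ data - gen                       # update in full space
    # Repulsion: push generated samples away from key-space neighbors.
    wg = np.exp(-np.linalg.norm(k_gen[:, None] - k_gen[None], axis=-1) / bw)
    np.fill_diagonal(wg, 0.0)
    repel_vec = gen - (wg / (wg.sum(axis=1, keepdims=True) + 1e-8)) @ gen
    return attract + repel * repel_vec

gen = np.random.randn(8, 16)
data = np.random.randn(100, 16)
target = gen + keyed_drift(gen, data)  # stop-gradient drifted target
```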
cs.RO / 38 / 2603.14068

Stiffness Copilot: An Impedance Policy for Contact-Rich Teleoperation

刚度副驾驶:一种用于接触丰富的远程操作的阻抗策略
Wang, Yeping, Xu, Zhengtong, Preechayasomboon, Pornthep, Abbatematteo, Ben, Memar, Amirhossein H., Colonnese, Nick, Chan, Sonny
Abstract
In teleoperation of contact-rich manipulation tasks, selecting robot impedance is critical but difficult. The robot must be compliant to avoid damaging the environment, but stiff to remain responsive and to apply force when needed. In this paper, we present Stiffness Copilot, a vision-based policy for shared-control teleoperation in which the operator commands robot pose and the policy adjusts robot impedance online. To train Stiffness Copilot, we first infer direction-dependent stiffness matrices in simulation using privileged contact information. We then use these matrices to supervise a lightweight vision policy that predicts robot stiffness from wrist-camera images and transfers zero-shot to real images at runtime. In a human-subject study, Stiffness Copilot achieved safety comparable to using a constant low stiffness while matching the efficiency of using a constant high stiffness.
Chinese Translation
在接触丰富的操作任务的远程操作中,选择机器人阻抗至关重要但又困难。机器人必须具备柔顺性以避免对环境造成损害,但又需要具备刚性以保持响应能力并在必要时施加力量。本文提出了刚度副驾驶(Stiffness Copilot),这是一种基于视觉的共享控制远程操作策略,其中操作员指挥机器人姿态,而策略在线调整机器人阻抗。为了训练刚度副驾驶,我们首先在模拟环境中利用特权接触信息推导方向依赖的刚度矩阵。然后,我们使用这些矩阵来监督一个轻量级视觉策略,该策略从腕部摄像头图像中预测机器人刚度,并在运行时无缝转移到真实图像中。在一项人类受试者研究中,刚度副驾驶实现了与使用恒定低刚度相当的安全性,同时匹配了使用恒定高刚度的效率。
cs.RO / 39 / 2603.14104

GelSphere: An Omnidirectional Rolling Vision-Based Tactile Sensor for Online 3D Reconstruction and Normal Force Estimation

GelSphere:一种基于视觉的全向滚动触觉传感器,用于在线三维重建和法向力估计
Lee, Seoyeon, Mirzaee, Mohammad Amin, Yuan, Wenzhen
Abstract
We present GelSphere, a spherical vision-based tactile sensor designed for real-time continuous surface scanning. Unlike traditional vision-based tactile sensors that can only sense locally and are damaged when slid across surfaces, and cylindrical tactile sensors that can only roll along a fixed direction, our design enables omnidirectional rolling on surfaces. We accomplish this through our novel sensing system design, which has steel balls inside the sensor, forming a bearing layer between the gel and the rigid housing that allows rolling motion in all axes. The sensor streams tactile images through Wi-Fi, with online large-surface reconstruction capabilities. We present quantitative results for both reconstruction accuracy and image fusion performance. The results show that our sensor maintains geometric fidelity and high reconstruction accuracy even under multi-directional rolling, enabling uninterrupted surface scanning.
Chinese Translation
我们提出了GelSphere,一种设计用于实时连续表面扫描的球形基于视觉的触觉传感器。与传统的只能局部感知并在滑动表面时容易损坏的基于视觉的触觉传感器,以及只能沿固定方向滚动的圆柱形触觉传感器不同,我们的设计实现了在表面上的全向滚动。我们通过新颖的传感系统设计实现了这一点,该设计在传感器内部设置了钢球,形成了一个在胶体和刚性外壳之间的轴承层,使得在所有轴向上都可以实现滚动运动。传感器通过Wi-Fi实时传输触觉图像,并具备在线大面积重建的能力。我们展示了重建精度和图像融合性能的定量结果。结果表明,即使在多方向滚动的情况下,我们的传感器仍能保持几何保真度和高重建精度,从而实现不间断的表面扫描。
cs.RO / 40 / 2603.14109

H-RINS: Hierarchical Tightly-coupled Radar-Inertial Navigation via Smoothing and Mapping

H-RINS:通过平滑和映射实现分层紧耦合雷达-惯性导航
Abdulkarim, Ali Alridha, Litvinov, Mikhail, Tsetserukou, Dzmitry
Abstract
Millimeter-wave radar provides robust perception in visually degraded environments. However, radar-inertial state estimation is inherently susceptible to drift. Because radar yields only sparse, body-frame velocity measurements, it provides weak constraints on absolute orientation. Consequently, IMU biases remain poorly observable over the short time horizons typical of sliding-window filters. To address this fundamental observability challenge, we propose a tightly coupled, hierarchical radar-inertial factor graph framework. Our architecture decouples the estimation problem into a high-rate resetting graph and a persistent global graph. The resetting graph fuses IMU preintegration, radar velocities, and adaptive Zero-Velocity Updates (ZUPT) to generate the smooth, low-latency odometry required for real-time control. Concurrently, the persistent graph is a full-state factor graph maintaining the complete information of poses, velocities, and biases by fusing inertial data with keyframe-based geometric mapping and loop closures. Leveraging Incremental Smoothing and Mapping, the persistent graph can operate without explicit marginalization of variables, preserving their information while ensuring long-term bias observability. The cornerstone of our approach is a probabilistic tight-coupling mechanism: fully observable, optimized biases and their exact covariances are continuously injected from the persistent graph into the resetting graph's prior, effectively anchoring the high-rate estimator against integration drift. Extensive evaluations demonstrate our system achieves high accuracy with drift-reduced estimation at 27x real-time execution speeds. We release the implementation code and datasets upon the acceptance of the paper.
Chinese Translation
毫米波雷达在视觉退化环境中提供了强大的感知能力。然而,雷达-惯性状态估计本质上容易受到漂移的影响。由于雷达仅提供稀疏的机体框架速度测量,因此对绝对方向的约束较弱。因此,IMU(惯性测量单元)偏差在滑动窗口滤波器的短时间范围内难以被良好观测。为了解决这一基本的可观测性挑战,我们提出了一种紧耦合的分层雷达-惯性因子图框架。我们的架构将估计问题解耦为一个高频重置图和一个持久全局图。重置图融合了IMU预积分、雷达速度和自适应零速度更新(ZUPT),以生成实时控制所需的平滑、低延迟的里程计。同时,持久图是一个全状态因子图,通过将惯性数据与基于关键帧的几何映射和回环闭合融合,保持姿态、速度和偏差的完整信息。利用增量平滑和映射,持久图可以在不显式边缘化变量的情况下运行,保留其信息,同时确保长期偏差的可观测性。我们方法的基石是一个概率紧耦合机制:完全可观测的优化偏差及其精确协方差不断从持久图注入到重置图的先验中,有效地锚定高频估计器以抵抗积分漂移。广泛的评估表明,我们的系统在27倍实时执行速度下实现了高精度和降低漂移的估计。我们将在论文被接受后发布实现代码和数据集。
cs.RO / 41 / 2603.14156

TransCurriculum: Multi-Dimensional Curriculum Learning for Fast & Stable Locomotion

TransCurriculum:用于快速与稳定运动的多维课程学习
Mishra, Prakhar, Raj, Amir Hossain, Xiao, Xuesu, Manocha, Dinesh
Abstract
High-speed legged locomotion struggles with stability and transfer losses at higher command velocities during deployment. One reason is that most curricula vary difficulty along a single axis, for example increasing the range of command velocities, terrain difficulty, or domain parameters (e.g., friction or payload mass), using either a fixed update rule or instantaneous rewards while ignoring how the robot's training history has evolved. We propose TransCurriculum, a transformer-based multi-dimensional curriculum learning approach for agile quadrupedal locomotion. TransCurriculum adapts along three axes: velocity command targets, terrain difficulty, and domain randomization parameters (friction and payload mass). Rather than feeding task reward history directly into the low-level control policy, our formulation exploits it at the curriculum level. A transformer-based teacher retrieves the sequence of rewards and uses it to predict future rewards, success rate, and learning progress, guiding expansion of this multi-dimensional curriculum towards high-performing task bins. Finally, we validate our approach on the Unitree Go1 robot in simulation (Isaac Gym) and deploy it zero-shot on Go1 hardware. Our TransCurriculum policy achieves a maximum velocity of 6.3 m/s in simulation and outperforms prior curriculum baselines. We tested our TransCurriculum-trained policy on varied terrains (carpets, slopes, tiles, concrete), achieving a forward velocity of 4.1 m/s on carpet, surpassing the fastest curriculum baselines by 18.8% and reaching the highest zero-shot speed among all tested methods. Our multi-dimensional curriculum also reduces the transfer loss from 27% (for a command-only curriculum) to 18%, demonstrating the benefits of joint training over the velocity, terrain, and domain randomization dimensions while keeping a task success rate of 80-90% on rigid indoor and outdoor surfaces.
Chinese Translation
高速腿部运动在部署过程中面临着在较高指令速度下的稳定性和转移损失问题。其原因之一是大多数课程在单一轴线上变化难度,例如通过固定更新规则或瞬时奖励来增加指令速度范围、地形难度或领域参数(如摩擦或载荷质量),而忽视了机器人训练历史的演变。我们提出了TransCurriculum,一种基于变换器的多维课程学习方法,旨在实现灵活的四足运动。TransCurriculum适应于三个轴:速度指令目标、地形难度和领域随机化参数(摩擦和载荷质量)。我们的公式不是直接将任务奖励历史输入到低级控制策略中,而是在课程层面上利用它。基于变换器的教师检索奖励序列,并利用它来预测未来的奖励、成功率和学习进展,以指导这一多维课程向高性能任务区间的扩展。最后,我们在Unitree Go1机器人上进行了仿真验证(Isaac Gym),并在Go1硬件上实现了零样本部署。我们的TransCurriculum策略在仿真中达到了6.3 m/s的最大速度,超越了先前的课程基线。我们在各种地形(地毯、坡道、瓷砖、混凝土)上测试了TransCurriculum训练的策略,在地毯上实现了4.1 m/s的前进速度,比最快的课程方法快18.8%,并在所有测试方法中实现了最高的零样本速度。我们的多维课程还将转移损失从仅基于指令的课程的27%降低到18%,证明了在速度、地形和领域随机化维度上联合训练的优势,同时在刚性室内和室外表面保持80-90%的任务成功率。
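To make the multi-dimensional expansion idea concrete: task bins can be indexed along (velocity, terrain, domain) axes, a teacher scores each unlocked bin from its reward history, and neighbors of high-performing bins are unlocked. In the sketch below a moving average stands in for the paper's transformer teacher; bin counts and the threshold are illustrative assumptions.

```python
# Sketch of bin-indexed curriculum expansion along three axes.
import numpy as np

SHAPE = (5, 4, 3)                  # velocity x terrain x domain bins
unlocked = np.zeros(SHAPE, bool)
unlocked[0, 0, 0] = True           # start from the easiest bin
reward_hist = {idx: [] for idx in np.ndindex(SHAPE)}

def teacher_score(idx):
    h = reward_hist[idx]
    return np.mean(h[-20:]) if h else 0.0  # stand-in for the transformer

def expand(threshold=0.8):
    for idx in list(zip(*np.nonzero(unlocked))):
        if teacher_score(tuple(idx)) < threshold:
            continue
        for axis in range(3):              # unlock the next-harder bin
            nbr = list(idx)
            nbr[axis] += 1
            if nbr[axis] < SHAPE[axis]:
                unlocked[tuple(nbr)] = True

reward_hist[(0, 0, 0)] = [0.9] * 20
expand()
print(int(unlocked.sum()), "bins unlocked")
```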
cs.RO / 42 / 2603.14160

See, Learn, Assist: Safe and Self-Paced Robotic Rehabilitation via Video-Based Learning from Demonstration

观察、学习、辅助:通过基于视频的示范学习实现安全且自我节奏的机器人康复
Alabbas, Ali, Murgia, Camillo, Regan, Joanne, Long, Philip
Abstract
In this paper, we propose a novel framework that allows therapists to teach robot-assisted rehabilitation exercises remotely via RGB-D video. Our system encodes demonstrations as 6-DoF body-centric trajectories using Cartesian Dynamic Movement Primitives (DMPs), ensuring accurate posture-independent spatial generalization across diverse patient anatomies. Crucially, we execute these trajectories through a decoupled hybrid control architecture that constructs a spatially compliant virtual tunnel, paired with an effort-based temporal dilation mechanism. This architecture is applied to three distinct rehabilitation modalities: Passive, Active-Assisted, and Active-Resistive, by dynamically linking the exercise's execution phase to the patient's tangential force contribution. To guarantee safety, a Gaussian Mixture Regression (GMR) model is learned on-the-fly from the patient's own limb. This allows the detection of abnormal interaction forces and, if necessary, reverses the trajectory to prevent injury. Experimental validation demonstrates the system's precision, achieving an average trajectory reproduction error of 3.7cm and a range of motion (ROM) error of 5.5 degrees. Furthermore, dynamic interaction trials confirm that the controller successfully enforces effort-based progression while maintaining strict spatial path adherence against human disturbances.
Chinese Translation
本文提出了一种新颖的框架,使治疗师能够通过RGB-D视频远程教授机器人辅助的康复运动。我们的系统使用笛卡尔动态运动原型(DMPs)将示范编码为6自由度的以身体为中心的轨迹,确保在不同患者解剖结构下的准确姿势无关空间泛化。关键是,我们通过一个解耦的混合控制架构执行这些轨迹,该架构构建了一个空间合规的虚拟隧道,并配备了一种基于努力的时间扩展机制。该架构应用于三种不同的康复模式:被动、主动辅助和主动抵抗,通过动态连接运动的执行阶段与患者的切向力贡献。为了确保安全,从患者自身肢体上实时学习高斯混合回归(GMR)模型。这使得能够检测异常的交互力,并在必要时逆转轨迹以防止受伤。实验验证表明该系统的精确性,平均轨迹重现误差为3.7厘米,运动范围(ROM)误差为5.5度。此外,动态交互试验确认控制器成功地在保持严格的空间路径遵循的同时,实施基于努力的进展,抵御人类干扰。
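As background for the encoding step above, here is a standard one-dimensional discrete DMP rollout, the building block behind Cartesian DMPs (one such system per task-space dimension; orientation needs a separate formulation the paper would add). Gains and the empty forcing term are illustrative; with f(x) = 0 the system reduces to a goal-converging spring-damper.

```python
# Minimal 1-D discrete Dynamic Movement Primitive rollout.
import numpy as np

def dmp_rollout(y0, g, tau=1.0, dt=0.01, alpha=25.0, beta=25.0 / 4,
                forcing=lambda x: 0.0):
    y, dy, x = y0, 0.0, 1.0
    path = [y]
    for _ in range(int(tau / dt)):
        # Transformation system: tau^2 * ddy = alpha(beta(g-y) - tau*dy) + f(x)
        ddy = (alpha * (beta * (g - y) - tau * dy) + forcing(x)) / tau**2
        dy += ddy * dt
        y += dy * dt
        x += (-2.0 * x / tau) * dt   # canonical system, decay rate 2
        path.append(y)
    return np.array(path)

traj = dmp_rollout(y0=0.0, g=0.3)    # converges from 0 to the 0.3 goal
print(traj[-1])
```

Because the goal g is a free parameter, the same learned forcing term generalizes a demonstrated motion across patient anatomies, which is the property the abstract relies on.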
cs.RO / 43 / 2603.14182

Towards Equitable Robotic Furnishing Agents for Aging-in-Place: ADL-Grounded Design Exploration

面向居家养老的公平机器人家具代理:基于日常生活活动的设计探索
Lee, Hansoo, Seo, Changhee, Park, Subin, Kwak, Sonya S.
Abstract
In aging-in-place contexts, small difficulties in Activities of Daily Living (ADL) can accumulate, affecting well-being through fatigue, anxiety, reduced autonomy, and safety risks. This position paper argues that robotics for older adult wellbeing must move beyond "convenience features" and centre equity, justice, and responsibility. We conducted ADL-grounded semi-structured interviews with four adults in their 70s-80s, identifying recurrent challenges (finding/organising items, taking medication, and transporting objects) and deriving requirements to reduce compounded cognitive-physical burden. Based on these insights, we propose an in-home robotic furnishing-agent concept leveraging computer vision, generative AI, and LLMs for natural-language interaction, context-aware reminders, safe actuation, and user-centred transparency. We then report video-stimulated follow-up interviews with the same participants, highlighting preferences for confirmation before actuation, predictability, adjustable speed/autonomy, and multimodal feedback, as well as equity-related concerns. We conclude with open questions on evaluating and deploying equitable robotic wellbeing systems in real homes.
Chinese Translation
在居家养老的背景下,日常生活活动(ADL)中的小困难可能会积累,影响老年人的福祉,导致疲劳、焦虑、降低自主性和安全风险。本文认为,针对老年人福祉的机器人技术必须超越“便利功能”,以公平、正义和责任为中心。我们对四位70至80岁的成年人进行了基于ADL的半结构化访谈,识别出反复出现的挑战(寻找/组织物品、服药和运输物品),并提出减少认知-身体负担的需求。基于这些见解,我们提出了一种居家机器人家具代理的概念,利用计算机视觉和生成式人工智能(Generative AI)及大语言模型(LLMs)进行自然语言交互、上下文感知提醒、安全执行和以用户为中心的透明性。随后,我们报告了与同一参与者进行的视频刺激后续访谈,强调了他们对执行前确认、可预测性、可调节速度/自主性和多模态反馈的偏好,以及与公平相关的担忧。最后,我们提出了关于在真实家庭中评估和部署公平机器人福祉系统的开放性问题。
cs.RO / 44 / 2603.14216

Navigation beyond Wayfinding: Robots Collaborating with Visually Impaired Users for Environmental Interactions

超越路径寻找的导航:机器人与视觉障碍用户协作进行环境交互
Cai, Shaojun, Janaka, Nuwan, Ram, Ashwin, Shehan, Janidu, Wan, Yingjia, Hara, Kotaro, Hsu, David
Abstract
Robotic guidance systems have shown promise in supporting blind and visually impaired (BVI) individuals with wayfinding and obstacle avoidance. However, most existing systems assume a clear path and do not support a critical aspect of navigation - environmental interactions that require manipulating objects to enable movement. These interactions are challenging for a human-robot pair because they demand (i) precise localization and manipulation of interaction targets (e.g., pressing elevator buttons) and (ii) dynamic coordination between the user's and robot's movements (e.g., pulling out a chair to sit). We present a collaborative human-robot approach that combines our robotic guide dog's precise sensing and localization capabilities with the user's ability to perform physical manipulation. The system alternates between two modes: lead mode, where the robot detects and guides the user to the target, and adaptation mode, where the robot adjusts its motion as the user interacts with the environment (e.g., opening a door). Evaluation results show that our system enables navigation that is safer, smoother, and more efficient than both a traditional white cane and a non-adaptive guiding system, with the performance gap widening as tasks demand higher precision in locating interaction targets. These findings highlight the promise of human-robot collaboration in advancing assistive technologies toward more generalizable and realistic navigation support.
Chinese Translation
机器人引导系统在支持盲人和视觉障碍(BVI)个体进行路径寻找和避障方面显示出良好的前景。然而,现有大多数系统假设存在清晰的路径,并未支持导航的一个关键方面——需要操控物体以实现移动的环境交互。这些交互对于人机配对来说具有挑战性,因为它们要求(i)对交互目标(例如,按电梯按钮)的精确定位和操控,以及(ii)用户与机器人动作之间的动态协调(例如,拉出椅子以坐下)。我们提出了一种人机协作方法,将我们的机器人导盲犬的精确感知和定位能力与用户的物理操控能力相结合。该系统在两种模式之间交替:引导模式,机器人检测并引导用户到达目标;适应模式,机器人在用户与环境交互时调整其运动(例如,打开门)。评估结果表明,我们的系统实现的导航比传统的白手杖和非适应性引导系统更安全、更顺畅且更高效,随着任务对交互目标定位精度要求的提高,性能差距进一步扩大。这些发现凸显了人机协作在推动辅助技术向更具普适性和现实性的导航支持发展的潜力。
cs.RO / 45 / 2603.14221

A Real-Time Neuro-Symbolic Ethical Governor for Safe Decision Control in Autonomous Robotic Manipulation

用于自主机器人操作安全决策控制的实时神经符号伦理治理器
Aueawatthanaphisut, Aueaphum, Aueawatthanaphisut, Kuepon
Abstract
Ethical decision governance has become a critical requirement for autonomous robotic systems operating in human-centered and safety-sensitive environments. This paper presents a real-time neuro-symbolic ethical governor designed to enable risk-aware supervisory control in autonomous robotic manipulation tasks. The proposed framework integrates transformer-based ethical reasoning with a probabilistic ethical risk field formulation and a threshold-based override control mechanism. A language-grounded ethical intent inference capability is learned from natural language task descriptions using a fine-tuned DistilBERT model trained on the ETHICS commonsense dataset. A continuous ethical risk metric is subsequently derived from predicted unsafe action probability, confidence uncertainty, and probabilistic variance to support adaptive decision filtering. The effectiveness of the proposed approach is validated through simulated autonomous robot-arm task scenarios involving varying levels of human proximity and operational hazard. Experimental results demonstrate stable model convergence, reliable ethical risk discrimination, and improved safety-aware decision outcomes without significant degradation of task execution efficiency. The proposed neuro-symbolic architecture further provides enhanced interpretability compared with purely data-driven safety filters, enabling transparent ethical reasoning in real-time control loops. The findings suggest that ethical decision governance can be effectively modeled as a dynamic supervisory risk layer for autonomous robotic systems, with potential applicability to broader cyber-physical and assistive robotics domains.
Chinese Translation
伦理决策治理已成为在以人为中心和安全敏感环境中运行的自主机器人系统的关键需求。本文提出了一种实时神经符号伦理治理器,旨在实现自主机器人操作任务中的风险感知监督控制。所提出的框架将基于变换器的伦理推理与概率伦理风险场的构造以及基于阈值的覆盖控制机制相结合。通过使用在ETHICS常识数据集上训练的微调DistilBERT模型,从自然语言任务描述中学习语言基础的伦理意图推断能力。随后,从预测的不安全行为概率、置信不确定性和概率方差中导出连续的伦理风险指标,以支持自适应决策过滤。通过模拟自主机器人臂任务场景,验证了所提方法的有效性,这些场景涉及不同程度的人类接近和操作风险。实验结果表明,模型收敛稳定,伦理风险区分可靠,并且在不显著降低任务执行效率的情况下改善了安全感知的决策结果。与纯数据驱动的安全过滤器相比,所提出的神经符号架构提供了更好的可解释性,使实时控制环路中的伦理推理更加透明。研究结果表明,伦理决策治理可以有效地建模为自主机器人系统的动态监督风险层,并具有在更广泛的网络物理和辅助机器人领域应用的潜力。
cs.RO / 46 / 2603.14236

AeroGen: Agentic Drone Autonomy through Single-Shot Structured Prompting & Drone SDK

AeroGen:通过单次结构化提示和无人机SDK实现自主无人机的智能化
Astu, Kautuk, Simmhan, Yogesh
Abstract
Designing correct UAV autonomy programs is challenging due to joint navigation, sensing and analytics requirements. While LLMs can generate code, their reliability for safety-critical UAVs remains uncertain. This paper presents AeroGen, an open-loop framework that enables consistently correct single-shot AI-generated drone control programs through structured guardrail prompting and integration with the AeroDaaS drone SDK. AeroGen encodes API descriptions, flight constraints and operational world rules directly into the system context prompt, enabling generic LLMs to produce constraint-aware code from user prompts, with minimal example code. We evaluate AeroGen across a diverse benchmark of 20 navigation tasks and 5 drone missions on urban, farm and inspection environments, using both imperative and declarative user prompts. AeroGen generates about 40 lines of AeroDaaS Python code in about 20s per mission, in both real-world and simulations, showing that structured prompting with a well-defined SDK improves robustness, correctness and deployability of LLM-generated drone autonomy programs.
Chinese Translation
设计正确的无人机自主程序面临挑战,因为需要联合导航、感知和分析。尽管大型语言模型(LLMs)能够生成代码,但其在安全关键的无人机上的可靠性仍然不确定。本文提出了AeroGen,一个开环框架,通过结构化的护栏提示以及与AeroDaaS无人机SDK的集成,能够单次生成始终正确的AI无人机控制程序。AeroGen将API描述、飞行约束和操作世界规则直接编码到系统上下文提示中,使得通用的LLMs能够从用户提示中生成考虑约束的代码,且只需最少的示例代码。我们在多样化的基准测试中评估了AeroGen,涵盖20个导航任务和5个无人机任务,涉及城市、农田和检查环境,使用了命令式和声明式的用户提示。AeroGen在每个任务中生成约40行AeroDaaS Python代码,耗时约20秒,无论是在真实世界还是模拟环境中,结果表明,使用结构化提示和定义明确的SDK能够提高LLM生成的无人机自主程序的鲁棒性、正确性和可部署性。
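The structured guardrail prompting described above amounts to packing the API description, flight constraints, and world rules into one system prompt so a generic LLM emits constraint-aware SDK code. The sketch below shows that assembly; the rule strings and the SDK call names (takeoff, goto, land) are invented placeholders, not the actual AeroDaaS API.

```python
# Sketch: assemble a guardrail system prompt from structured parts.
API_DOC = """
takeoff(alt_m): climb to alt_m and hover
goto(lat, lon, alt_m): fly to a waypoint
land(): descend and disarm
"""

CONSTRAINTS = [
    "Never command altitude above 120 m.",
    "Always call takeoff() before any goto().",
    "Always end the mission with land().",
]

WORLD_RULES = [
    "Operating area: urban; keep 30 m lateral clearance from buildings.",
]

def build_system_prompt() -> str:
    parts = ["You generate Python drone missions using only this SDK:",
             API_DOC, "Hard flight constraints:"]
    parts += [f"- {c}" for c in CONSTRAINTS]
    parts += ["Operational world rules:"] + [f"- {r}" for r in WORLD_RULES]
    parts.append("Output a single runnable Python script, no commentary.")
    return "\n".join(parts)

print(build_system_prompt())
```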
cs.RO / 47 / 2603.14244

Design of a Bio-Inspired Miniature Submarine for Low-Cost Water Quality Monitoring

一种仿生微型潜艇的设计用于低成本水质监测
Vu, Quang Huy, Le, Quan, Phung, Manh Duong
Abstract
Water quality monitoring is essential for protecting aquatic ecosystems and detecting environmental pollution. This paper presents the design and experimental validation of a bio-inspired miniature submarine for low-cost water quality monitoring. Inspired by the jet propulsion mechanism of squids, the proposed system employs pump-driven water jets for propulsion and steering, combined with a pump-based buoyancy control mechanism that enables both depth regulation and water sampling. The vehicle integrates low-cost, commercially available components including an ESP32 microcontroller, IMU, pressure sensor, GPS receiver, and LoRa communication module. The complete system can be constructed at a hardware cost of approximately $122.5, making it suitable for educational and environmental monitoring applications. Experimental validation was conducted through pool tests and field trials in a lake. During a 360 degrees rotation test, roll and pitch deviations remained within +/-2 degrees and +/-1.5 degrees, respectively, demonstrating stable attitude control. Steering experiments showed a heading step response with approximately 2 s rise time and 5 s settling time. Depth control experiments achieved a target depth of 2.5 m with steady-state error within +/-0.1 m. Field experiments further demonstrated reliable navigation and successful water sampling operations. The results confirm that the proposed platform provides a compact, stable, and cost-effective solution for small-scale aquatic environmental monitoring.
Chinese Translation
水质监测对于保护水生生态系统和检测环境污染至关重要。本文提出了一种仿生微型潜艇的设计及其实验验证,旨在实现低成本的水质监测。该系统受到鱿鱼喷射推进机制的启发,采用泵驱动水射流进行推进和转向,并结合泵基浮力控制机制,实现深度调节和水样采集。该潜艇集成了低成本的商业可用组件,包括ESP32微控制器、惯性测量单元(IMU)、压力传感器、GPS接收器和LoRa通信模块。整个系统的硬件成本约为122.5美元,适合用于教育和环境监测应用。通过水池测试和湖泊实地试验进行了实验验证。在360度旋转测试中,横滚和俯仰偏差分别保持在+/-2度和+/-1.5度以内,证明了稳定的姿态控制。转向实验显示,航向阶跃响应的上升时间约为2秒,稳定时间为5秒。深度控制实验实现了目标深度2.5米,稳态误差在+/-0.1米以内。实地实验进一步证明了可靠的导航能力和成功的水样采集操作。结果确认所提出的平台为小规模水域环境监测提供了一种紧凑、稳定且具有成本效益的解决方案。
cs.RO / 48 / 2603.14308

Load-Aware Locomotion Control for Humanoid Robots in Industrial Transportation Tasks

工业运输任务中基于负载感知的类人机器人运动控制
Fu, Lequn, Zhong, Yijun, Li, Xiao, Liu, Yibin, Xu, Zhiyuan, Tang, Jian, Li, Shiqi
Abstract
Humanoid robots deployed in industrial environments are required to perform load-carrying transportation tasks that tightly couple locomotion and manipulation. However, achieving stable and robust locomotion under varying payloads and upper-body motions is challenging due to dynamic coupling and partial observability. This paper presents a load-aware locomotion framework for industrial humanoids based on a decoupled yet coordinated loco-manipulation architecture. Lower-body locomotion is controlled via a reinforcement learning policy producing residual joint actions on kinematically derived nominal configurations. A kinematics-based locomotion reference with a height-conditioned joint-space offset guides learning, while a history-based state estimator infers base linear velocity and height and encodes residual load- and manipulation-induced disturbances in a compact latent representation. The framework is trained entirely in simulation and deployed on a full-size humanoid robot without fine-tuning. Simulation and real-world experiments demonstrate faster training, accurate height tracking, and stable loco-manipulation. Project page: https://lequn-f.github.io/LALO/
Chinese Translation
在工业环境中部署的类人机器人需要执行负载运输任务,这些任务紧密结合了运动和操作。然而,由于动态耦合和部分可观测性,在不同负载和上半身运动下实现稳定和鲁棒的运动控制具有挑战性。本文提出了一种基于解耦但协调的运动-操作架构的工业类人机器人负载感知运动框架。下肢运动通过强化学习策略进行控制,该策略在运动学推导的名义配置上产生残余关节动作。基于运动学的运动参考与高度条件的关节空间偏移指导学习,同时基于历史的状态估计器推断基础线速度和高度,并在紧凑的潜在表示中编码残余负载和操作引起的扰动。该框架完全在仿真中训练,并在全尺寸类人机器人上部署,无需微调。仿真和现实世界实验表明,训练速度更快、高度跟踪准确以及运动-操作稳定性。项目页面:https://lequn-f.github.io/LALO/
cs.RO / 49 / 2603.14327

OmniClone: Engineering a Robust, All-Rounder Whole-Body Humanoid Teleoperation System

OmniClone:构建一个稳健的全能型全身类人机器人遥操作系统
Li, Yixuan, Ma, Le, Lin, Yutang, Du, Yushi, Liu, Mengya, Hu, Kaizhe, Cui, Jieming, Zhu, Yixin, Liang, Wei, Jia, Baoxiong, Huang, Siyuan
Abstract
Whole-body humanoid teleoperation enables humans to remotely control humanoid robots, serving as both a real-time operational tool and a scalable engine for collecting demonstrations for autonomous learning. Despite recent advances, existing systems are validated using aggregate metrics that conflate distinct motion regimes, masking critical failure modes. This lack of diagnostic granularity, compounded by tightly coupled and labor-intensive system configurations, hinders robust real-world deployment. A key open challenge is building a teleoperation system that is simultaneously robust, versatile, and affordable for practical use. Here we present OmniClone, a whole-body humanoid teleoperation system that achieves high-fidelity, multi-skill control on a single consumer GPU with modest data requirements. Central to our approach is OmniBench, a diagnostic benchmark that evaluates policies across stratified motion categories and difficulty levels on unseen motions, exposing the narrow specialization of prior systems. Guided by these diagnostics, we identify an optimized training data recipe and integrate system-level improvements: subject-agnostic retargeting and robust communication, that collectively reduce Mean Per-Joint Position Error (MPJPE) by over 66% while requiring orders-of-magnitude fewer computational resources than comparable methods. Crucially, OmniClone is control-source-agnostic: a single unified policy supports real-time teleoperation, generated motion playback, and Vision-Language-Action (VLA) models, while generalizing across operators of vastly different body proportions. By uniting diagnostic evaluation with practical engineering, OmniClone provides an accessible foundation for scalable humanoid teleoperation and autonomous learning.
Chinese Translation
全身类人机器人遥操作使人类能够远程控制类人机器人,既作为实时操作工具,又作为收集自主学习演示的可扩展引擎。尽管最近取得了进展,现有系统仍然使用聚合指标进行验证,这些指标混淆了不同的运动模式,掩盖了关键的故障模式。这种缺乏诊断细粒度的问题,加上紧密耦合和劳动密集型的系统配置,阻碍了稳健的现实世界部署。一个关键的开放挑战是构建一个同时稳健、多功能且经济实用的遥操作系统。在此,我们提出了OmniClone,一个全身类人机器人遥操作系统,它在单个消费级GPU上实现了高保真度的多技能控制,并且数据需求适中。我们方法的核心是OmniBench,一个诊断基准,评估在未见运动上的策略,涵盖分层运动类别和难度水平,揭示了先前系统的狭窄专业化。在这些诊断的指导下,我们确定了优化的训练数据配方,并整合了系统级改进:与主体无关的重定向和稳健的通信,这共同将每个关节位置误差的平均值(Mean Per-Joint Position Error, MPJPE)降低了超过66%,同时所需的计算资源比可比方法少几个数量级。至关重要的是,OmniClone是控制源无关的:一个统一的策略支持实时遥操作、生成的运动播放和视觉-语言-动作(Vision-Language-Action, VLA)模型,同时在不同体型的操作员之间实现泛化。通过将诊断评估与实际工程相结合,OmniClone为可扩展的类人机器人遥操作和自主学习提供了一个可访问的基础。
cs.RO / 50 / 2603.14333

Data-Driven Physics Embedded Dynamics with Predictive Control and Reinforcement Learning for Quadrupeds

基于数据驱动的物理嵌入动态与预测控制和强化学习结合的四足机器人
Kotecha, Prakrut, Shirwatkar, Aditya, Kolathaya, Shishir
Abstract
State-of-the-art quadrupedal locomotion approaches integrate Model Predictive Control (MPC) with Reinforcement Learning (RL), enabling complex motion capabilities with planning and terrain-adaptive behaviors. However, they often face compounding errors over long horizons and have limited interpretability due to the absence of physical inductive biases. We address these issues by integrating Lagrangian Neural Networks (LNNs) into an RL-MPC framework, enabling physically consistent dynamics learning. At deployment, our inverse-dynamics, infinite-horizon MPC scheme avoids costly matrix inversions, improving computational efficiency by up to 4x with minimal loss of task performance. We validate our framework through multiple ablations of the proposed LNN and its variants. We show improved sample efficiency, reduced long-horizon error, and faster real-time planning compared to unstructured neural dynamics. Lastly, we also test our framework on the Unitree Go1 robot to show real-world viability.
Chinese Translation
最先进的四足运动方法将模型预测控制 (MPC) 与强化学习 (RL) 相结合,使得具备复杂运动能力以及规划和地形自适应行为。然而,这些方法通常在长时域上面临累积误差,并且由于缺乏物理归纳偏置而导致可解释性有限。我们通过将拉格朗日神经网络 (LNN) 集成到RL MPC框架中来解决这些问题,从而实现物理一致的动态学习。在部署阶段,我们的逆动力学无限时域 MPC 方案避免了成本高昂的矩阵求逆,提高了计算效率,最多可提升 4 倍,同时任务性能损失最小。我们通过对所提 LNN 及其变体进行多次消融实验来验证我们的框架。与非结构化神经动态相比,我们展示了改进的样本效率、更低的长时域误差以及更快的实时规划。最后,我们还在Unitree Go1机器人上测试了我们的框架,以展示其在现实世界中的可行性。
cs.RO / 51 / 2603.14345

VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion

VIP-Loco:一种针对腿部运动的视觉引导无限视野规划框架
Shirwatkar, Aditya, Gupta, Satyam, Kolathaya, Shishir
Abstract
Perceptive locomotion for legged robots requires anticipating and adapting to complex, dynamic environments. Model Predictive Control (MPC) serves as a strong baseline, providing interpretable motion planning with constraint enforcement, but struggles with high-dimensional perceptual inputs and rapidly changing terrain. In contrast, model-free Reinforcement Learning (RL) adapts well across visually challenging scenarios but lacks planning. To bridge this gap, we propose VIP-Loco, a framework that integrates vision-based scene understanding with RL and planning. During training, an internal model maps proprioceptive states and depth images into compact kinodynamic features used by the RL policy. At deployment, the learned models are used within an infinite-horizon MPC formulation, combining adaptability with structured planning. We validate VIP-Loco in simulation on challenging locomotion tasks, including slopes, stairs, crawling, tilting, gap jumping, and climbing, across three robot morphologies: a quadruped (Unitree Go1), a biped (Cassie), and a wheeled-biped (TronA1-W). Through ablations and comparisons with state-of-the-art methods, we show that VIP-Loco unifies planning and perception, enabling robust, interpretable locomotion in diverse environments.
Chinese Translation
腿部机器人感知运动需要预判并适应复杂的动态环境。模型预测控制(Model Predictive Control, MPC)作为一个强有力的基线,提供了可解释的运动规划和约束执行,但在处理高维感知输入和快速变化的地形时表现不佳。相比之下,无模型强化学习(Reinforcement Learning, RL)在视觉挑战场景中适应性较强,但缺乏规划能力。为了解决这一问题,我们提出了VIP-Loco,一个将基于视觉的场景理解与RL和规划相结合的框架。在训练过程中,内部模型将本体感知状态和深度图像映射为紧凑的运动动力学特征,这些特征被RL策略使用。在部署阶段,学习到的模型被应用于无限视野的MPC公式中,结合了适应性与结构化规划。我们在仿真中验证了VIP-Loco在具有挑战性的运动任务上的表现,包括坡道、楼梯、爬行、倾斜、跨越间隙和攀爬,涉及三种机器人形态:四足机器人(Unitree Go1)、双足机器人(Cassie)和轮式双足机器人(TronA1-W)。通过消融实验和与最先进方法的比较,我们展示了VIP-Loco统一了规划与感知,使得在多样环境中实现稳健且可解释的运动成为可能。
cs.RO / 52 / 2603.14371

OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

OxyGen:在多任务并行下的视觉-语言-动作模型统一KV缓存管理
Li, Xiangyu, Tang, Huaizhi, Ding, Xin, Wang, Weijun, Cao, Ting, Liu, Yunxin
Abstract
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for $\pi_{0.5}$, the most popular MoT VLA, and evaluate under representative robotic configurations. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
Chinese Translation
具身人工智能代理越来越需要在不同时间约束下,从共享观察中并行执行多个任务,如操作、对话和记忆构建。最近的混合变换器(Mixture-of-Transformers, MoT)视觉-语言-动作模型(Vision-Language-Action Models, VLAs)在架构上支持这种异构输出,但现有的推理系统由于冗余计算和资源争用,未能实现高效的多任务并行,尤其在设备端部署时。我们识别出孤立的KV缓存管理是根本原因。为了解决这个问题,我们提出了统一的KV缓存管理,这是一种将KV缓存视为跨任务和时间的第一类共享资源的推理范式。这一抽象使得两个关键优化成为可能:跨任务的KV共享消除了对共享观察的冗余预填充,而跨帧连续批处理则将可变长度的语言解码与固定速率的动作生成在控制周期中解耦。我们为最流行的MoT VLA $\pi_{0.5}$ 实现了这一范式,并在具有代表性的机器人配置下进行了评估。OxyGen在孤立执行的基础上实现了最高3.7倍的加速,同时提供超过200个tokens/s的语言吞吐量和70 Hz的动作频率,而不降低动作质量。
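The cross-task KV sharing described above can be illustrated with a toy structure: the shared observation prefix is prefilled once, and every task reuses those cached entries instead of recomputing them. Cache layout, shapes, and the encoder/decoder stand-ins below are illustrative assumptions, not the paper's implementation.

```python
# Toy shared-prefix cache: prefill once, reuse across task heads.
import torch

class SharedPrefixCache:
    def __init__(self):
        self.prefix_kv = None          # filled once per new observation

    def prefill(self, obs_tokens, encoder):
        self.prefix_kv = encoder(obs_tokens)  # (T, d) KV stand-in

    def run_task(self, task_tokens, decoder):
        assert self.prefix_kv is not None, "prefill shared observation first"
        # Each task attends over [shared prefix ; its own tokens].
        ctx = torch.cat([self.prefix_kv, task_tokens], dim=0)
        return decoder(ctx)

encoder = torch.nn.Linear(16, 16)
decoder = lambda ctx: ctx.mean(dim=0)

cache = SharedPrefixCache()
cache.prefill(torch.randn(32, 16), encoder)            # one prefill...
action = cache.run_task(torch.randn(4, 16), decoder)   # ...many consumers
language = cache.run_task(torch.randn(8, 16), decoder)
```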
cs.RO / 53 / 2603.14393

From Scanning Guidelines to Action: A Robotic Ultrasound Agent with LLM-Based Reasoning

从扫描指南到行动:基于大型语言模型的机器人超声代理
Bi, Yuan, Zhou, Yiping, Liu, Pei, Li, Feng, Jiang, Zhongliang, Navab, Nassir
Abstract
Robotic ultrasound offers advantages over free-hand scanning, including improved reproducibility and reduced operator dependency. In clinical practice, US acquisition relies heavily on the sonographer's experience and situational judgment. When transferring this process to robotic systems, such expertise is often encoded explicitly through fixed procedures and task-specific models, yielding pipelines that can be difficult to adapt to new scanning tasks. In this work, we propose a unified framework for autonomous robotic US scanning that leverages an LLM-based agent to interpret US scanning guidelines and execute scans by dynamically invoking a set of provided software tools. Instead of encoding fixed scanning procedures, the LLM agent retrieves and reasons over guideline steps from scanning handbooks and adapts its planning decisions based on observations and the current scanning state. This enables the system to handle variable and decision-dependent workflows, such as adjusting scanning strategies, repeating steps, or selecting the appropriate next tool call in response to image quality or anatomical findings. Because the reasoning underlying tool selection is also critical for transparent and trustworthy planning, we further fine-tune the LLM agent using an RL-based strategy to improve both its reasoning quality and the correctness of tool selection and parameterization, while maintaining robust generalization to unseen guidelines and related tasks. We first validate the approach via verbal execution on 10 US scanning guidelines, assessing reasoning as well as tool selection and parameterization, and showing the benefit of RL fine-tuning. We then demonstrate real-world feasibility on robotic scanning of the gallbladder, spine, and kidney. Overall, the framework follows diverse guidelines and enables reliable autonomous scanning across multiple anatomical targets within a unified system.
Chinese Translation
机器人超声相较于手动扫描具有更好的可重复性和降低操作员依赖性的优势。在临床实践中,超声图像采集在很大程度上依赖于超声技师的经验和情境判断。当将这一过程转移到机器人系统时,这种专业知识通常通过固定程序和特定任务模型显式编码,从而产生难以适应新扫描任务的流程。在本研究中,我们提出了一个统一的自主机器人超声扫描框架,该框架利用基于大型语言模型(LLM)的代理来解释超声扫描指南,并通过动态调用一组提供的软件工具来执行扫描。与其编码固定的扫描程序,LLM代理从扫描手册中检索和推理指南步骤,并根据观察结果和当前扫描状态调整其规划决策。这使得系统能够处理可变和依赖决策的工作流程,例如调整扫描策略、重复步骤或根据图像质量或解剖发现选择适当的下一个工具调用。由于工具选择背后的推理对于透明和可信的规划至关重要,我们进一步使用基于强化学习(RL)的策略对LLM代理进行微调,以提高其推理质量以及工具选择和参数设置的正确性,同时保持对未见指南和相关任务的强健泛化。我们首先通过对10个超声扫描指南的口头执行来验证该方法,评估推理、工具选择和参数设置,并展示RL微调的优势。然后,我们在机器人扫描胆囊、脊柱和肾脏的实际应用中展示了其可行性。总体而言,该框架遵循多样化的指南,并在统一系统内实现可靠的自主扫描,覆盖多个解剖目标。
cs.RO / 54 / 2603.14397

eNavi: Event-based Imitation Policies for Low-Light Indoor Mobile Robot Navigation

eNavi:基于事件的低光照室内移动机器人导航模仿策略
Ramesh, Prithvi Jai, Chanda, Kaustav, Vinod, Krishna, Vishal, Joseph Raj, Yang, Yezhou, Chakravarthi, Bharatesh
Abstract
Event cameras provide high dynamic range and microsecond-level temporal resolution, making them well-suited for indoor robot navigation, where conventional RGB cameras degrade under fast motion or low-light conditions. Despite advances in event-based perception spanning detection, SLAM, and pose estimation, there remains limited research on end-to-end control policies that exploit the asynchronous nature of event streams. To address this gap, we introduce a real-world indoor person-following dataset collected using a TurtleBot 2 robot, featuring synchronized raw event streams, RGB frames, and expert control actions across multiple indoor maps and trajectories, under both normal and low-light conditions. We further build a multimodal data preprocessing pipeline that temporally aligns event and RGB observations while reconstructing ground-truth actions from odometry to support high-quality imitation learning. Building on this dataset, we propose a late-fusion RGB-Event navigation policy that combines dual MobileNet encoders with a transformer-based fusion module trained via behavioral cloning. A systematic evaluation of RGB-only, Event-only, and RGB-Event fusion models across 12 training variations, ranging from single-path imitation to general multi-path imitation, shows that policies incorporating event data, particularly the fusion model, achieve improved robustness and lower action prediction error, especially in unseen low-light conditions where RGB-only models fail. We release the dataset, synchronization pipeline, and trained models at https://eventbasedvision.github.io/eNavi/
Chinese Translation
事件相机提供高动态范围和微秒级的时间分辨率,使其非常适合用于室内机器人导航,而传统的RGB相机在快速运动或低光照条件下表现不佳。尽管在基于事件的感知方面取得了检测、SLAM和姿态估计等进展,但针对利用事件流异步特性的端到端控制策略的研究仍然有限。为了解决这一问题,我们引入了一个真实世界的室内跟随人数据集,该数据集使用TurtleBot 2机器人收集,包含多个室内地图和轨迹下的同步原始事件流、RGB帧和专家控制动作,涵盖正常和低光照条件。我们进一步构建了一个多模态数据预处理管道,该管道在时间上对齐事件和RGB观测,同时从里程计重建真实动作,以支持高质量的模仿学习。在此数据集的基础上,我们提出了一种后融合RGB-事件导航策略,该策略结合了双MobileNet编码器和基于变换器的融合模块,通过行为克隆进行训练。对RGB-only、Event-only和RGB-Event融合模型在12种训练变体下的系统评估,涵盖从单路径模仿到一般多路径模仿,显示出包含事件数据的策略,特别是融合模型,在未见的低光照条件下表现出更好的鲁棒性和更低的动作预测误差,而RGB-only模型则失败。我们在https://eventbasedvision.github.io/eNavi/上发布了数据集、同步管道和训练模型。
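A minimal sketch of the late-fusion architecture described above: two per-modality encoders (standing in for the dual MobileNets) produce tokens that a small transformer fuses before an action head. Dimensions, layer counts, and the event-frame representation are illustrative assumptions.

```python
# Sketch of a late-fusion RGB + event policy with a transformer fuser.
import torch
import torch.nn as nn

class LateFusionPolicy(nn.Module):
    def __init__(self, d=64, n_actions=2):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d))
        self.evt_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_actions)  # (linear vel, angular vel)

    def forward(self, rgb, evt):
        # One token per modality; fusion happens late, after encoding.
        tokens = torch.stack([self.rgb_enc(rgb), self.evt_enc(evt)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)
        return self.head(fused)

policy = LateFusionPolicy()
action = policy(torch.randn(1, 3, 64, 64),   # RGB frame
                torch.randn(1, 2, 64, 64))   # event-frame stand-in
```

Keeping the encoders separate until the fusion stage is what lets the policy fall back on the event stream when the RGB channel degrades in low light.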
cs.RO / 55 / 2603.14401

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

OCRA:基于3D和触觉先验的人机交互动作转移的物体中心学习
Wang, Kuanning, Fan, Ke, Fu, Yuqian, Lin, Siyu, Luo, Hu, Seita, Daniel, Fu, Yanwei, Jiang, Yu-Gang, Xue, Xiangyang
Abstract
We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
Chinese Translation
我们提出了OCRA,一个基于视频的人到机器人动作迁移的物体中心框架,该框架直接从人类演示视频中学习,以实现稳健的操作。物体中心学习强调与任务相关的物体及其相互作用,同时过滤掉无关的背景,为教导机器人提供了一种自然且可扩展的方法。OCRA利用多视角RGB视频、最先进的3D基础模型VGGT以及先进的检测和分割模型来重建物体中心的3D点云,捕捉物体之间丰富的交互。为了处理视觉单独难以感知的特性,我们通过一个包含超过一百万张触觉图像的大规模数据集引入触觉先验。这些3D和触觉先验通过多模态模块(ResFiLM)融合,并输入到扩散策略中以生成稳健的操作动作。在仅使用视觉和视-触觉任务上的大量实验表明,OCRA显著优于现有基线和消融实验,证明了其从人类演示视频中学习的有效性。
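As a minimal illustration of an object-centric 3-D representation, the sketch below back-projects a segmentation mask and depth map into an object point cloud with a pinhole model. OCRA itself reconstructs geometry from multi-view RGB with VGGT rather than from a depth sensor, so treat this single-view version purely as an analogue; the function name and intrinsics are hypothetical.

import numpy as np

def lift_mask_to_pointcloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into a 3-D object-centric point cloud."""
    v, u = np.nonzero(mask)          # pixel rows/cols belonging to the object
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3)

depth = np.random.uniform(0.5, 2.0, (480, 640))
mask = np.zeros((480, 640), dtype=bool); mask[200:280, 300:380] = True
cloud = lift_mask_to_pointcloud(depth, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)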
cs.RO / 56 / 2603.14457

Towards Versatile Opti-Acoustic Sensor Fusion and Volumetric Mapping

朝向多功能光声传感器融合与体积映射
Collado-Gonzalez, Ivana, McConnell, John, Englot, Brendan
Abstract
Accurate 3D volumetric mapping is critical for autonomous underwater vehicles operating in obstacle-rich environments. Vision-based perception provides high-resolution data but fails in turbid conditions, while sonar is robust to lighting and turbidity but suffers from low resolution and elevation ambiguity. This paper presents a volumetric mapping framework that fuses a stereo sonar pair with a monocular camera to enable safe navigation under varying visibility conditions. Overlapping sonar fields of view resolve elevation ambiguity, producing fully defined 3D point clouds at each time step. The framework identifies regions of interest in camera images, associates them with corresponding sonar returns, and combines sonar range with camera-derived elevation cues to generate additional 3D points. Each 3D point is assigned a confidence value reflecting its reliability. These confidence-weighted points are fused using a Gaussian Process Volumetric Mapping framework that prioritizes the most reliable measurements. Experimental comparisons with other opti-acoustic and sonar-based approaches, along with field tests in a marina environment, demonstrate the method's effectiveness in capturing complex geometries and preserving critical information for robot navigation in both clear and turbid conditions. Our code is open-source to support community adoption.
Chinese Translation
准确的三维体积映射对于在障碍物丰富环境中操作的自主水下航行器至关重要。基于视觉的感知提供了高分辨率的数据,但在浑浊条件下表现不佳,而声纳在光照和浑浊度方面具有鲁棒性,但分辨率低且存在高度模糊性。本文提出了一种体积映射框架,通过将立体声纳对与单目相机融合,以实现不同能见度条件下的安全导航。重叠的声纳视场解决了高度模糊性,在每个时间步生成完全定义的三维点云。该框架识别相机图像中的兴趣区域,将其与相应的声纳回波关联,并结合声纳测距与相机获取的高度线索生成额外的三维点。每个三维点被分配一个反映其可靠性的置信值。这些置信加权的点通过高斯过程体积映射框架进行融合,优先考虑最可靠的测量结果。与其他光声方法和基于声纳的方法的实验比较,以及在码头环境中的实地测试,证明了该方法在捕捉复杂几何形状和保留机器人导航所需的关键信息方面的有效性,无论是在清晰还是浑浊的条件下。我们的代码是开源的,以支持社区的采用。
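The confidence-weighted fusion step can be sketched with an off-the-shelf Gaussian process regressor by mapping per-point confidence to heteroscedastic observation noise (scikit-learn's per-sample alpha). The 1-D toy data and the confidence-to-noise mapping are assumptions for illustration; the paper's framework operates on volumetric maps.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fused opti-acoustic points with per-point confidence in [0, 1];
# low confidence -> large observation noise, so the GP trusts it less.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (60, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 60)
conf = rng.uniform(0.2, 1.0, 60)
noise = (1.0 - conf) * 0.5 + 1e-3       # illustrative confidence-to-noise map

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=noise)
gp.fit(X, y)
mean, std = gp.predict(np.linspace(0, 10, 100)[:, None], return_std=True)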
cs.RO / 57 / 2603.14469

Physics-Informed Policy Optimization via Analytic Dynamics Regularization

通过解析动力学正则化的物理信息策略优化
Chandra, Namai, Mohan, Liu, Gu, Zhihao, Wang, Lin
Abstract
Reinforcement learning (RL) has achieved strong performance in robotic control; however, state-of-the-art policy learning methods, such as actor-critic methods, still suffer from high sample complexity and often produce physically inconsistent actions. This limitation stems from neural policies implicitly rediscovering complex physics from data alone, despite accurate dynamics models being readily available in simulators. In this paper, we introduce a novel physics-informed RL framework, called PIPER, that seamlessly integrates physical constraints directly into neural policy optimization with analytical soft physics constraints. At the core of our method is the integration of a differentiable Lagrangian residual as a regularization term within the actor's objective. This residual, extracted from a robot's simulator description, subtly biases policy updates towards dynamically consistent solutions. Crucially, this physics integration is realized through an additional loss term during policy optimization, requiring no alterations to existing simulators or core RL algorithms. Extensive experiments demonstrate that our method significantly improves learning efficiency, stability, and control accuracy, establishing a new paradigm for efficient and physically consistent robotic control.
Chinese Translation
强化学习(RL)在机器人控制中取得了强劲的表现;然而,最先进的策略学习方法,如演员-评论家方法,仍然面临高样本复杂度的问题,并且常常产生物理上不一致的动作。这一限制源于神经策略仅依赖数据重新发现复杂物理现象,尽管在模拟器中已有准确的动力学模型可用。本文提出了一种新颖的物理信息强化学习框架,称为 PIPER,它将物理约束无缝地集成到神经策略优化中,采用解析软物理约束。我们方法的核心是将可微分的拉格朗日残差作为正则化项集成到演员的目标中。该残差从机器人的模拟器描述中提取,微妙地偏向策略更新朝向动态一致的解。至关重要的是,这种物理集成是在策略优化过程中通过额外的损失项实现的,无需对现有模拟器或核心 RL 算法进行任何修改。大量实验表明,我们的方法显著提高了学习效率、稳定性和控制精度,为高效且物理一致的机器人控制建立了新的范式。
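A minimal sketch of the core idea, under strong simplifying assumptions: augment a policy-gradient actor loss with the squared residual of an analytic dynamics model. The 1-DoF pendulum residual, the placeholder log-probability, and the weight lambda below are all illustrative; PIPER extracts the Lagrangian residual from the robot's simulator description.

import torch

def lagrangian_residual(q, qd, qdd, tau, m=1.0, l=1.0, g=9.81):
    """Toy 1-DoF pendulum residual r = M(q) qdd + G(q) - tau (stand-in for
    the residual PIPER derives from the simulator's robot description)."""
    return m * l**2 * qdd + m * g * l * torch.sin(q) - tau

q, qd, qdd = torch.randn(32), torch.randn(32), torch.randn(32)
tau = torch.randn(32, requires_grad=True)        # actions = joint torques
advantage = torch.randn(32)
log_prob = -0.5 * tau**2                          # placeholder policy log-prob
lam = 0.1                                         # regularization weight (assumption)
actor_loss = (-(advantage * log_prob).mean()
              + lam * lagrangian_residual(q, qd, qdd, tau).pow(2).mean())
actor_loss.backward()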
cs.RO / 58 / 2603.14498

R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

R3DP:实时3D感知策略用于具身操控
Zhang, Yuhao, Dong, Wanxi, Shi, Yue, Liang, Yi, Gao, Jingnan, Yang, Qiaochu, Lyu, Yaxing, Liang, Zhixuan, Liu, Yibin, Xu, Congsheng, Guo, Xianda, Sui, Wei, Jin, Yaohui, Yang, Xiaokang, Xu, Yanyan, Mu, Yao
Abstract
Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to exploit temporal correlations, TFPNet explicitly improves task success rates through consistent feature estimation. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
Chinese Translation
具身操控需要对物体及其空间关系进行准确的3D理解,以计划和执行丰富接触的动作。虽然大规模的3D视觉模型提供了强大的先验知识,但其计算成本使得实时控制面临显著的延迟。我们提出了实时3D感知策略(R3DP),该策略在不牺牲实时性能的情况下,将强大的3D先验知识整合到操控策略中。R3DP的核心创新是异步快慢协作模块,该模块能够无缝地将大规模的3D先验知识整合到策略中,而不会影响实时性能。该系统通过仅在稀疏关键帧上查询预训练的慢系统(VGGT),保持实时效率,同时采用轻量级的时间特征预测网络(TFPNet)为所有中间帧预测特征。通过利用历史数据以挖掘时间相关性,TFPNet通过一致的特征估计显著提高任务成功率。此外,为了实现更有效的多视图融合,我们引入了多视图特征融合器(MVFF),该融合器通过明确结合相机内参和外参来聚合各视图的特征。R3DP为将大模型整合到实时推理系统中提供了即插即用的解决方案。我们在不同视觉配置下评估了R3DP与多个基线的表现。R3DP有效利用大规模3D先验,实现了卓越的结果,在平均成功率上分别比单视图和多视图DP提高了32.9%和51.4%。此外,通过将重型3D推理与策略执行解耦,R3DP在推理时间上相比于天真地集成DP+VGGT减少了44.8%。
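The asynchronous fast-slow scheduling reduces, at its simplest, to querying the heavy model only every K-th frame and letting a cheap predictor fill the gaps. The sketch below captures only that control flow with stub callables; in the real system the slow branch runs asynchronously and the predictor (TFPNet in the paper) is learned.

import time

def run_fast_slow(frames, keyframe_interval=8):
    """Sketch of R3DP-style scheduling with stub components."""
    def slow_3d_model(frame):            # expensive 3-D prior (VGGT in the paper); stub
        time.sleep(0.05); return {"feat": frame, "key": True}
    def fast_predictor(history, frame):  # cheap temporal feature predictor; stub
        return {"feat": frame, "key": False}

    history, features = [], []
    for t, frame in enumerate(frames):
        if t % keyframe_interval == 0:
            feat = slow_3d_model(frame)      # refresh 3-D priors on sparse keyframes
        else:
            feat = fast_predictor(history, frame)
        history.append(feat)
        features.append(feat)
    return features

feats = run_fast_slow(list(range(32)))
print(sum(f["key"] for f in feats), "keyframe queries for 32 frames")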
cs.RO / 59 / 2603.14522

One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation

一策适用所有:面向几何的动作潜变量用于跨体现操控
Mu, Juncheng, Yang, Sizhe, Bae, Hojin, Jia, Feiyu, Ben, Qingwei, Li, Boyi, Xu, Huazhe, Pang, Jiangmiao
Abstract
Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework that enables learning a single, versatile policy across multiple embodiments. We first learn a Geometry-Aware Latent Representation (GaLR), which leverages 3D convolution networks and transformers to build a shared latent action space across different embodiments. Then we design a unified latent retargeting decoder that extracts embodiment-specific actions from the latent representations, without any embodiment-specific decoder tuning. OPFA enables end-to-end co-training of data from diverse embodiments, including various grippers and dexterous hands with arbitrary degrees of freedom, significantly improving data efficiency and reducing the cost of skill transfer. We conduct extensive experiments across 11 different end-effectors. The results demonstrate that OPFA significantly improves policy performance in diverse settings by leveraging heterogeneous embodiment data. For instance, cross-embodiment co-training can improve success rates by more than 50% compared to single-source training. Moreover, by adding only a few demonstrations from a new embodiment (e.g., eight), OPFA can achieve performance comparable to that of a well-trained model with 72 demonstrations.
Chinese Translation
跨体现操控对于提升机器人操控的可扩展性和降低数据收集的高成本至关重要。然而,体现之间的显著差异,例如动作空间的变化和结构差异,给跨多个数据源的联合训练带来了挑战。为了解决这一问题,我们提出了一种框架——一策适用所有(One-Policy-Fits-All, OPFA),该框架能够在多个体现中学习一个单一的、多功能的策略。我们首先学习一种面向几何的潜在表示(Geometry-Aware Latent Representation, GaLR),该表示利用3D卷积网络和变换器构建跨不同体现的共享潜在动作空间。然后,我们设计了一个统一的潜在重定向解码器,从潜在表示中提取特定于体现的动作,而无需对任何特定于体现的解码器进行调优。OPFA实现了来自不同体现的数据的端到端共同训练,包括各种夹持器和具有任意自由度的灵巧手,显著提高了数据效率并降低了技能转移的成本。我们在11种不同的末端执行器上进行了广泛的实验。结果表明,OPFA通过利用异构体现数据显著提高了在不同环境下的策略性能。例如,与单一来源训练相比,跨体现共同训练可以将成功率提高超过50%。此外,仅通过添加来自新体现的少量演示(例如,八个),OPFA就能实现与经过72个演示的良好训练模型相当的性能。
cs.RO / 60 / 2603.14524

Architecting Autonomy for Safe Microgravity Free-Flyer Inspection

为安全微重力自由飞行器检查构建自主系统
Albee, Keenan, Sternberg, David C., Hansson, Alexander, Schwartz, David, Majumdar, Ritwik, Jia-Richards, Oliver
Abstract
Small free-flying spacecraft can provide vital extravehicular activity (EVA) services like inspection and repair for future orbital outposts like the Lunar Gateway. Operating adjacent to delicate space station and microgravity targets, these spacecraft require formalization to describe the autonomy that a free-flyer inspection mission must provide. This work explores the transformation of general mission requirements for this class of free-flyer into a set of concrete decisions for the planning and control autonomy architectures that will power such missions. Flowing down from operator commands for inspection of important regions and mission time-criticality, a motion planning problem emerges that provides the basis for developing autonomy solutions. Unique constraints are considered such as velocity limitations, pointing, and keep-in/keep-out zones, with mission fallback techniques for providing hierarchical safety guarantees under model uncertainties and failure. Planning considerations such as cost function design and path vs. trajectory control are discussed. The typical inputs and outputs of the planning and control autonomy stack of such a mission are also provided. Notional system requirements such as solve times and propellant use are documented to inform planning and control design. The entire proposed autonomy framework for free-flyer inspection is realized in the SmallSatSim simulation environment, providing a reference example of free-flyer inspection autonomy. The proposed autonomy architecture serves as a blueprint for future implementations of small satellite autonomous inspection in proximity to mission-critical hardware, going beyond the existing literature in terms of both (1) providing realistic system requirements for an autonomous inspection mission and (2) translating these requirements into autonomy design decisions for inspection planning and control.
Chinese Translation
小型自由飞行航天器能够为未来的轨道前哨站(如月球门户)提供重要的舱外活动(EVA)服务,如检查和维修。由于这些航天器在精密的空间站和微重力目标附近操作,因此需要对自由飞行检查任务所需的自主性进行正式化描述。本研究探讨了将此类自由飞行器的一般任务要求转化为一组具体决策,以规划和控制自主架构,从而支持此类任务的实施。基于操作员对重要区域的检查命令和任务时间的紧迫性,出现了一个运动规划问题,为开发自主解决方案提供了基础。考虑了独特的约束条件,如速度限制、指向和保持区域/禁止区域,并提出了在模型不确定性和故障情况下提供分层安全保障的任务回退技术。讨论了规划考虑因素,如成本函数设计和路径与轨迹控制。此外,还提供了此类任务的规划和控制自主堆栈的典型输入和输出。记录了诸如求解时间和推进剂使用等概念性系统要求,以便为规划和控制设计提供参考。整个提出的自由飞行器检查自主框架在SmallSatSim仿真环境中得以实现,提供了自由飞行器检查自主性的参考示例。所提出的自主架构为未来小型卫星在关键任务硬件附近的自主检查实施提供了蓝图,超越了现有文献,在(1)为自主检查任务提供现实的系统要求和(2)将这些要求转化为检查规划和控制的自主设计决策方面均有所突破。
cs.RO / 61 / 2603.14529

Bots and Blocks: Presenting a project-based approach for robotics education

机器人与积木:提出一种基于项目的机器人教育方法
Geger, Tobias, Briechle, Dominique, Rausch, Andreas
Abstract
To prepare students for upcoming trends and challenges, it is important to teach them about the helpful and important aspects of modern technologies, such as robotics. However, classic study programs often fail to prepare students for working in the industry because of the lack of practical experience caused by solely theoretical lecturing. The challenge is to teach both practical and theoretical skills interactively to improve the students' learning. In the scope of this paper, a project-based learning approach is proposed, where students are taught in an agile, semester-spanning project how to work with robots. This project is part of the applied computer science degree study program Digital Technologies. The paper presents the framework as well as an exemplary project featuring the development of a disassembly software ecosystem for hardware robots. In the project, the students are taught the programming of robots with the help of the Robot Operating System (ROS). To ensure the base qualifications, the students are taught in so-called schools, an interactive mix of lectures and exercises. At the beginning of the course, the basics of the technologies are covered, while the students work more and more in their team with the robot on a specific use case. The use case here is to automate the disassembly of building-block assemblies.
Chinese Translation
为了使学生能够应对未来的趋势和挑战,教授他们现代技术(如机器人)的有益和重要方面至关重要。然而,传统的学习项目往往无法使学生为进入行业做好准备,因为仅依靠理论讲授导致缺乏实践经验。挑战在于以互动的方式教授实践和理论技能,以提高学生的学习效果。本文提出了一种基于项目的学习方法,在这一方法中,学生通过一个跨学期的敏捷项目学习如何与机器人合作。该项目是应用计算机科学学位课程数字技术的一部分。本文展示了该框架以及一个示例项目,内容为开发硬件机器人拆解软件生态系统。在项目中,学生在机器人操作系统(Robot Operating System, ROS)的帮助下学习机器人编程。为了确保基础资格,学生在所谓的学校中接受教学,这是一种讲座与练习的互动混合形式。在课程开始时,学生学习技术基础知识,而后他们在团队中逐渐与机器人合作,针对特定的使用案例进行工作。这里的使用案例是自动化构建块组件的拆解。
cs.RO / 62 / 2603.14554

MorFiC: Fixing Value Miscalibration for Zero-Shot Quadruped Transfer

MorFiC:修正零样本四足机器人迁移中的价值误校准
Mishra, Prakhar, Raj, Amir Hossain, Xiao, Xuesu, Manocha, Dinesh
Abstract
Generalizing learned locomotion policies across quadrupedal robots with different morphologies remains a challenge. Policies trained on a single robot often break when deployed on embodiments with different mass distributions, kinematics, joint limits, or actuation constraints, forcing per-robot retraining. We present MorFiC, a reinforcement learning approach for zero-shot cross-morphology locomotion using a single shared policy. MorFiC resolves a key failure mode in multi-morphology actor-critic training: a shared critic tends to average incompatible value targets across embodiments, yielding miscalibrated advantages. To address this, MorFiC conditions the critic via morphology-aware modulation driven by robot physical and control parameters, generating morphology-specific value estimates within a shared network. Trained on a single source robot with morphology randomization in simulation, MorFiC can transfer to unseen robots and surpasses morphology-conditioned PPO baselines by improving stable average speed and longest stable run on multiple targets, including speed gains of +16.1% on A1, ~2x on Cheetah, and ~5x on B1. We additionally show that MorFiC reduces the value-prediction error variance across morphologies and stabilizes the advantage estimates, demonstrating that the improved value-function calibration corresponds to a stronger transfer performance. Finally, we demonstrate zero-shot deployment on two Unitree robots (Go1 and Go2) without fine-tuning, indicating that critic-side conditioning is a practical approach for cross-morphology generalization.
Chinese Translation
在不同形态的四足机器人之间推广学习到的运动策略仍然是一项挑战。在单一机器人上训练的策略在部署到具有不同质量分布、运动学、关节限制或驱动约束的实体时往往会失效,迫使每个机器人都需要重新训练。我们提出了MorFiC,这是一种用于零样本跨形态运动的强化学习方法,使用单一共享策略。MorFiC解决了多形态演员-评论家训练中的一个关键失效模式:共享评论家倾向于在不同实体之间平均不兼容的价值目标,从而导致误校准的优势。为了解决这个问题,MorFiC通过由机器人物理和控制参数驱动的形态感知调制来调整评论家,在共享网络内生成特定于形态的价值估计。在模拟中使用具有形态随机化的单一源机器人进行训练后,MorFiC能够转移到未见过的机器人,并在多个目标上超越基于形态条件的PPO基线,改善了稳定的平均速度和最长稳定运行,包括在A1上速度提高了+16.1%,在Cheetah上约提高了2倍,在B1上约提高了5倍。此外,我们还展示了MorFiC减少了跨形态的价值预测误差方差,并稳定了优势估计,证明了改进的价值函数校准与更强的转移性能相关。最后,我们展示了在两台Unitree Go1和Go2机器人上进行零样本部署,而无需微调,这表明评论家侧的调节是一种有效的跨形态推广方法。
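The critic-side conditioning can be sketched as a FiLM-style modulation of the critic's hidden features by a vector of morphology parameters, so one shared network yields morphology-specific value estimates. Dimensions and the exact modulation form are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class MorphologyConditionedCritic(nn.Module):
    """Critic whose hidden features are modulated by morphology parameters
    (masses, limb lengths, joint limits, ...); sizes are illustrative."""
    def __init__(self, obs_dim=48, morph_dim=12, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.film = nn.Linear(morph_dim, 2 * hidden)   # -> per-unit (gamma, beta)
        self.value = nn.Linear(hidden, 1)

    def forward(self, obs, morph):
        h = self.trunk(obs)
        gamma, beta = self.film(morph).chunk(2, dim=-1)
        return self.value(gamma * h + beta)            # morphology-specific value

critic = MorphologyConditionedCritic()
v = critic(torch.randn(16, 48), torch.randn(16, 12))
print(v.shape)  # (16, 1)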
cs.RO / 63 / 2603.14598

SmallSatSim: A High-Fidelity Simulation and Training Toolkit for Microgravity Robotic Close Proximity Operations

SmallSatSim:用于微重力机器人近距离操作的高保真仿真与训练工具包
Schwartz, David, Hansson, Alexander, Bodmer, Sabrina, Sternberg, David, Jia-Richards, Oliver, Albee, Keenan
Abstract
Microgravity rendezvous and close proximity operations (RPO) is a growing area of interest for applications spanning in-space assembly and manufacturing (ISAM), orbital debris remediation, and small body exploration. Microgravity environments present unique challenges for robotic control and planning algorithms for new agile RPO mission scenarios like free-floating manipulation, planning under failure, and estimating high-fidelity dynamics of tumbling bodies. To facilitate the development and testing of novel RPO algorithms, we introduce SmallSatSim, a high-fidelity simulation toolkit that leverages the MuJoCo physics engine to accurately model small satellite RPO dynamics in local microgravity robotic free-flight settings, including under model disturbances and perturbations. The framework includes cutting-edge, out-of-the-box free-flyer control techniques. A GPU-accelerated pipeline using MuJoCo MJX and JAX is implemented for sampling- and learning-based simulation use cases. SmallSatSim also supports configurable failure models, enabling the evaluation of safe control strategies under adversarial conditions. Visualization, logging, and GPU-enabled parallelization further enhance SmallSatSim's capability for RPO testing. We outline SmallSatSim's features and intended use cases, and demonstrate its use for robotic RPO planning and control. The open-sourced toolkit aims to accelerate research in autonomous, agile robotic small satellite operations.
Chinese Translation
微重力交会与近距离操作(RPO)是一个日益受到关注的领域,应用范围包括在轨组装与制造(ISAM)、轨道碎片清理以及小天体探索。微重力环境对机器人控制和规划算法提出了独特的挑战,特别是在自由漂浮操作、故障下规划以及估计翻滚体的高保真动力学等新型灵活 RPO 任务场景中。为了促进新型 RPO 算法的开发与测试,我们推出了 SmallSatSim,这是一款高保真的仿真工具包,利用 MuJoCo 物理引擎准确建模小卫星在局部微重力机器人自由飞行环境中的 RPO 动力学,包括在模型扰动和干扰下的情况。该框架包含先进的即插即用自由飞行器控制技术。我们实现了一个使用 MuJoCo MJX 和 JAX 的 GPU 加速管道,用于基于采样和学习的仿真应用。SmallSatSim 还支持可配置的故障模型,使得在不利条件下评估安全控制策略成为可能。可视化、日志记录和 GPU 加速并行化进一步增强了 SmallSatSim 在 RPO 测试中的能力。我们概述了 SmallSatSim 的特性和预期使用案例,并展示了其在机器人 RPO 规划与控制中的应用。该开源工具包旨在加速自主、灵活的机器人小卫星操作研究。
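For readers unfamiliar with the MJX pattern, a batched stepping loop looks roughly like the sketch below (MuJoCo >= 3.x). The one-body XML is a stand-in for SmallSatSim's own free-flyer models, and the velocity randomization is purely illustrative.

import jax
import mujoco
from mujoco import mjx

xml = '<mujoco><worldbody><body><freejoint/><geom size="0.1"/></body></worldbody></mujoco>'
model = mujoco.MjModel.from_xml_string(xml)
mjx_model = mjx.put_model(model)
template = mjx.make_data(mjx_model)

def randomize(rng):
    # Illustrative per-environment perturbation of initial velocities.
    return template.replace(qvel=0.01 * jax.random.normal(rng, template.qvel.shape))

batch = jax.vmap(randomize)(jax.random.split(jax.random.PRNGKey(0), 256))
step = jax.jit(jax.vmap(mjx.step, in_axes=(None, 0)))
batch = step(mjx_model, batch)   # 256 parallel free-flyer steps on the accelerator
print(batch.qpos.shape)          # (256, 7)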
cs.RO / 64 / 2603.14603

Latent Dynamics-Aware OOD Monitoring for Trajectory Prediction with Provable Guarantees

具有可证明保证的轨迹预测潜在动态感知OOD监测
Guo, Tongfei, Su, Lili
Abstract
In safety-critical Cyber-Physical Systems (CPS), accurate trajectory prediction provides vital guidance for downstream planning and control. Although deep learning models achieve high-fidelity forecasts on validation data, their reliability degrades under out-of-distribution (OOD) scenarios caused by environmental uncertainty or rare traffic behaviors in real-world deployment. Detecting such OOD events is challenging due to evolving traffic conditions and changing interaction patterns, while safety-critical applications demand formal guarantees on detection delay and false-alarm rates. This motivates us, following recent work [1], to formulate OOD monitoring for trajectory prediction as a quickest changepoint detection (QCD) problem, which offers a principled statistical framework with established theory. We further observe that the real-world evolution of prediction errors under in-distribution (ID) conditions can be effectively modeled by a Hidden Markov Model (HMM). Leveraging this structure, we extend the cumulative Maximum Mean Discrepancy approach to enable detection without requiring explicit knowledge of the post-change distribution, while still admitting provable guarantees on delay and false alarms. Experiments on three real-world driving datasets demonstrate reduced detection delay and robustness to heavy-tailed errors and unknown post-change conditions.
Chinese Translation
在安全关键的网络物理系统(CPS)中,准确的轨迹预测为下游规划和控制提供了重要指导。然而,尽管深度学习模型在验证数据上实现了高保真预测,但在由于环境不确定性或现实世界部署中稀有交通行为引起的分布外(OOD)场景下,其可靠性会下降;由于交通条件的不断变化和交互模式的改变,检测此类OOD事件具有挑战性,而安全关键应用则要求对检测延迟和误报率提供正式保证,这促使我们——基于近期的研究工作[1]——将轨迹预测的OOD监测形式化为一个最快变点检测(QCD)问题,该问题提供了一个具有既定理论的原则性统计框架;我们进一步观察到,在分布内(ID)条件下,预测误差的真实世界演变可以通过隐马尔可夫模型(HMM)有效建模,通过利用这一结构,我们扩展了累积最大均值差异(Maximum Mean Discrepancy)方法,以实现无需显式了解变更后分布的检测,同时仍然承认对延迟和误报的可证明保证,通过在三个真实世界驾驶数据集上的实验,展示了减少检测延迟和对重尾误差及未知变更后条件的鲁棒性。
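A stripped-down version of such a monitor, assuming a plain CUSUM over a kernel MMD statistic between a sliding window of prediction errors and an in-distribution reference sample; the paper's HMM modeling and theoretical thresholds are not reproduced, and window, drift, and threshold below are arbitrary tuning knobs.

import numpy as np

def rbf_mmd(x, y, sigma=1.0):
    """Biased squared-MMD estimate between two 1-D samples (RBF kernel)."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def cusum_mmd_monitor(errors, reference, window=30, drift=0.01, threshold=0.5):
    """Flag the first time the cumulative (MMD - drift) crosses a threshold."""
    s, stats = 0.0, []
    for t in range(window, len(errors)):
        s = max(0.0, s + rbf_mmd(errors[t - window:t], reference) - drift)
        stats.append(s)
        if s > threshold:
            return t, stats   # declared changepoint
    return None, stats

rng = np.random.default_rng(1)
ref = rng.normal(0, 0.1, 200)                          # ID prediction errors
stream = np.concatenate([rng.normal(0, 0.1, 150), rng.normal(0.5, 0.3, 100)])
print(cusum_mmd_monitor(stream, ref)[0])               # detection shortly after t=150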
cs.RO / 65 / 2603.14604

Tactile Modality Fusion for Vision-Language-Action Models

触觉模态融合用于视觉-语言-动作模型
Morissette, Charlotte, Abyaneh, Amin, Chang, Wei-Di, Houssaini, Anas, Meger, David, Lin, Hsiu-Chin, Tremblay, Jonathan, Dudek, Gregory
Abstract
We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
Chinese Translation
我们提出了TacFiLM,这是一种轻量级的模态融合方法,将视觉-触觉信号整合到视觉-语言-动作(VLA)模型中。尽管最近在VLA模型方面的进展引入了既具有可推广性又具有语义基础的机器人策略,但这些模型主要依赖于基于视觉的感知。然而,仅凭视觉无法捕捉在接触丰富的操作过程中发生的复杂交互动态,包括接触力、表面摩擦、顺应性和剪切力。虽然最近将触觉信号整合到VLA模型中的尝试通常通过标记连接或大规模预训练增加了复杂性,但行为模型的高计算需求需要更轻量级的融合策略。为了解决这些挑战,TacFiLM概述了一种后训练微调方法,该方法使用特征级线性调制(FiLM)将中间视觉特征条件化于预训练的触觉表示。插入任务的实验结果表明,在分布内和分布外任务中,成功率、直接插入性能、完成时间和力稳定性均有一致的改善。这些结果共同支持我们的方法作为将触觉信号有效整合到VLA模型中的一种有效方法,从而改善接触丰富的操作行为。
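Feature-wise linear modulation itself is compact enough to show directly: a tactile embedding produces per-channel scale and shift applied to intermediate visual feature maps. Channel counts are assumptions; the identity-centered (1 + gamma) variant is a common design choice that leaves the pretrained features unchanged at initialization, which suits post-training finetuning.

import torch
import torch.nn as nn

class TactileFiLM(nn.Module):
    """FiLM block: tactile embedding -> per-channel (gamma, beta) that
    modulate visual feature maps. Dimensions are illustrative assumptions."""
    def __init__(self, tactile_dim=64, channels=128):
        super().__init__()
        self.to_film = nn.Linear(tactile_dim, 2 * channels)

    def forward(self, visual_feat, tactile_emb):
        # visual_feat: (B, C, H, W); tactile_emb: (B, tactile_dim)
        gamma, beta = self.to_film(tactile_emb).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return (1 + gamma) * visual_feat + beta   # identity-centered modulation

film = TactileFiLM()
out = film(torch.randn(4, 128, 14, 14), torch.randn(4, 64))
print(out.shape)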
cs.RO / 66 / 2603.14605

CyboRacket: A Perception-to-Action Framework for Humanoid Racket Sports

CyboRacket:一种用于类人球拍运动的感知-行动框架
Ren, Peng, Qi, Chuan, Ge, Haoyang, Su, Qiyuan, He, Xuguo, Huang, Cong, Chi, Pei, Zhao, Jiang, Chen, Kai
Abstract
Dynamic ball-interaction tasks remain challenging for robots because they require tight perception-action coupling under limited reaction time. This challenge is especially pronounced in humanoid racket sports, where successful interception depends on accurate visual tracking, trajectory prediction, coordinated stepping, and stable whole-body striking. Existing robotic racket-sport systems often rely on external motion capture for state estimation or on task-specific low-level controllers that must be retrained across tasks and platforms. We present CyboRacket, a hierarchical perception-to-action framework for humanoid racket sports that integrates onboard visual perception, physics-based trajectory prediction, and large-scale pre-trained whole-body control. The framework uses onboard cameras to track the incoming object, predicts its future trajectory, and converts the estimated interception state into target end-effector and base-motion commands for whole-body execution by SONIC on the Unitree G1 humanoid robot. We evaluate the proposed framework in a vision-based humanoid tennis-hitting task. Experimental results demonstrate real-time visual tracking, trajectory prediction, and successful striking using purely onboard sensing.
Chinese Translation
动态球体交互任务对机器人仍然具有挑战性,因为它们要求在有限的反应时间内实现紧密的感知-行动耦合。这一挑战在类人球拍运动中尤为明显,成功的拦截依赖于准确的视觉跟踪、轨迹预测、协调的步伐和稳定的全身击打。现有的机器人球拍运动系统通常依赖于外部运动捕捉进行状态估计,或依赖于特定任务的低级控制器,这些控制器必须在不同任务和平台之间重新训练。我们提出了CyboRacket,这是一种用于类人球拍运动的分层感知-行动框架,集成了机载视觉感知、基于物理的轨迹预测和大规模预训练的全身控制。该框架使用机载摄像头跟踪来袭物体,预测其未来轨迹,并将估计的拦截状态转换为目标末端执行器和基础运动命令,以便由SONIC在Unitree G1类人机器人上进行全身执行。我们在基于视觉的类人网球击打任务中评估了该框架。实验结果表明,使用纯机载传感器实现了实时视觉跟踪、轨迹预测和成功击打。
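The physics-based prediction step can be illustrated with a drag-free ballistic integrator that reports where and when the ball crosses a hitting plane, which then budgets time for stepping and striking. Initial conditions and the plane location below are made up; the paper's predictor is more complete (e.g., drag, bounces).

import numpy as np

def predict_interception(p0, v0, plane_x, g=9.81, dt=0.002):
    """Integrate a ballistic trajectory until it crosses the plane x = plane_x."""
    p, v = np.array(p0, float), np.array(v0, float)
    t = 0.0
    while p[0] < plane_x and p[2] > 0:
        v[2] -= g * dt
        p += v * dt
        t += dt
    return p, t   # interception point and time budget for stepping + striking

point, t_hit = predict_interception(p0=[0, 0, 1.0], v0=[4.0, 0.2, 3.0], plane_x=3.0)
print(point, t_hit)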
cs.RO / 67 / 2603.14634

Physically Accurate Rigid-Body Dynamics in Particle-Based Simulation

基于粒子的模拟中的物理准确刚体动力学
Abderezaei, Ava, Nechyporenko, Nataliya, Miceli, Joseph, Briscoe-Martinez, Gilberto, Roncone, Alessandro
Abstract
Robotics demands simulation that can reason about the diversity of real-world physical interactions, from rigid to deformable objects and fluids. Current simulators address this by stitching together multiple subsolvers for different material types, resulting in a compositional architecture that complicates physical reasoning. Particle-based simulators offer a compelling alternative, representing all materials through a single unified formulation that enables seamless cross-material interactions. Among particle-based simulators, position-based dynamics (PBD) is a popular solver known for its computational efficiency and visual plausibility. However, its lack of physical accuracy has limited its adoption in robotics. To leverage the benefits of particle-based solvers while meeting the physical fidelity demands of robotics, we introduce PBD-R, a revised PBD formulation that enforces physically accurate rigid-body dynamics through a novel momentum-conservation constraint and a modified velocity update. Additionally, we introduce a solver-agnostic benchmark with analytical solutions to evaluate physical accuracy. Using this benchmark, we show that PBD-R significantly outperforms PBD and achieves competitive accuracy with MuJoCo while requiring less computation.
Chinese Translation
机器人技术需要能够推理现实世界中各种物理交互的模拟,从刚体到可变形物体以及流体。目前的模拟器通过将不同材料类型的多个子求解器拼接在一起来解决这个问题,导致了一个复杂的组合架构,增加了物理推理的难度。基于粒子的模拟器提供了一种引人注目的替代方案,通过单一统一的公式表示所有材料,从而实现无缝的跨材料交互。在基于粒子的模拟器中,基于位置的动力学(Position-Based Dynamics, PBD)是一种因其计算效率和视觉可信度而受到欢迎的求解器。然而,其缺乏物理准确性限制了其在机器人技术中的应用。为了利用基于粒子的求解器的优势,同时满足机器人技术对物理保真度的要求,我们提出了PBD-R,这是一种修订的PBD公式,通过一种新颖的动量守恒约束和修改的速度更新来强制实施物理准确的刚体动力学。此外,我们引入了一个与求解器无关的基准测试,提供解析解以评估物理准确性。通过这个基准测试,我们展示了PBD-R显著优于PBD,并在计算量更少的情况下与MuJoCo达到了竞争性的准确性。
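For context, the classic PBD building block being revised is the position-level constraint projection (Mueller et al.), shown below for a distance constraint. PBD-R's additions, a momentum-conservation constraint and a modified velocity update, are only noted in comments, not implemented here.

import numpy as np

def project_distance_constraint(x_i, x_j, w_i, w_j, rest_len, stiffness=1.0):
    """Classic PBD projection: move two particles along their connecting axis
    to satisfy |x_i - x_j| = rest_len, weighted by inverse masses w.
    (PBD-R keeps this position-level core but adds a momentum-conservation
    constraint and a modified velocity update, not reproduced here.)"""
    d = x_i - x_j
    dist = np.linalg.norm(d)
    c = dist - rest_len                      # constraint violation C(x)
    n = d / dist
    corr = stiffness * c / (w_i + w_j)
    return x_i - w_i * corr * n, x_j + w_j * corr * n

xi, xj = project_distance_constraint(np.array([0.0, 0, 0]), np.array([1.5, 0, 0]),
                                     w_i=1.0, w_j=1.0, rest_len=1.0)
print(np.linalg.norm(xi - xj))   # -> 1.0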
cs.RO / 68 / 2603.14639

Seeing Where to Deploy: Metric RGB-Based Traversability Analysis for Aerial-to-Ground Hidden Space Inspection

部署位置可视化:基于度量RGB的空中至地面隐蔽空间可通行性分析
Lee, Seoyoung, Shithil, Shaekh Mohammad, Pushp, Durgakant, Liu, Lantao, Wang, Zhangyang
Abstract
Inspection of confined infrastructure such as culverts often requires accessing hidden spaces whose entrances are reachable primarily from elevated viewpoints. Aerial-ground cooperation enables a UAV to deploy a compact UGV for interior exploration, but selecting a suitable deployment region from aerial observations requires metric terrain reasoning involving scale ambiguity, reconstruction uncertainty, and terrain semantics. We present a metric RGB-based geometric-semantic reconstruction and traversability analysis framework for aerial-to-ground hidden space inspection. A feed-forward multi-view RGB reconstruction backbone produces dense geometry, while temporally consistent semantic segmentation yields a 3D semantic map. To enable deployment-relevant measurements without LiDAR-based dense mapping, we introduce an embodied motion prior that recovers metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion. From the metrically grounded reconstruction, we construct a confidence-aware geometric-semantic traversability map and evaluate candidate deployment zones under explicit reachability constraints. Experiments on a tethered UAV-UGV platform demonstrate reliable deployment-zone identification in hidden space scenarios.
Chinese Translation
对如涵洞等受限基础设施的检查通常需要进入主要从高处视角可达的隐蔽空间。空中与地面的合作使得无人机(UAV)能够部署紧凑型地面无人驾驶车辆(UGV)进行内部探索,但从空中观察中选择合适的部署区域需要涉及尺度模糊、重建不确定性和地形语义的度量地形推理。我们提出了一种基于度量RGB的几何-语义重建和可通行性分析框架,用于空中至地面的隐蔽空间检查。前馈多视角RGB重建主干网络生成稠密几何信息,而时间一致的语义分割则生成3D语义图。为了在没有基于激光雷达的稠密映射的情况下实现与部署相关的测量,我们引入了一种具身运动先验,通过强制预测的相机运动与机载平台自运动之间的一致性来恢复度量尺度。基于度量基础的重建,我们构建了一个关注置信度的几何-语义可通行性图,并在明确的可达性约束下评估候选部署区域。在一个有绳无人机-地面无人驾驶车辆平台上的实验表明,在隐蔽空间场景中能够可靠地识别部署区域。
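The embodied motion prior boils down, in its simplest form, to a one-parameter least-squares alignment between up-to-scale predicted camera translations and metric platform egomotion: s* = <t_pred, t_odom> / <t_pred, t_pred>. The closed form and synthetic check below assume time-synchronized trajectories in a common frame; the paper enforces this consistency inside the reconstruction pipeline.

import numpy as np

def recover_metric_scale(pred_translations, odom_translations):
    """Least-squares scale aligning up-to-scale visual motion with metric egomotion."""
    p = np.asarray(pred_translations).ravel()
    o = np.asarray(odom_translations).ravel()
    return float(p @ o / (p @ p))

true_scale = 2.37
pred = np.random.randn(50, 3)                                   # up-to-scale motion
odom = true_scale * pred + np.random.normal(0, 0.01, (50, 3))   # metric odometry
print(recover_metric_scale(pred, odom))   # ~2.37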
cs.RO / 69 / 2603.14656

Coordinate-Independent Robot Model Identification

坐标无关的机器人模型识别
Yang, Yanhao, Hatton, Ross L.
Abstract
Robot model identification is commonly performed by least-squares regression on inverse dynamics, but existing formulations measure residuals directly in coordinate force space and therefore depend on the chosen coordinate chart, units, and scaling. This paper proposes a coordinate-independent identification method that weights inverse-dynamics residuals by the dual metric induced by the system Riemannian metric. Using the force-velocity vector-covector duality, the dual metric provides a physically meaningful normalization of generalized forces, pulling coordinate residuals back into the ambient mechanical space and eliminating coordinate-induced bias. The resulting objective remains convex through an affine-metric and Schur-complement reformulation, and is compatible with physical-consistency constraints and geometric regularization. Experiments on an inertia-dominated Crazyflie-pendulum system and a drag-dominated LandSalp robot show improved identification accuracy, especially on shape coordinates, in both low-data and high-data settings.
Chinese Translation
机器人模型识别通常通过对逆动力学进行最小二乘回归来实现,但现有的公式直接在坐标力空间中测量残差,因此依赖于所选择的坐标图、单位和缩放。本文提出了一种坐标无关的识别方法,通过系统的黎曼度量诱导的对偶度量对逆动力学残差进行加权。利用力-速度向量-共向量的对偶性,对偶度量提供了一种物理上有意义的广义力归一化,将坐标残差拉回到环境机械空间中,从而消除由坐标引起的偏差。通过仿射度量和舒尔补重构,得到的目标函数保持凸性,并与物理一致性约束和几何正则化兼容。在一个以惯性为主导的Crazyflie-摆系统和一个以阻力为主导的LandSalp机器人上的实验表明,在低数据和高数据环境下,识别精度得到了显著提高,特别是在形状坐标方面。
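The weighting idea can be sketched as generalized least squares with the inverse metric as weight: minimize (tau - Y theta)^T M^{-1} (tau - Y theta) over theta. The stand-in below uses a fixed diagonal SPD matrix for M, whereas the paper's metric is configuration-dependent and convexity is preserved through a reformulation.

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
Y = rng.normal(size=(n, p))                 # inverse-dynamics regressor (synthetic)
theta_true = np.array([1.0, 0.5, -0.3, 2.0])
tau = Y @ theta_true + rng.normal(0, 0.05, n)

M = np.diag(rng.uniform(0.5, 5.0, n))       # SPD stand-in for the stacked metric
W = np.linalg.inv(M)                        # dual-metric weighting of residuals
theta_hat = np.linalg.solve(Y.T @ W @ Y, Y.T @ W @ tau)
print(theta_hat)                            # ~theta_true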
cs.RO / 70 / 2603.14698

A Dual Quaternion Framework for Collision Recovery of Quadrotor

用于四旋翼碰撞恢复的双四元数框架
Gaucher, Valentin, Zhang, Wenlong
Abstract
Unmanned aerial vehicles (UAVs) operating in cluttered environments require accurate impact modeling to maintain stability. However, conventional contact models decouple linear and angular impulses, risking manifold inconsistency during rapid state transitions. This article presents a dual quaternion reset map that resolves rigid-body impacts directly on the SE(3) manifold. By operating on the unified spatial twist (linear and angular velocities as a single dual entity), our formulation is algebraically equivalent to the classical Newton impulse model while preserving manifold consistency during discrete state jumps. Building on this framework, we design a hybrid recovery controller that couples linear and angular momentum to ensure strict energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24% reduction in execution latency compared to an optimized matrix-based implementation. High-fidelity MuJoCo simulations validate the controller's robustness to complex contact dynamics, showing a 56.6% reduction in post-impact root-mean-square error (RMSE) and a 41.2% decrease in peak kinetic energy compared to decoupled recovery methods.
Chinese Translation
在复杂环境中操作的无人机(UAV)需要准确的碰撞建模以维持稳定性。然而,传统的接触模型将线性冲量与角冲量解耦,这在快速状态转换过程中可能导致流形不一致。本文提出了一种双四元数重置映射,能够直接在 SE(3) 流形上解决刚体碰撞问题。通过在统一的空间扭转(将线性和角速度作为单一的双重实体)上操作,我们的公式在代数上等价于经典的牛顿冲量模型,同时在离散状态跳跃过程中保持流形一致性。在此框架基础上,我们设计了一种混合恢复控制器,将线性和角动量耦合,以确保在碰撞过程中严格的能量耗散。硬件在环测试基准显示,与优化的基于矩阵的实现相比,执行延迟减少了 24%。高保真度的 MuJoCo 模拟验证了控制器对复杂接触动力学的鲁棒性,显示出与解耦恢复方法相比,碰撞后均方根误差(RMSE)减少了 56.6%,峰值动能降低了 41.2%。
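The underlying algebra is compact: with the dual unit eps (eps^2 = 0), the product (r_a + eps d_a)(r_b + eps d_b) = r_a r_b + eps (r_a d_b + d_a r_b) composes rotational and translational (or angular and linear twist) components in a single operation. A minimal NumPy version follows; the reset map itself is out of scope here.

import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = a; w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def dqmul(dq_a, dq_b):
    """Dual quaternion product: (r_a + eps d_a)(r_b + eps d_b)."""
    ra, da = dq_a; rb, db = dq_b
    return (qmul(ra, rb), qmul(ra, db) + qmul(da, rb))

identity = (np.array([1.0, 0, 0, 0]), np.zeros(4))
rot_z_90 = (np.array([np.cos(np.pi/4), 0, 0, np.sin(np.pi/4)]), np.zeros(4))
print(dqmul(rot_z_90, identity))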
cs.RO / 71 / 2603.14763

LiDAR-EVS: Enhance Extrapolated View Synthesis for 3D Gaussian Splatting with Pseudo-LiDAR Supervision

LiDAR-EVS:通过伪LiDAR监督增强3D高斯泼溅的外推视图合成
Huang, Yiming, Kang, Xin, Zhang, Sipeng, Ren, Hongliang, Zhang, Weihua, Lai, Junjie
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time LiDAR and camera synthesis in autonomous driving simulation. However, simulating LiDAR with 3DGS remains challenging for extrapolated views beyond the training trajectory: existing methods, typically trained on single-traversal sensor scans, suffer from severe overfitting and poor generalization to novel ego-vehicle paths. To enable reliable simulation of LiDAR along unseen driving trajectories without external multi-pass data, we present LiDAR-EVS, a lightweight framework for robust extrapolated-view LiDAR simulation in autonomous driving. Designed to be plug-and-play, LiDAR-EVS readily extends to diverse LiDAR sensors and neural rendering baselines with minimal modification. Our framework comprises two key components: (1) pseudo extrapolated-view point cloud supervision with multi-frame LiDAR fusion, view transformation, occlusion culling, and intensity adjustment; (2) spatially-constrained dropout regularization that promotes robustness to diverse trajectory variations encountered in real-world driving. Extensive experiments demonstrate that LiDAR-EVS achieves SOTA performance on extrapolated-view LiDAR synthesis across three datasets, making it a promising tool for data-driven simulation, closed-loop evaluation, and synthetic data generation in autonomous driving systems.
Chinese Translation
3D高斯泼溅(3DGS)已成为自动驾驶仿真中实时LiDAR与相机合成的强大技术。然而,使用3DGS模拟超出训练轨迹的外推视图仍然具有挑战性,因为现有方法通常是在单次遍历的传感器扫描上进行训练,容易出现严重的过拟合,并且对新颖的自我车辆路径的泛化能力较差。为了在未见过的驾驶轨迹上实现可靠的LiDAR模拟,而无需外部多次遍历数据,我们提出了LiDAR-EVS,这是一个轻量级框架,用于在自动驾驶中进行稳健的外推视图LiDAR模拟。LiDAR-EVS旨在即插即用,能够以最小的修改扩展到多种LiDAR传感器和神经渲染基线。我们的框架包含两个关键组件:(1)伪外推视图点云监督,结合多帧LiDAR融合、视图变换、遮挡剔除和强度调整;(2)空间约束的随机失活正则化,促进对真实世界驾驶中遇到的各种轨迹变化的稳健性。大量实验表明,LiDAR-EVS在三个数据集上实现了外推视图LiDAR合成的最先进性能,成为数据驱动仿真、闭环评估和自动驾驶系统中合成数据生成的有前景工具。
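One of the pseudo-supervision ingredients, occlusion-aware reprojection, can be approximated with a range-image z-buffer: transform the fused cloud into the pseudo pose and keep only the nearest return per angular cell. Grid resolution and vertical field of view below are illustrative assumptions.

import numpy as np

def render_pseudo_view(points, R, t, grid=(64, 512), fov_v=(-0.4, 0.1)):
    """Keep the nearest return per (elevation, azimuth) cell at a novel pose:
    a simple range-image z-buffer that culls occluded points."""
    p = points @ R.T + t                      # world -> pseudo-view frame
    rng_ = np.linalg.norm(p, axis=1)
    az = np.arctan2(p[:, 1], p[:, 0])
    el = np.arcsin(np.clip(p[:, 2] / np.maximum(rng_, 1e-9), -1, 1))
    i = ((el - fov_v[0]) / (fov_v[1] - fov_v[0]) * (grid[0] - 1)).astype(int)
    j = ((az + np.pi) / (2 * np.pi) * (grid[1] - 1)).astype(int)
    in_fov = (i >= 0) & (i < grid[0])
    zbuf = np.full(grid, np.inf)
    visible = np.zeros(len(points), dtype=bool)
    for idx in np.argsort(rng_):              # nearest-first wins each cell
        if in_fov[idx] and rng_[idx] < zbuf[i[idx], j[idx]]:
            zbuf[i[idx], j[idx]] = rng_[idx]
            visible[idx] = True
    return points[visible]

pts = np.random.uniform(-20, 20, (5000, 3))
vis = render_pseudo_view(pts, np.eye(3), np.array([0.0, 0, 0]))
print(len(vis), "of", len(pts), "points survive occlusion culling")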
cs.RO / 72 / 2603.14786

CORAL: COntextual Reasoning And Local Planning in A Hierarchical VLM Framework for Underwater Monitoring

CORAL:在层次化视觉语言模型框架下的上下文推理与局部规划用于水下监测
Wu, Zhenqi, Lu, Yuanjie, Xiao, Xuesu, Lin, Xiaomin
Abstract
Oyster reefs are critical ecosystem species that sustain biodiversity, filter water, and protect coastlines, yet they continue to decline globally. Restoring these ecosystems requires regular underwater monitoring to assess reef health, a task that remains costly, hazardous, and limited when performed by human divers. Autonomous underwater vehicles (AUVs) offer a promising alternative, but existing AUVs rely on geometry-based navigation that cannot interpret scene semantics. Recent vision-language models (VLMs) enable semantic reasoning for intelligent exploration, but existing VLM-driven systems adopt an end-to-end paradigm, introducing three key limitations. First, these systems require the VLM to generate every navigation decision, forcing frequent waits for inference. Second, VLMs cannot model robot dynamics, causing collisions in cluttered environments. Third, limited self-correction allows small deviations to accumulate into large path errors. To address these limitations, we propose CORAL, a framework that decouples high-level semantic reasoning from low-level reactive control. The VLM provides high-level exploration guidance by selecting waypoints, while a dynamics-based planner handles low-level collision-free execution. A geometric verification module validates waypoints and triggers replanning when needed. Compared with the previous state-of-the-art, CORAL improves coverage by 14.28 percentage points (17.85% relative), reduces collisions by 100%, and requires 57% fewer VLM calls.
Chinese Translation
牡蛎礁是维持生物多样性、过滤水体和保护海岸线的重要生态系统物种,但它们在全球范围内仍在持续下降。恢复这些生态系统需要定期的水下监测以评估礁体健康,而这一任务在人工潜水员执行时仍然成本高昂、危险且受限。自主水下航行器(AUV)提供了一种有前景的替代方案,但现有的AUV依赖于基于几何的导航,无法解读场景语义。近期的视觉语言模型(VLM)使得智能探索的语义推理成为可能,但现有的基于VLM的系统采用端到端的范式,带来了三个主要限制。首先,这些系统要求VLM生成每一个导航决策,导致频繁等待推理。其次,VLM无法建模机器人动态,导致在杂乱环境中发生碰撞。第三,有限的自我修正能力使得小偏差累积成大的路径误差。为了解决这些限制,我们提出了CORAL,一个将高层次语义推理与低层次反应控制解耦的框架。VLM通过选择航点提供高层次的探索指导,而基于动态的规划器则处理低层次的无碰撞执行。几何验证模块验证航点并在需要时触发重新规划。与之前的最先进技术相比,CORAL的覆盖率提高了14.28个百分点,或相对提高了17.85%,碰撞减少了100%,并且需要的VLM调用减少了57%。
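The decoupling is easiest to see as control flow: the VLM is consulted only when a new waypoint is needed or verification fails, while the local planner runs every tick. All three components below are stubs; only the loop structure reflects the paper's design.

import random

def coral_loop(steps=20, replan_every=5):
    """Control-flow sketch of CORAL-style hierarchy with stub components."""
    def vlm_select_waypoint(state):          # slow semantic reasoning (stub)
        return (state[0] + random.uniform(1, 3), state[1] + random.uniform(-1, 1))
    def verify_waypoint(state, wp):          # geometric feasibility check (stub)
        return wp[0] > state[0]              # e.g., reachable and collision-free
    def local_planner_step(state, wp):       # fast reactive, collision-free control (stub)
        return (state[0] + 0.5, state[1])

    state, wp, vlm_calls = (0.0, 0.0), None, 0
    for t in range(steps):
        if wp is None or t % replan_every == 0 or not verify_waypoint(state, wp):
            wp = vlm_select_waypoint(state)  # sparse, expensive query
            vlm_calls += 1
        state = local_planner_step(state, wp)
    return state, vlm_calls

print(coral_loop())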
cs.RO / 73 / 2603.14787

Exploring the dynamic properties and motion reproducibility of a small upper-body humanoid robot with 13-DOF pneumatic actuation for data-driven control

探索具有13自由度气动驱动的小型上半身类人机器人动态特性和运动可重复性以实现数据驱动控制
Atsuta, Hiroshi, Ishihara, Hisashi, Asada, Minoru
Abstract
Pneumatically-actuated anthropomorphic robots with high degrees of freedom (DOF) offer significant potential for physical human-robot interaction. However, precise control of pneumatic actuators is challenging due to their inherent nonlinearities. This paper presents the development of a compact 13-DOF upper-body humanoid robot. To assess the feasibility of an effective controller, we first investigate its key dynamic properties, such as actuation time delays, and confirm that the system exhibits highly reproducible behavior. Leveraging this reproducibility, we implement a preliminary data-driven controller for a 4-DOF arm subsystem based on a multilayer perceptron with explicit time delay compensation. The network was trained on random movement data to generate pressure commands for tracking arbitrary trajectories. Comparative evaluations with a traditional PID controller demonstrate superior trajectory tracking performance, highlighting the potential of data-driven approaches for controlling complex, high-DOF pneumatic robots.
Chinese Translation
具有高自由度(DOF)的气动驱动类人机器人在物理人机交互中具有显著潜力。然而,由于气动执行器固有的非线性特性,精确控制面临挑战。本文介绍了一种紧凑型13自由度上半身类人机器人的开发。为了评估有效控制器的可行性,我们首先研究其关键动态特性,如驱动时间延迟,并确认该系统表现出高度可重复的行为。利用这种可重复性,我们为一个基于多层感知器的4自由度臂子系统实施了初步的数据驱动控制器,并进行了显式时间延迟补偿。该网络在随机运动数据上进行训练,以生成跟踪任意轨迹的压力指令。与传统PID控制器的比较评估显示出更优的轨迹跟踪性能,突显了数据驱动方法在控制复杂高自由度气动机器人的潜力。
cs.RO / 74 / 2603.14789

GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions

GraspALL:在任何低光照条件下适应性结构补偿以应对照明变化的机器人服装抓取
Zhong, Haifeng, Han, Wenshuo, Wang, Zhouyu, Feng, Runyang, Tang, Fan, Lee, Tong-Yee, Fan, Zipei, Wu, Ruihai, Wang, Yuran, Dong, Hao, Chen, Hechang, Chang, Hyung Jin, Gao, Yixing
Abstract
Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots. However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness. Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model's adaptability to illumination changes when utilizing multimodal information. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature fusion between RGB and non-RGB modalities according to varying lighting intensities, thereby generating illumination-consistent grasping representations. Experiments on the self-built garment grasping dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baselines under diverse illumination conditions.
Chinese Translation
在动态变化的照明条件下实现准确的服装抓取对服务机器人全天候操作至关重要。然而,低光照场景中的照明减少严重削弱了服装的结构特征,导致抓取鲁棒性显著下降。现有方法通常通过利用非RGB模态的照明不变特性来增强RGB特征,但它们忽视了在不同照明条件下对非RGB特征的变化依赖,这可能导致非RGB线索的错位,从而削弱模型在利用多模态信息时对照明变化的适应性。为了解决这个问题,我们提出了GraspALL,一种照明-结构交互补偿模型。GraspALL的创新之处在于将连续的照明变化编码为定量参考,以指导根据不同照明强度进行RGB和非RGB模态之间的自适应特征融合,从而生成照明一致的抓取表示。在自建的服装抓取数据集上的实验表明,GraspALL在多种照明条件下相较于基线提高了32-44%的抓取准确率。
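A minimal stand-in for the illumination-conditioned fusion: encode a scalar illumination estimate into per-channel gating weights that blend RGB and non-RGB features, so darker scenes lean on the illumination-invariant branch. The gate architecture and the scalar "lux" input are assumptions; the paper's module is richer.

import torch
import torch.nn as nn

class IlluminationGatedFusion(nn.Module):
    """Blend RGB and non-RGB (e.g., depth/IR) features with an
    illumination-conditioned, per-channel gate; sizes are illustrative."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                                  nn.Linear(64, dim), nn.Sigmoid())

    def forward(self, rgb_feat, nonrgb_feat, lux):
        w = self.gate(lux)                     # per-channel weight in (0, 1)
        return w * rgb_feat + (1 - w) * nonrgb_feat

fuse = IlluminationGatedFusion()
out = fuse(torch.randn(8, 256), torch.randn(8, 256), torch.rand(8, 1))
print(out.shape)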
cs.RO / 75 / 2603.14809

A Unified Calibration Framework for Coordinate and Kinematic Parameters in Dual-Arm Robots

双臂机器人坐标和运动参数的统一标定框架
Huang, Tianyu, Yang, Bohan, Li, Bin, Li, Wenpan, Li, Haoang, Li, Wenlong, Liu, Yun-Hui
Abstract
Precise collaboration in vision-based dual-arm robot systems requires accurate system calibration. Recent dual-robot calibration methods have achieved strong performance by simultaneously solving multiple coordinate transformations. However, these methods either treat kinematic errors as implicit noise or handle them through separated error modeling, resulting in non-negligible accumulated errors. In this paper, we present a novel framework for unified calibration of the coordinate transformations and kinematic parameters in both robot arms. Our key idea is to unify all the tightly coupled parameters within a single Lie-algebraic formulation. To this end, we construct a consolidated error model grounded in the product-of-exponentials formula, which naturally integrates the coordinate and kinematic parameters in twist forms. Our model introduces no artificial error separation and thus greatly mitigates the error propagation. In addition, we derive a closed-form analytical Jacobian from this model using Lie derivatives. By exploring the Jacobian rank property, we analyze the identifiability of all calibration parameters and show that our joint optimization is well-posed under mild conditions. This enables off-the-shelf iterative solvers to stably optimize these parameters on the manifold space. Besides, to ensure robust convergence of our joint optimization, we develop a certifiably correct algorithm for initializing the unknown coordinates. Relying on semidefinite relaxation, our algorithm can yield a reliable estimate whose near-global optimality can be verified a posteriori. Extensive experiments validate the superior accuracy of our approach over previous baselines under identical visual measurements. Meanwhile, our certifiable initialization consistently outperforms several coordinate-only baselines, proving its reliability as a starting point for joint optimization.
Chinese Translation
基于视觉的双臂机器人系统的精确协作需要准确的系统标定。近期的双机器人标定方法通过同时解决多个坐标变换,取得了良好的性能。然而,这些方法要么将运动学误差视为隐式噪声,要么通过分离的误差建模来处理,导致不可忽视的累积误差。本文提出了一种新颖的框架,用于统一标定两个机器人臂的坐标变换和运动学参数。我们的关键思想是将所有紧密耦合的参数统一在一个单一的李代数形式中。为此,我们构建了一个基于指数积公式的综合误差模型,该模型自然地将坐标和运动学参数以扭转形式整合在一起。我们的模型没有引入人为的误差分离,从而大大减轻了误差传播。此外,我们利用李导数从该模型中推导出一个闭式解析雅可比矩阵。通过探索雅可比矩阵的秩属性,我们分析了所有标定参数的可识别性,并表明我们的联合优化在温和条件下是良定义的。这使得现成的迭代求解器能够在流形空间上稳定地优化这些参数。此外,为了确保我们联合优化的稳健收敛,我们开发了一种可证明正确的算法来初始化未知坐标。依赖于半正定松弛,我们的算法能够产生一个可靠的估计,其近似全局最优性可以事后验证。大量实验验证了我们的方法在相同视觉测量下相较于先前基线的优越准确性。同时,我们的可证明初始化始终优于多个仅标定坐标的基线方法,证明了其作为联合优化起点的可靠性。
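The product-of-exponentials machinery the error model is built on reduces to the SE(3) exponential of a twist. A self-contained NumPy version with a two-joint sanity check follows; the twists, home pose, and joint angles are illustrative, not from the paper.

import numpy as np

def exp_twist(xi, theta):
    """SE(3) exponential of a unit twist xi = (omega, v) scaled by theta,
    via Rodrigues' formula -- the building block of product-of-exponentials
    kinematics."""
    omega, v = xi[:3], xi[3:]
    W = np.array([[0, -omega[2], omega[1]],
                  [omega[2], 0, -omega[0]],
                  [-omega[1], omega[0], 0]])
    R = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)
    G = np.eye(3) * theta + (1 - np.cos(theta)) * W + (theta - np.sin(theta)) * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, G @ v
    return T

# Forward kinematics of a 2-joint chain: T = exp(xi1 th1) exp(xi2 th2) M.
xi1 = np.array([0, 0, 1, 0, 0, 0])            # revolute about z at the origin
xi2 = np.array([0, 0, 1, 0, -1.0, 0])         # revolute about z through x = 1
M = np.eye(4); M[0, 3] = 2.0                  # home pose of the tool
T = exp_twist(xi1, np.pi / 2) @ exp_twist(xi2, 0.0) @ M
print(np.round(T[:3, 3], 3))                  # -> [0, 2, 0]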
cs.RO / 76 / 2603.14811

Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

从自我到世界:通过强化学习实现具身系统中的协作空间推理
Zhou, Heng, Kang, Li, Qin, Yiran, Song, Xiufeng, Yu, Ao, Zhang, Zilu, Song, Haoming, Xu, Kaixin, Fan, Yuchen, Zhou, Dongzhan, Liu, Xiaohong, Zhang, Ruimao, Torr, Philip, Bai, Lei, Yin, Zhenfei
Abstract
Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
Chinese Translation
从分布式、部分视角理解世界是具身多智能体系统面临的一个基本挑战。每个智能体通过自我中心的视角感知环境,这种视角常常受到遮挡和模糊的限制。为研究这一问题,我们引入了Ego-to-World (E2W) 基准,评估视觉-语言模型在三个任务中融合异构视角的能力:(i)全局计数,(ii)关系位置推理,以及(iii)需要预测视角特定图像坐标的面向动作的抓取。为了解决这一设置,我们提出了CoRL,一个结合了链式思维(Chain-of-Thought)监督微调与使用组相对策略优化(Group-Relative Policy Optimization)的强化学习的两阶段框架。其核心组件,交叉视角空间奖励(Cross-View Spatial Reward, CVSR),通过将推理步骤与视觉证据关联,提供密集的任务对齐反馈,确保跨视角实体解析的一致性,并引导模型朝向正确的最终预测。E2W上的实验表明,CoRL在推理和感知基础指标上始终超越强大的专有和开源基准,而消融实验进一步确认了每个CVSR组件的必要性。此外,CoRL能够推广到外部空间推理基准,并实现有效的现实世界多机器人操作,配合校准的多摄像头设备,展示了跨视角定位和成功的抓取与放置执行。E2W和CoRL共同为从分布式自我中心观察中学习世界中心场景理解提供了一个原则性基础,推动了协作具身人工智能的发展。
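Group-Relative Policy Optimization's core trick is small enough to show inline: advantages are computed by normalizing each rollout's reward against its own group of samples for the same prompt, with no learned critic. The scalar rewards below merely mock CVSR scores, which are the paper's actual contribution.

import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each reward by its group's mean/std."""
    r = np.asarray(rewards, dtype=float)             # shape (G,) for one prompt
    return (r - r.mean()) / (r.std() + 1e-8)

group_rewards = [0.2, 0.9, 0.4, 0.9]                 # mock CVSR scores for 4 rollouts
print(group_relative_advantages(group_rewards))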
cs.RO / 77 / 2603.14852

Surgical Robot, Path Planning, Joint Space, Riemannian Manifolds

外科机器人、路径规划、关节空间、黎曼流形
Yamamoto, Yoshiki, Sogabe, Maina, Hirahara, Shunichi, Kaisaki, Toshiki, Miyazaki, Tetsuro, Kawashima, Kenji
Abstract
Robotic surgery for minimally invasive surgery can reduce the surgeon's workload by autonomously guiding robotic forceps. Movement of the robot is restricted around a fixed insertion port. The robot often encounters angle limitations during operation. Also, the surface of the abdominal cavity is non-concave, making it computationally expensive to find the desired path. In this work, to solve these problems, we propose a method for path planning in joint space by transforming the position into a Riemannian manifold. An edge cost function is defined to search for a desired path in the joint space and reduce the range of motion of the joints. We found that the organ is mostly non-concave, making it easy to find the optimal path using the gradient descent method. Experimental results demonstrated that the proposed method reduces the range of joint angle movement compared to calculations in position space.
Chinese Translation
用于微创手术的机器人手术可以通过自主引导机器人钳子来减轻外科医生的工作负担。机器人的运动受到固定插入端口的限制。在操作过程中,机器人经常会遇到角度限制。此外,腹腔表面是非凹的,这使得寻找所需路径的计算成本较高。在本研究中,为了解决这些问题,我们提出了一种通过将位置转换为黎曼流形来进行关节空间路径规划的方法。定义了一种边缘成本函数,以在关节空间中搜索所需路径并减少关节的运动范围。我们发现器官大多是非凹的,这使得使用梯度下降法容易找到最优路径。实验结果表明,与在位置空间中的计算相比,所提出的方法减少了关节角度运动的范围。
cs.RO / 78 / 2603.14887

ViSA: Visited-State Augmentation for Generalized Goal-Space Contrastive Reinforcement Learning

ViSA:用于广义目标空间对比强化学习的访问状态增强
Nakamura, Issa, Yamanokuchi, Tomoya, Kadokawa, Yuki, Qu, Jia, Otsub, Shun, Miyamoto, Ken, Miwa, Shotaro, Matsubara, Takamitsu
Abstract
Goal-Conditioned Reinforcement Learning (GCRL) is a framework for learning a policy that can reach arbitrarily given goals. In particular, Contrastive Reinforcement Learning (CRL) provides a framework for policy updates using an approximation of the value function estimated via contrastive learning, achieving higher sample efficiency compared to conventional methods. However, since CRL treats the visited state as a pseudo-goal during learning, it can accurately estimate the value function only for limited goals. To address this issue, we propose a novel data augmentation approach for CRL called ViSA (Visited-State Augmentation). ViSA consists of two components: 1) generating augmented state samples, with the aim of augmenting hard-to-visit state samples during on-policy exploration, and 2) learning consistent embedding space, which uses an augmented state as auxiliary information to regularize the embedding space by reformulating the objective function of the embedding space based on mutual information. We evaluate ViSA in simulation and real-world robotic tasks and show improved goal-space generalization, which permits accurate value estimation for hard-to-visit goals. Further details can be found on the project page: https://issa-n.github.io/projectPage_ViSA/
Chinese Translation
目标条件强化学习(GCRL)是一种学习能够达到任意给定目标的策略的框架。特别地,对比强化学习(CRL)提供了一种使用通过对比学习估计的价值函数近似进行策略更新的框架,相比于传统方法,能够实现更高的样本效率。然而,由于CRL在学习过程中将访问状态视为伪目标,因此只能对有限的目标准确估计价值函数。为了解决这一问题,我们提出了一种名为ViSA(访问状态增强)的新型数据增强方法。ViSA由两个组成部分构成:1)生成增强状态样本,旨在增强在策略探索过程中难以访问的状态样本;2)学习一致的嵌入空间,该部分使用增强状态作为辅助信息,通过基于互信息重新构造嵌入空间的目标函数来规范化嵌入空间。我们在模拟和真实世界的机器人任务中评估了ViSA,并展示了改进的目标空间泛化能力,使得对难以访问的目标能够进行准确的价值估计。更多细节请参见项目页面:https://issa-n.github.io/projectPage_ViSA/
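The contrastive critic at the heart of CRL can be sketched as two encoders trained with an InfoNCE objective in which matched (state-action, goal) pairs sit on the diagonal of a logit matrix; ViSA's augmented visited states would enter such a batch as additional pseudo-goals. All dimensions below are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCritic(nn.Module):
    """Embed (state, action) and goal separately; their inner product acts as
    a value-like score trained contrastively. Sizes are illustrative."""
    def __init__(self, s_dim=10, a_dim=3, g_dim=10, emb=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(s_dim + a_dim, 128), nn.ReLU(), nn.Linear(128, emb))
        self.psi = nn.Sequential(nn.Linear(g_dim, 128), nn.ReLU(), nn.Linear(128, emb))

    def forward(self, s, a, g):
        return self.phi(torch.cat([s, a], -1)) @ self.psi(g).T   # (B, B) logits

critic = ContrastiveCritic()
s, a, g = torch.randn(32, 10), torch.randn(32, 3), torch.randn(32, 10)
logits = critic(s, a, g)
loss = F.cross_entropy(logits, torch.arange(32))    # diagonal = positive pairs
loss.backward()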
cs.RO / 79 / 2603.14900

From Folding Mechanics to Robotic Function: A Unified Modeling Framework for Compliant Origami

从折叠力学到机器人功能:一种统一的柔性折纸建模框架
Zhang, Bohan, Wang, Bo, Ouyang, Huajiang, Wu, Zhigang, Bi, Haohao, Xu, Jiawei, Liu, Mingchao, Huang, Weicheng
Abstract
Origami-inspired architectures offer a powerful route toward lightweight, reconfigurable, and programmable robotic systems. Yet, a unified mechanics framework capable of seamlessly bridging rigid folding, elastic deformation, and stability-driven transitions in compliant origami remains lacking. Here, we introduce a geometry-consistent modeling framework based on discrete differential geometry (DDG) that unifies panel elasticity and crease rotation within a single variational formulation. By embedding crease-panel coupling directly into a mid-edge geometric discretization, the framework naturally captures rigid folding limits, distributed bending, multistability, and nonlinear dynamic snap-through within one mechanically consistent structure. This unified description enables programmable control of stability and deformation across rigid and compliant regimes, allowing origami structures to transition from static folding mechanisms to active robotic modules. An implicit dynamic formulation incorporating gravity, contact, friction, and magnetic actuation further supports strongly coupled multiphysics simulations. Through representative examples spanning single-fold bifurcation, deployable Miura membranes, bistable Waterbomb modules, and Kresling-based crawling robots, we demonstrate how geometry-driven mechanics directly informs robotic functionality. This work establishes discrete differential geometry as a foundational design language for intelligent origami robotics, enabling predictive modeling, stability programming, and mechanics-guided robotic actuation within a unified computational platform.
Chinese Translation
受折纸启发的结构为轻量化、可重构和可编程的机器人系统提供了一条强有力的途径。然而,目前仍缺乏一种统一的力学框架,能够无缝连接刚性折叠、弹性变形和驱动稳定性的过渡。本文介绍了一种基于离散微分几何(DDG)的几何一致性建模框架,该框架将面板弹性和折痕旋转统一在一个变分公式中。通过将折痕面板耦合直接嵌入中边几何离散化,该框架自然捕捉到刚性折叠极限、分布弯曲、多稳定性和非线性动态瞬态行为,形成一个力学一致的结构。这种统一的描述使得在刚性和柔性状态下对稳定性和变形的可编程控制成为可能,使折纸结构能够从静态折叠机制过渡到主动的机器人模块。一个隐式动态公式结合了重力、接触、摩擦和磁驱动,进一步支持强耦合的多物理场仿真。通过涵盖单折分岔、可展开的Miura膜、双稳态水弹模块和基于Kresling的爬行机器人等代表性实例,我们展示了几何驱动的力学如何直接影响机器人功能。本研究确立了离散微分几何作为智能折纸机器人设计的基础语言,使得在统一的计算平台上实现预测建模、稳定性编程和力学引导的机器人驱动成为可能。
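In LaTeX, the flavor of unified variational energy such a discrete formulation minimizes might be written as below; the specific discrete operators, stiffnesses, and quadratic forms are illustrative placeholders, not the paper's actual functional:

\[
E_{\text{total}}(\mathbf{x}) = \sum_{e \in \text{panel hinges}} \tfrac{k_b}{2}\,(\theta_e - \bar{\theta}_e)^2 + \sum_{c \in \text{creases}} \tfrac{k_c}{2}\,(\varphi_c - \varphi_c^{0})^2 + \sum_{(i,j)} \tfrac{k_s}{2}\,\big(\|\mathbf{x}_i - \mathbf{x}_j\| - \ell_{ij}\big)^2
\]

with equilibria from \(\nabla_{\mathbf{x}} E_{\text{total}} = 0\) and the rigid-folding limit recovered as \(k_b, k_s \to \infty\).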
cs.RO / 80 / 2603.14908

PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning

PerlAD:基于伪仿真的强化学习提升闭环端到端自主驾驶能力
Gao, Yinfeng, Zhang, Qichao, Liu, Deqing, Xia, Zhongpu, Li, Guang, Ma, Kun, Chen, Guang, Ye, Hangjun, Chen, Long, Ding, Da-Wei, Zhao, Dongbin
Abstract
End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering-based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.
Chinese Translation
基于模仿学习(Imitation Learning, IL)的端到端自主驾驶策略在闭环执行中常常面临挑战,这主要是由于不充分的开环训练目标与真实驾驶需求之间的错位。尽管强化学习(Reinforcement Learning, RL)通过奖励信号直接优化驾驶目标提供了解决方案,但基于渲染的训练环境引入了渲染差距,并且由于高计算成本而效率低下。为了解决这些挑战,我们提出了一种新颖的基于伪仿真的RL方法,称为PerlAD,旨在实现闭环端到端自主驾驶。PerlAD基于离线数据集构建了一个在向量空间中运行的伪仿真,从而实现高效的无渲染试错训练。为了弥合静态数据集与动态闭环环境之间的差距,PerlAD引入了一种预测世界模型,该模型生成基于自我车辆计划的反应性代理轨迹。此外,为了促进高效规划,PerlAD利用了一个层次解耦规划器,该规划器结合了IL用于横向路径生成和RL用于纵向速度优化。全面的实验结果表明,PerlAD在Bench2Drive基准测试中实现了最先进的性能,驾驶得分比之前的E2E RL方法提高了10.29%,且无需昂贵的在线交互。对DOS基准的额外评估进一步确认了其在处理安全关键遮挡场景中的可靠性。
cs.RO / 81 / 2603.14972

Learning from Mistakes: Post-Training for Driving VLA with Takeover Data

从错误中学习:基于接管数据的驾驶视觉-语言-动作后训练
Gao, Yinfeng, Liu, Deqing, Zhang, Qichao, Zheng, Yupeng, Tian, Haochen, Li, Guang, Ye, Hangjun, Chen, Long, Ding, Da-Wei, Zhao, Dongbin
Abstract
Current Vision-Language-Action (VLA) paradigms in end-to-end autonomous driving rely on offline training from static datasets, leaving them vulnerable to distribution shift. Recent post-training methods use takeover data to mitigate this by augmenting the dataset with high-quality expert takeover samples, yet they suffer from two key limitations: supervision restricted to the period after the takeover moments leads to policies with limited safety margins, and passive preference optimization lacks active exploration for optimal performance. In this paper, we propose TakeVLA, a novel VLA post-training framework that overcomes these shortcomings through two complementary innovations. First, we introduce pre-takeover language supervision, which allows the VLA to learn from mistakes proactively. By explicitly teaching the model about what to do in error-prone situations, we cultivate a precautionary mindset that anticipates hazards early and substantially enlarges safety margins. Second, we propose Scenario Dreaming, a reinforcement fine-tuning paradigm that operates in reconstructed takeover scenarios, encouraging active exploration beyond mere preference fitting. Experiments on the Bench2Drive benchmark demonstrate that TakeVLA achieves state-of-the-art closed-loop performance, surpassing the strong VLA baseline SimLingo by 4.93 in driving score, with an enhanced safety margin as evidenced by an 11.76% increase in average TTC.
Chinese Translation
目前的端到端自主驾驶视觉-语言-动作(VLA)范式依赖于来自静态数据集的离线训练,导致其对分布转移的脆弱性。最近的后训练方法使用接管数据来减轻这一问题,通过用高质量的专家接管样本增强数据集,但它们面临两个关键限制:监督仅限于接管时刻后的时间段,导致策略的安全边际有限;而被动的偏好优化缺乏对最佳性能的主动探索。本文提出了TakeVLA,这是一种新颖的VLA后训练框架,通过两个互补的创新克服了这些缺点。首先,我们引入了接管前语言监督,使VLA能够主动从错误中学习。通过明确教导模型在易出错的情况下该如何应对,我们培养了一种预防性的思维方式,能够提前预见危险并显著扩大安全边际。其次,我们提出了情境梦境(Scenario Dreaming),这是一种在重新构建的接管场景中进行强化微调的范式,鼓励主动探索,而不仅仅是适应偏好。Bench2Drive基准测试的实验表明,TakeVLA在闭环性能上达到了最先进的水平,驾驶得分比强基线SimLingo高出4.93,且安全边际得到了提升,平均碰撞时间(TTC)提高了11.76%。
cs.RO / 82 / 2603.14977

ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy

ReMAP-DP:用于扩散策略的重投影多视图对齐点图
Yang, Xinzhang, Wu, Renjun, Liu, Jinyan, Li, Xuesong
Abstract
Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizing standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the re-projected views with pixel-aligned PointMaps, our dual-stream architecture leverages learnable modality embeddings to fuse frozen semantic features and explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by +6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. Project page is available at: https://icr-lab.github.io/ReMAP-DP/
Chinese Translation
基于2D视觉表示的通用机器人策略在语义推理方面表现出色,但天然缺乏高精度任务所需的明确3D空间感知。现有的3D整合方法由于稀疏点云的结构不规则性以及多视图正交渲染引入的几何畸变,难以弥合这一差距。为了克服这些障碍,我们提出了ReMAP-DP,一个将标准化的透视重投影与结构感知的双流扩散策略相结合的新框架。通过将重投影的视图与像素对齐的点图(PointMaps)结合,我们的双流体系结构利用可学习的模态嵌入来融合冻结的语义特征和明确的几何描述符,确保精确的隐式补丁级别对齐。在模拟和现实环境中的大量实验表明,ReMAP-DP在多样的操纵任务中表现优越。在RoboTwin 2.0上,其平均成功率达到59.3%,比DP3基线提高了6.6%。在ManiSkill 3上,我们的方法在几何挑战性较大的堆叠立方体(Stack Cube)任务上相较于DP3提升了28%。此外,ReMAP-DP表现出卓越的现实世界鲁棒性,仅需少量演示即可实现高精度和动态操纵,且数据效率更高。项目页面可访问: https://icr-lab.github.io/ReMAP-DP/
cs.RO / 83 / 2603.15013

CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control

CycleRL:用于稳健自主自行车控制的仿真到现实深度强化学习
Liu, Gelu, Wang, Teng, Wu, Zhijie, Wu, Junliang, Li, Songyuan, Zhu, Xiangwei
Abstract
Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics; however, conventional control strategies often struggle with their underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, this paper presents CycleRL, the first sim-to-real deep reinforcement learning framework designed for robust autonomous bicycle control. Our approach trains an end-to-end neural control policy within the high-fidelity NVIDIA Isaac Sim environment, leveraging Proximal Policy Optimization (PPO) to circumvent the need for an explicit dynamics model. The framework features a composite reward function tailored for concurrent balance maintenance, velocity tracking, and steering control. Crucially, systematic domain randomization is employed to bridge the simulation-to-reality gap and facilitate direct transfer. In simulation, CycleRL achieves strong performance, including a 99.90% balance success rate, a low steering tracking error of 1.15°, and a velocity tracking error of 0.18 m/s. These quantitative results, coupled with successful hardware transfer, validate DRL as an effective paradigm for autonomous bicycle control, offering superior adaptability over traditional methods. Video demonstrations are available at https://anony6f05.github.io/CycleRL/.
Chinese Translation
自主自行车为城市出行和最后一公里物流提供了一种有前景的灵活解决方案,然而,传统控制策略常常难以应对其欠驱动的非线性动态,容易受到模型不匹配的影响,并且对现实世界的不确定性适应性有限。为了解决这一问题,本文提出了CycleRL,这是第一个为稳健自主自行车控制设计的仿真到现实深度强化学习框架。我们的方法在高保真度的NVIDIA Isaac Sim环境中训练端到端的神经控制策略,利用近端策略优化(Proximal Policy Optimization, PPO)来避免对显式动态模型的需求。该框架具有一个复合奖励函数,旨在同时维护平衡、跟踪速度和控制转向。重要的是,系统的领域随机化被用于缩小仿真与现实之间的差距,并促进直接转移。在仿真中,CycleRL实现了显著的性能,包括99.90%的平衡成功率、1.15°的低转向跟踪误差和0.18 m/s的速度跟踪误差。这些定量结果,加上成功的硬件转移,验证了深度强化学习(DRL)作为自主自行车控制的有效范式,提供了比传统方法更优越的适应性。视频演示可在 https://anony6f05.github.io/CycleRL/ 获取。
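To make the composite-reward idea above concrete, here is a minimal sketch of a balance/velocity/steering reward; the exponential shaping and the weights w_bal, w_vel, w_steer are illustrative assumptions, not CycleRL's published coefficients.

import numpy as np

# Illustrative composite reward: balance + velocity tracking + steering tracking.
# All weights and shaping constants below are assumptions, not the paper's values.
def composite_reward(roll, roll_rate, v, v_ref, steer, steer_ref,
                     w_bal=1.0, w_vel=0.5, w_steer=0.5):
    r_balance = np.exp(-(roll ** 2 + 0.1 * roll_rate ** 2))  # stay upright
    r_velocity = np.exp(-((v - v_ref) ** 2))                 # track target speed
    r_steer = np.exp(-((steer - steer_ref) ** 2))            # track steering angle
    return w_bal * r_balance + w_vel * r_velocity + w_steer * r_steer

r = composite_reward(roll=0.05, roll_rate=0.1, v=2.0, v_ref=2.2,
                     steer=0.02, steer_ref=0.0)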
cs.RO / 84 / 2603.15046

AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

AnoleVLA:用于移动操作的轻量级视觉-语言-动作模型与深度状态空间模型
Takagi, Yusuke, Kambara, Motonari, Yashima, Daichi, Seno, Koki, Tokura, Kento, Sugiura, Komei
Abstract
In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, which allows the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points for the task success rate while achieving an inference speed approximately three times faster.
Chinese Translation
在本研究中,我们解决了语言指导的机器人操作问题,其中机器人需要根据视觉观察和自然语言指令操作各种物体。这个任务对于在人工环境中工作的服务机器人至关重要,并且需要安全性、效率和任务级别的通用性。尽管视觉-语言-动作模型(VLA)在此任务中表现出色,但由于标准变换器骨干网络的计算成本,其在资源受限环境中的部署仍然具有挑战性。为了解决这一限制,我们提出了AnoleVLA,这是一种轻量级的VLA,利用深度状态空间模型高效处理多模态序列。该模型利用其轻量级和快速的序列状态建模来处理视觉和文本输入,从而使机器人能够高效生成轨迹。我们在仿真和实际实验中评估了所提出的方法。值得注意的是,在现实世界的评估中,AnoleVLA在任务成功率上比一个代表性的规模较大的VLA提高了21个百分点,同时实现了约三倍的推理速度。
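The efficiency claim rests on the recurrent form of state space models: one O(T) pass with a constant-size state, versus the quadratic cost of pairwise attention. A toy linear SSM scan under assumed shapes follows; the abstract does not give AnoleVLA's actual parameterization.

import numpy as np

def ssm_scan(u, A, B, C):
    # x[t+1] = A x[t] + B u[t]; y[t] = C x[t+1] -- one O(T) pass, constant memory.
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t
        ys.append(C @ x)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 8, 4, 3, 16
A = 0.9 * np.eye(d_state)                 # stable toy dynamics
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
y = ssm_scan(rng.normal(size=(T, d_in)), A, B, C)   # shape (16, 3)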
cs.RO / 85 / 2603.15066

Multi-Mode Pneumatic Artificial Muscles Driven by Hybrid Positive-Negative Pressure

混合正负压驱动的多模式气动人工肌肉
Feng, Siyuan, Feng, Ruoyu, Li, Shuguang
Abstract
Artificial muscles embody human aspirations for engineering lifelike robotic movements. This paper introduces an architecture for Inflatable Fluid-Driven Origami-Inspired Artificial Muscles (IN-FOAMs). A typical IN-FOAM consists of an inflatable skeleton enclosed within an outer skin, which can be driven using a combination of positive and negative pressures (e.g., compressed air and vacuum). IN-FOAMs are manufactured using low-cost heat-sealable sheet materials through heat-pressing and heat-sealing processes. Thus, they can be ultra-thin when not actuated, making them flexible, lightweight, and portable. The skeleton patterns are programmable, enabling a variety of motions, including contracting, bending, twisting, and rotating, based on specific skeleton designs. We conducted comprehensive experimental, theoretical, and numerical studies to investigate IN-FOAM's basic mechanical behavior and properties. The results show that IN-FOAM's output force and contraction can be tuned through multiple operation modes with the applied hybrid positive-negative pressure. Additionally, we propose multilayer skeleton structures to enhance the contraction ratio further, and we demonstrate a multi-channel skeleton approach that allows the integration of multiple motion modes into a single IN-FOAM. These findings indicate that IN-FOAMs hold great potential for future applications in flexible wearable devices and compact soft robotic systems.
Chinese Translation
人工肌肉体现了人类对工程仿生机器人运动的渴望。本文介绍了一种充气流体驱动的折纸启发式人工肌肉(IN-FOAMs)的结构。典型的IN-FOAM由一个充气骨架和外部皮肤组成,能够通过正负压力的组合(例如,压缩空气和真空)进行驱动。IN-FOAM采用低成本的热封合薄膜材料,通过热压和热封工艺制造。因此,在未激活时,它们可以超薄,使其灵活、轻便且便于携带。骨架图案是可编程的,能够基于特定的骨架设计实现多种运动,包括收缩、弯曲、扭转和旋转。我们进行了全面的实验、理论和数值研究,以探讨IN-FOAM的基本机械行为和特性。结果表明,IN-FOAM的输出力和收缩可以通过施加的混合正负压力在多种操作模式下进行调节。此外,我们提出了多层骨架结构,以进一步增强收缩比,并展示了一种多通道骨架方法,允许将多种运动模式集成到单个IN-FOAM中。这些发现表明,IN-FOAM在未来的柔性可穿戴设备和紧凑型软机器人系统中具有巨大的应用潜力。
cs.RO / 86 / 2603.15084

HALO: Closing Sim-to-Real Gap for Heavy-loaded Humanoid Agile Motion Skills via Differentiable Simulation

HALO:通过可微分仿真缩小重载人形机器人灵活运动技能的仿真与现实差距
Wang, Xingyi, Zhang, Chenyun, Xie, Weiji, Yu, Chao, Song, Wei, Bai, Chenjia, Zhu, Shiqiang
Abstract
Humanoid robots deployed in real-world scenarios often need to carry unknown payloads, which introduce significant mismatch and degrade the effectiveness of simulation-to-reality reinforcement learning methods. To address this challenge, we propose a two-stage gradient-based system identification framework built on the differentiable simulator MuJoCo XLA. The first stage calibrates the nominal robot model using real-world data to reduce intrinsic sim-to-real discrepancies, while the second stage further identifies the mass distribution of the unknown payload. By explicitly reducing structured model bias prior to policy training, our approach enables zero-shot transfer of reinforcement learning policies to hardware under heavy-load conditions. Extensive simulation and real-world experiments demonstrate more precise parameter identification, improved motion tracking accuracy, and substantially enhanced agility and robustness compared to existing baselines. Project Page: https://mwondering.github.io/halo-humanoid/
Chinese Translation
在现实场景中部署的人形机器人通常需要携带未知的负载,这会导致显著的不匹配并降低仿真到现实强化学习方法的有效性。为了解决这一挑战,我们提出了一种基于梯度的两阶段系统识别框架,该框架建立在可微分仿真器 MuJoCo XLA 之上。第一阶段利用真实世界数据校准名义机器人模型,以减少内在的仿真与现实差异,而第二阶段则进一步识别未知负载的质量分布。通过在策略训练之前显式减少结构化模型偏差,我们的方法能够在重载条件下实现强化学习策略的零样本迁移到硬件。大量的仿真和现实世界实验表明,与现有基线相比,我们的方法在参数识别精度、运动跟踪准确性以及敏捷性和鲁棒性方面都有显著提升。项目页面:https://mwondering.github.io/halo-humanoid/
cs.RO / 87 / 2603.15097

AeroGrab: A Unified Framework for Aerial Grasping in Cluttered Environments

AeroGrab:一种用于杂乱环境中空中抓取的统一框架
Singh, Shivansh Pratap, Nair, Naveen Sudheer, Ujjawal, Samaksh, Mishra, Sarthak, Patil, Soham, Yadav, Rishabh Dev, Roy, Spandan
Abstract
Reliable aerial grasping in cluttered environments remains challenging due to occlusions and collision risks. Existing aerial manipulation pipelines largely rely on centroid-based grasping and lack integration between the grasp pose generation models, active exploration, and language-level task specification, resulting in the absence of a complete end-to-end system. In this work, we present an integrated pipeline for reliable aerial grasping in cluttered environments. Given a scene and a language instruction, the system identifies the target object and actively explores it to gain better views of the object. During exploration, a grasp generation network predicts multiple 6-DoF grasp candidates for each view. Each candidate is evaluated using a collision-aware feasibility framework, and the overall best grasp is selected and executed using standard trajectory generation and control methods. Experiments in cluttered real-world scenarios demonstrate robust and reliable grasp execution, highlighting the effectiveness of combining active perception with feasibility-aware grasp selection for aerial manipulation.
Chinese Translation
在杂乱环境中实现可靠的空中抓取仍然面临挑战,主要由于遮挡和碰撞风险。现有的空中操作流程在很大程度上依赖于基于质心的抓取,并且缺乏抓取姿态生成模型、主动探索和语言层面任务规范之间的整合,导致缺乏完整的端到端系统。在本研究中,我们提出了一种集成管道,以实现杂乱环境中可靠的空中抓取。给定一个场景和语言指令,系统识别目标物体并主动探索,以获得更好的物体视图。在探索过程中,抓取生成网络为每个视图预测多个六自由度(6-DoF)抓取候选。每个候选使用考虑碰撞的可行性框架进行评估,并选择整体最佳抓取,通过标准轨迹生成和控制方法执行。在杂乱的真实场景中的实验展示了稳健且可靠的抓取执行,突显了将主动感知与考虑可行性的抓取选择相结合在空中操作中的有效性。
cs.RO / 88 / 2603.15108

BodyGuards: Escorting by Multiple Robots in Unknown Environment under Limited Communication

BodyGuards:在有限通信条件下多机器人在未知环境中的护送
Tian, Zhuoli, Bao, Yanze, Guo, Meng
Abstract
Multi-robot systems are increasingly deployed in high-risk missions such as reconnaissance, disaster response, and subterranean operations. Protecting a human operator while navigating unknown and adversarial environments remains a critical challenge, especially when communication between the operator and the robots is restricted. Unlike existing collaborative exploration methods that aim for complete coverage, this work focuses on task-oriented exploration to minimize the navigation time of the operator to reach its goal while ensuring safety under adversarial threats. A novel escorting framework, BodyGuards, is proposed to seamlessly integrate collaborative exploration, robot-operator communication, and escorting. The framework consists of three core components: (I) a dynamic movement strategy for the operator that maintains a local map with risk zones for proactive path planning; (II) a dual-mode robotic strategy combining frontier-based exploration with optimized return events to balance exploration, threat detection, and intermittent communication; and (III) multi-robot coordination protocols that jointly plan exploration and information sharing for efficient escorting. Extensive human-in-the-loop simulations and hardware experiments demonstrate that the method significantly reduces operator risk and mission time, outperforming baselines in adversarial and constrained environments.
Chinese Translation
多机器人系统越来越多地被部署在高风险任务中,如侦察、灾难响应和地下作业。在未知和对抗性环境中保护人类操作员的安全仍然是一个关键挑战,尤其是在操作员与机器人之间的通信受到限制的情况下。与现有的旨在实现全面覆盖的协作探索方法不同,本研究关注于任务导向的探索,以最小化操作员到达目标的导航时间,同时确保在对抗威胁下的安全性。我们提出了一种新颖的护送框架 BodyGuards,旨在无缝集成协作探索、机器人与操作员之间的通信和护送。该框架由三个核心组件组成:(I)一种动态移动策略,为操作员维护带有风险区域的局部地图,以进行主动路径规划;(II)一种双模式机器人策略,结合基于前沿的探索与优化的返回事件,以平衡探索、威胁检测和间歇性通信;(III)多机器人协调协议,共同规划探索和信息共享,以实现高效护送。大量的人机交互仿真和硬件实验表明,该方法显著降低了操作员的风险和任务时间,在对抗性和受限环境中优于基线方法。
cs.RO / 89 / 2603.15126

A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements

一种新颖的相机与机器人标定方法用于基于视觉的地面测量
Rudolph, Jan Andre, Haitz, Dennis, Ulrich, Markus
Abstract
A novel hand-eye calibration method for ground-observing mobile robots is proposed. While cameras on mobile robots are common, they are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. A referencing plate is designed to combine the two measurement modalities of laser-tracker 3D metrology and camera-based 2D imaging. It incorporates reflector nests for pose acquisition using a laser tracker and a camera calibration target that is observed by the robot-mounted camera. The procedure comprises estimating the plate pose, the plate-camera pose, and the robot pose, followed by computing the robot-camera transformation. Experiments indicate sub-millimeter repeatability.
Chinese Translation
提出了一种新颖的手眼标定方法,适用于地面观察移动机器人。尽管移动机器人上常配备相机,但它们很少用于地面观察测量任务。激光跟踪器在机器人领域越来越多地用于精确定位。设计了一种参考板,以结合激光跟踪器三维计量和基于相机的二维成像这两种测量模式。该参考板包含用于激光跟踪器姿态获取的反射器窝和一个由机器人安装的相机观察的相机标定靶。该过程包括估计参考板姿态、参考板与相机的姿态以及机器人姿态,随后计算机器人与相机之间的变换。实验结果表明,其重复性达到亚毫米级。
cs.RO / 90 / 2603.15134

Confusion-Aware In-Context-Learning for Vision-Language Models in Robotic Manipulation

面向混淆的上下文学习在机器人操作中的视觉-语言模型
He, Yayun, Kang, Zuheng, Zhao, Botao, Wu, Zhouyin, Peng, Junqing, Wang, Jianzong
Abstract
Vision-language models (VLMs) have significantly improved the generalization capabilities of robotic manipulation. However, VLM-based systems often suffer from a lack of robustness, leading to unpredictable errors, particularly in scenarios involving confusable objects. Our preliminary analysis reveals that these failures are mainly caused by the shortcut learning problem inherent in VLMs, which limits their ability to accurately distinguish between confusable features. To this end, we propose Confusion-Aware In-Context Learning (CAICL), a method that enhances VLM performance in confusable scenarios for robotic manipulation. The approach begins with confusion localization and analysis, identifying potential sources of confusion. This information is then used as a prompt for the VLM to focus on features most likely to cause misidentification. Extensive experiments on VIMA-Bench show that CAICL effectively addresses the shortcut learning issue, achieving an 85.5% success rate and showing good stability across tasks with different degrees of generalization.
Chinese Translation
视觉-语言模型(VLMs)显著提升了机器人操作的泛化能力。然而,基于VLM的系统往往缺乏鲁棒性,导致不可预测的错误,特别是在涉及混淆对象的场景中。我们的初步分析表明,这些失败主要是由于VLM中固有的捷径学习问题,限制了它们准确区分混淆特征的能力。为此,我们提出了面向混淆的上下文学习(Confusion-Aware In-Context Learning, CAICL),一种增强VLM在混淆场景中机器人操作性能的方法。该方法首先进行混淆定位和分析,识别潜在的混淆源。然后,这些信息被用作VLM的提示,以关注最可能导致错误识别的特征。在VIMA-Bench上的大量实验表明,CAICL有效解决了捷径学习问题,成功率达到85.5%,并在不同泛化程度的任务中表现出良好的稳定性。
cs.RO / 91 / 2603.15152

Master Micro Residual Correction with Adaptive Tactile Fusion and Force-Mixed Control for Contact-Rich Manipulation

自适应触觉融合与力混合控制下的接触丰富操作中的主微残差校正
Li, Xingting, Xie, Yifan, Liu, Han, Hou, Wei, Chen, Guangyu, Li, Shoujie, Ding, Wenbo
Abstract
Robotic contact-rich and fine-grained manipulation remains a significant challenge due to complex interaction dynamics and the competing requirements of multi-timescale control. While current visual imitation learning methods excel at long-horizon planning, they often fail to perceive critical interaction cues like friction variations or incipient slip, and struggle to balance global task coherence with local reactive feedback. To address these challenges, we propose M2-ResiPolicy, a novel Master-Micro residual control architecture that synergizes high-level action guidance with low-level correction. The framework consists of a Master-Guidance Policy (MGP) operating at 10 Hz, which generates temporally consistent action chunks via a diffusion-based backbone and employs a tactile-intensity-driven adaptive fusion mechanism to dynamically modulate perceptual weights between vision and touch. Simultaneously, a high-frequency (60 Hz) Micro-Residual Corrector (MRC) utilizes a lightweight GRU to provide real-time action compensation based on TCP wrench feedback. This policy is further integrated with a force-mixed PBIC execution layer, effectively regulating contact forces to ensure interaction safety. Experiments across several demanding tasks, including fragile object grasping and precision insertion, demonstrate that M2-ResiPolicy significantly outperforms the standard Diffusion Policy (DP) and the state-of-the-art Reactive Diffusion Policy (RDP), achieving a 93% damage-free success rate in chip grasping and superior force regulation stability.
Chinese Translation
由于复杂的交互动态和多时间尺度控制的竞争要求,机器人在接触丰富和精细操作方面仍然面临重大挑战。虽然当前的视觉模仿学习方法在长时间规划方面表现出色,但它们往往无法感知关键的交互线索,如摩擦变化或初始滑动,并且在全球任务一致性与局部反应反馈之间难以取得平衡。为了解决这些挑战,我们提出了M2-ResiPolicy,一种新颖的主微残差控制架构,协同高层次的动作指导与低层次的校正。该框架由一个以10 Hz运行的主指导策略(MGP)组成,通过基于扩散的骨干网络生成时间一致的动作块,并采用触觉强度驱动的自适应融合机制动态调节视觉与触觉之间的感知权重。同时,一个高频(60 Hz)的微残差校正器(MRC)利用轻量级的GRU根据TCP扭矩反馈提供实时动作补偿。该策略进一步与力混合的PBIC执行层集成,有效调节接触力以确保交互安全。在包括脆弱物体抓取和精确插入等多个高难度任务中的实验表明,M2-ResiPolicy显著优于标准扩散策略(DP)和最先进的反应扩散策略(RDP),在芯片抓取中实现了93%的无损成功率和卓越的力调节稳定性。
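The dual-rate structure is the main architectural point: a 10 Hz chunk generator wrapped by a 60 Hz residual loop. A schematic sketch with stub policies follows; the real MGP and MRC are learned networks, and the shapes and gain below are assumptions.

import numpy as np

def master_policy(obs):
    # Stub for the 10 Hz diffusion-based MGP: a chunk of six Cartesian actions,
    # one per 60 Hz tick of the coming 0.1 s window.
    return np.zeros((6, 6))

def micro_corrector(action, wrench, gain=0.05):
    # Stub for the 60 Hz GRU-based MRC: a small wrench-driven residual.
    return action - gain * wrench

for k in range(10):                      # one second of control at 10 Hz
    chunk = master_policy(obs=None)
    for a in chunk:                      # 60 Hz inner loop
        wrench = np.zeros(6)             # would come from TCP force/torque sensing
        a_exec = micro_corrector(a, wrench)
        # a_exec would be handed to the force-mixed impedance (PBIC) layer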
cs.RO / 92 / 2603.15169

ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

ForceVLA2:释放具有力感知的混合力-位置控制以实现接触丰富的操作
Li, Yang, Zhaxizhuoma, Jiang, Hongru, Xia, Junjie, Zhang, Hongquan, Du, Jinda, Zhou, Yunsong, Zeng, Jia, Hao, Ce, Ren, Jieji, Yu, Qiaojun, Lu, Cewu, Qiao, Yu, Pang, Jiangmiao
Abstract
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming pi0 and pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs. The project page is available at https://sites.google.com/view/force-vla2/home.
Chinese Translation
在接触丰富的操作中,具身智能主要依赖于位置控制,而对交互力的明确感知和调节仍然未得到充分探索,这限制了在现实任务中的稳定性、精确性和鲁棒性。我们提出了ForceVLA2,一个端到端的视觉-语言-动作框架,使机器人具备混合力-位置控制和明确的力感知。ForceVLA2将基于力的提示引入VLM专家,以构建跨阶段的力感知任务概念,并在动作专家中采用跨尺度混合专家(Cross-Scale Mixture-of-Experts, MoE)来自适应地将这些概念与实时交互力融合,实现闭环混合力-位置调节。为了支持学习和评估,我们构建了ForceVLA2-Dataset,包含1000条轨迹,涵盖5个接触丰富的任务,包括擦拭、按压和组装,配有多视角图像、任务提示、自我感知状态和力信号。大量实验表明,ForceVLA2显著提高了接触丰富操作的成功率和可靠性,在5个任务中分别比pi0和pi0.5提高了48.0%和35.0%,并缓解了常见的失败模式,如臂部过载和不稳定接触,从而积极推动了VLAs中的力感知交互物理智能。项目页面可访问:https://sites.google.com/view/force-vla2/home。
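For readers new to the term, hybrid force-position control conventionally refers to a selection-matrix law of the following textbook form (not necessarily ForceVLA2's exact execution layer):

\tau = J^{\top}(q)\left[\, S\, f_d + (I - S)\big(K_p (x_d - x) + K_d (\dot{x}_d - \dot{x})\big)\right]

where the diagonal selection matrix S picks the Cartesian axes under force control and I - S those under position control.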
cs.RO / 93 / 2603.15179

KiRAS: Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning in Quadruped Robots

KiRAS:基于关键帧引导的自我模仿用于四足机器人鲁棒且自适应的技能学习
Wei, Xiaoyi, Zhai, Peng, Tu, Jiaxin, Zhang, Yueqi, Li, Yuqi, Zhang, Zonghao, Zhou, Hu, Zhang, Lihua
Abstract
With advances in reinforcement learning and imitation learning, quadruped robots can acquire diverse skills within a single policy by imitating multiple skill-specific datasets. However, the lack of datasets on complex terrains limits the ability of such multi-skill policies to generalize effectively in unstructured environments. Inspired by animation, we adopt keyframes as minimal and universal skill representations, relaxing dataset constraints and enabling the integration of terrain adaptability with skill diversity. We propose Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning (KiRAS), an end-to-end framework for acquiring and transitioning between diverse skill primitives on complex terrains. KiRAS first learns diverse skills on flat terrain through keyframe-guided self-imitation, eliminating the need for expert datasets; then continues training the same policy network on rough terrains to enhance robustness. To eliminate catastrophic forgetting, a proficiency-based Skill Initialization Technique is introduced. Experiments on Solo-8 and Unitree Go1 robots show that KiRAS enables robust skill acquisition and smooth transitions across challenging terrains. This framework demonstrates its potential as a lightweight platform for multi-skill generation and dataset collection. It further enables flexible skill transitions that enhance locomotion on challenging terrains.
Chinese Translation
随着强化学习和模仿学习的进展,四足机器人能够通过模仿多个特定技能的数据集在单一策略下获得多样化的技能。然而,复杂地形上缺乏数据集限制了这种多技能策略在非结构化环境中有效泛化的能力。受到动画的启发,我们采用关键帧作为最小且通用的技能表示,放宽了数据集的限制,并使技能多样性与地形适应性得以整合。我们提出了基于关键帧引导的自我模仿鲁棒且自适应技能学习框架(KiRAS),这是一个端到端的框架,用于在复杂地形上获取和转换多样化的技能原语。KiRAS首先通过关键帧引导的自我模仿在平坦地形上学习多样化的技能,消除了对专家数据集的需求;然后在粗糙地形上继续训练相同的策略网络以增强鲁棒性。为消除灾难性遗忘,提出了一种基于熟练度的技能初始化技术。在Solo-8和Unitree Go1机器人上的实验表明,KiRAS能够实现鲁棒的技能获取和在具有挑战性的地形上的平滑过渡。该框架展示了其作为多技能生成和数据集收集的轻量级平台的潜力,并进一步实现了灵活的技能过渡,增强了在具有挑战性地形上的运动能力。
cs.RO / 94 / 2603.15185

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

可扩展和稳健学习在端到端驾驶规划中的关键因素是什么?
Holtz, David, Hanselmann, Niklas, Doll, Simon, Cordts, Marius, Schiele, Bernt
Abstract
End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also on intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop evaluation often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves a 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/
Chinese Translation
端到端自主驾驶因其在交互场景中学习稳健行为的潜力以及与数据的扩展性而受到广泛关注。流行的架构通常基于分离的感知和规划模块,通过潜在表示(如鸟瞰图特征网格)连接,以保持端到端的可微分性。这一范式主要在开环数据集上发展,评估不仅关注驾驶性能,还包括中间感知任务。不幸的是,在开环中表现优异的架构进展往往无法转化为稳健闭环驾驶的可扩展学习。在本文中,我们系统地重新审视了常见架构模式对闭环性能的影响:(1)高分辨率感知表示,(2)解耦轨迹表示,以及(3)生成规划。关键是,我们的分析评估了这些模式的综合影响,揭示了意想不到的局限性以及未被充分探索的协同效应。在这些见解的基础上,我们提出了 BevAD,一种新颖的轻量级和高度可扩展的端到端驾驶架构。BevAD 在 Bench2Drive 基准测试中达到了 72.7% 的成功率,并展示了使用纯模仿学习的强数据扩展行为。我们的代码和模型可在此公开获取:https://dmholtz.github.io/bevad/
cs.RO / 95 / 2603.15186

NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation

NavGSim:用于大规模导航的高保真高斯点云模拟器
Liu, Jiahang, Duan, Yuanxing, Zhang, Jiazhao, Li, Minghan, Wang, Shaoan, Zhang, Zhizheng, Wang, He
Abstract
Simulating realistic environments for robots is widely recognized as a critical challenge in robot learning, particularly in terms of rendering and physical simulation. This challenge becomes even more pronounced in navigation tasks, where trajectories often extend across multiple rooms or entire floors. In this work, we present NavGSim, a Gaussian Splatting-based simulator designed to generate high-fidelity, large-scale navigation environments. Built upon a hierarchical 3D Gaussian Splatting framework, NavGSim enables photorealistic rendering in expansive scenes spanning hundreds of square meters. To simulate navigation collisions, we introduce a Gaussian Splatting-based slice technique that directly extracts navigable areas from reconstructed Gaussians. Additionally, for ease of use, we provide comprehensive NavGSim APIs supporting multi-GPU development, including tools for custom scene reconstruction, robot configuration, policy training, and evaluation. To evaluate NavGSim's effectiveness, we train a Vision-Language-Action (VLA) model using trajectories collected from NavGSim and assess its performance in both simulated and real-world environments. Our results demonstrate that NavGSim significantly enhances the VLA model's scene understanding, enabling the policy to handle diverse navigation queries effectively.
Chinese Translation
为机器人模拟现实环境被广泛认为是机器人学习中的一项关键挑战,特别是在渲染和物理模拟方面。在导航任务中,这一挑战尤为突出,因为轨迹通常跨越多个房间或整个楼层。在本研究中,我们提出了NavGSim,这是一种基于高斯点云的模拟器,旨在生成高保真、大规模的导航环境。NavGSim建立在一个分层的3D高斯点云框架之上,能够在覆盖数百平方米的广阔场景中实现照片级真实感渲染。为了模拟导航碰撞,我们引入了一种基于高斯点云的切片技术,直接从重建的高斯中提取可导航区域。此外,为了便于使用,我们提供了全面的NavGSim API,支持多GPU开发,包括自定义场景重建、机器人配置、策略训练和评估的工具。为了评估NavGSim的有效性,我们使用从NavGSim收集的轨迹训练了一个视觉-语言-动作(Vision-Language-Action, VLA)模型,并在模拟和真实环境中评估其性能。我们的结果表明,NavGSim显著增强了VLA模型的场景理解能力,使得该策略能够有效处理多样化的导航查询。
cs.RO / 96 / 2603.15223

Coupled Particle Filters for Robust Affordance Estimation

耦合粒子滤波器用于鲁棒的可供性估计
Lowin, Patrick, Mengers, Vito, Brock, Oliver
Abstract
Robotic affordance estimation is challenging due to visual, geometric, and semantic ambiguities in sensory input. We propose a method that disambiguates these signals using two coupled recursive estimators for sub-aspects of affordances: graspable and movable regions. Each estimator encodes property-specific regularities to reduce uncertainty, while their coupling enables bidirectional information exchange that focuses attention on regions where both agree, i.e., affordances. Evaluated on a real-world dataset, our method outperforms three recent affordance estimators (Where2Act, Hands-as-Probes, and HRP) by 308%, 245%, and 257% in precision, and remains robust under challenging conditions such as low light or cluttered environments. Furthermore, our method achieves a 70% success rate in our real-world evaluation. These results demonstrate that coupling complementary estimators yields precise, robust, and embodiment-appropriate affordance predictions.
Chinese Translation
机器人可供性估计因传感器输入中的视觉、几何和语义模糊而具有挑战性。我们提出了一种方法,通过两个耦合的递归估计器来消除这些信号的歧义,分别针对可抓取和可移动区域的子方面。每个估计器编码特定属性的规律性,以减少不确定性,而它们的耦合则使双向信息交换成为可能,集中注意力于双方一致的区域,即可供性。在一个真实世界的数据集上进行评估,我们的方法在精度上分别比三个最近的可供性估计器(Where2Act、Hands-as-Probes 和 HRP)提高了308%、245%和257%,并且在低光照或杂乱环境等挑战性条件下保持鲁棒性。此外,我们的方法在真实世界评估中达到了70%的成功率。这些结果表明,耦合互补的估计器能够产生精确、鲁棒且适合体现的可供性预测。
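The coupling idea can be illustrated on a toy 1-D problem: each recursive estimator performs a Bayes update with its own likelihood, then borrows mass from the other's posterior, so both concentrate where graspable and movable evidence agree. The grid, the synthetic likelihoods, and the coupling exponent alpha are assumptions for illustration only.

import numpy as np

grid = np.linspace(0.0, 1.0, 50)               # shared 1-D stand-in for image regions
p_grasp = np.full(grid.size, 1.0 / grid.size)  # belief over graspable regions
p_move = np.full(grid.size, 1.0 / grid.size)   # belief over movable regions

def coupled_update(belief, likelihood, other, alpha=0.5):
    # Bayes update times a coupling term that boosts regions the other
    # estimator also believes in; alpha sets the coupling strength.
    b = belief * likelihood * (other ** alpha)
    return b / b.sum()

for _ in range(10):
    lik_g = np.exp(-((grid - 0.60) ** 2) / 0.02)   # synthetic grasp evidence
    lik_m = np.exp(-((grid - 0.62) ** 2) / 0.02)   # synthetic motion evidence
    p_grasp, p_move = (coupled_update(p_grasp, lik_g, p_move),
                       coupled_update(p_move, lik_m, p_grasp))
# Both posteriors now peak near 0.6, the region where the sub-affordances agree.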
cs.RO / 97 / 2603.15254

A Methodology for Dynamic Parameters Identification of 3-DOF Parallel Robots in Terms of Relevant Parameters

基于相关参数的三自由度并联机器人动态参数识别方法
Díaz-Rodríguez, Miguel, Mata, Vicente, Valera, Angel, Page, Alvaro
Abstract
The identification of dynamic parameters in mechanical systems is important for improving model-based control as well as for performing realistic dynamic simulations. Generally, when identification techniques are applied, only a subset of so-called base parameters can be identified. Moreover, some of these parameters cannot be identified properly because they contribute little to the robot dynamics; hence, in the presence of measurement noise and modeling discrepancies, their identifiability degrades. For this reason, a strategy for dynamic parameter identification of fully parallel robots in terms of a subset called relevant parameters is put forward. The proposed methodology starts from a full dynamic model and then applies simplifications based on the geometry of each link and on the symmetry arising from the legs of fully parallel robots. After that, the identification is done by Weighted Least Squares. Then, with statistical considerations, the model is reduced until the physical feasibility conditions are met. The proposed strategy has been experimentally tested on two different configurations of actual 3-DOF parallel robots. The responses of the inverse and forward dynamics of the identified models agree with experiments. In order to evaluate the forward dynamics response, an approach for obtaining the forward dynamics in terms of the relevant parameters is also proposed.
Chinese Translation
机械系统中动态参数的识别对于提升基于模型的控制以及进行真实的动态仿真至关重要。通常,当应用识别技术时,仅能识别所谓的基础参数的一个子集。更有甚者,由于某些参数对机器人动态的贡献较小,因此在测量噪声和建模差异的影响下,这些参数的可识别性会降低。基于此,提出了一种针对完全并联机器人动态参数识别的策略,该策略关注于一个称为相关参数的子集。所提出方法的目标是从完整的动态模型出发,然后对每个连杆的几何形状以及完全并联机器人腿部的对称性进行简化。随后,通过加权最小二乘法进行参数识别。接着,基于统计考虑对模型进行简化,直到满足物理可行性条件。该策略的应用已在两种不同配置的实际三自由度并联机器人上进行了实验测试。识别模型的逆向和正向动态响应与实验结果一致。为了评估正向动态响应,还提出了一种基于相关参数获得正向动态的方法。
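Since rigid-body dynamics are linear in the inertial parameters, the Weighted Least Squares step mentioned in the abstract has a standard closed form (generic notation, not the paper's own):

\tau = \Phi(q, \dot{q}, \ddot{q})\,\theta, \qquad \hat{\theta} = \left(\Phi^{\top} W \Phi\right)^{-1} \Phi^{\top} W\, \tau

with W typically chosen as the inverse covariance of the torque measurements; statistically insignificant components of \hat{\theta} are then pruned until the physical feasibility conditions hold.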
cs.RO / 98 / 2603.15257

HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing

HapticVLA:通过视觉-语言-动作模型进行接触丰富的操作,无需推理时的触觉传感
Gubernatorov, Konstantin, Sannikov, Mikhail, Mikhalchuk, Ilya, Kuznetsov, Egor, Artemov, Makar, Ouwatobi, Ogunwoye Faith, Fernando, Marcelino, Asanov, Artem, Guo, Ziang, Tsetserukou, Dzmitry
Abstract
Tactile sensing is a crucial capability for Vision-Language-Action (VLA) architectures, as it enables dexterous and safe manipulation in contact-rich tasks. However, reliance on dedicated tactile hardware increases cost and reduces reproducibility across robotic platforms. We argue that tactile-aware manipulation can be learned offline and deployed without direct haptic feedback at inference. To this end, we present HapticVLA, which proceeds in two tightly coupled stages: Safety-Aware Reward-Weighted Flow Matching (SA-RWFM) and Tactile Distillation (TD). SA-RWFM trains a flow-matching action expert that incorporates precomputed, safety-aware tactile rewards penalizing excessive grasping force and suboptimal grasping trajectories. TD further transfers this tactile-aware capability into a conventional VLA: we distill a compact tactile token from the SA-RWFM teacher and train a student VLA to predict that token from vision and state modalities, enabling tactile-aware action generation at inference without requiring on-board tactile sensors. This design preserves contact-rich tactile-aware reasoning within VLA while removing the need for on-board tactile sensors during deployment. On real-world experiments, HapticVLA achieves a mean success rate of 86.7%, consistently outperforming baseline VLAs - including versions provided with direct tactile feedback during inference.
Chinese Translation
触觉感知是视觉-语言-动作(VLA)架构的重要能力,因为它使得在接触丰富的任务中进行灵活且安全的操作成为可能。然而,依赖专用的触觉硬件增加了成本,并降低了在机器人平台之间的可重复性。我们认为,触觉感知的操作可以通过离线学习并在推理时无需直接的触觉反馈进行部署。为此,我们提出了HapticVLA,该模型包含两个紧密结合的阶段:安全感知的奖励加权流匹配(SA-RWFM)和触觉蒸馏(TD)。SA-RWFM训练一个流匹配的操作专家,该专家结合了预先计算的安全感知触觉奖励,惩罚过大的抓握力和次优的抓握轨迹。TD进一步将这种触觉感知能力转移到常规的VLA中:我们从SA-RWFM教师中提炼出一个紧凑的触觉标记,并训练一个学生VLA从视觉和状态模态预测该标记,从而在推理时实现触觉感知的动作生成,而无需机载触觉传感器。该设计保留了VLA中的接触丰富触觉感知推理,同时去除了在部署过程中的机载触觉传感器的需求。在真实世界的实验中,HapticVLA达到了86.7%的平均成功率,始终优于基线VLA,包括在推理过程中提供直接触觉反馈的版本。
cs.RO / 99 / 2603.15265

MoE-ACT: Scaling Multi-Task Bimanual Manipulation with Sparse Language-Conditioned Mixture-of-Experts Transformers

MoE-ACT:通过稀疏语言条件混合专家变换器扩展多任务双手操作
Guo, Kangjun, Liu, Haichao, Sun, Yanji, Zhao, Ruhan, Zhou, Jinni, Ma, Jun
Abstract
The ability of robots to handle multiple tasks under a unified policy is critical for deploying embodied intelligence in real-world household and industrial applications. However, out-of-distribution variation across tasks often causes severe task interference and negative transfer when training general robotic policies. To address this challenge, we propose a lightweight multi-task imitation learning framework for bimanual manipulation, termed Mixture-of-Experts-Enhanced Action Chunking Transformer (MoE-ACT), which integrates sparse Mixture-of-Experts (MoE) modules into the Transformer encoder of ACT. The MoE layer decomposes a unified task policy into independently invoked expert components. Through adaptive activation, it naturally decouples multi-task action distributions in latent space. During decoding, Feature-wise Linear Modulation (FiLM) dynamically modulates action tokens to improve consistency between action generation and task instructions. In parallel, multi-scale cross-attention enables the policy to simultaneously focus on both low-level and high-level semantic features, providing rich visual information for robotic manipulation. We further incorporate textual information, transitioning the framework from a purely vision-based model to a vision-centric, language-conditioned action generation system. Experimental validation in both simulation and a real-world dual-arm setup shows that MoE-ACT substantially improves multi-task performance. Specifically, MoE-ACT outperforms vanilla ACT by an average of 33% in success rate. These results indicate that MoE-ACT provides stronger robustness and generalization in complex multi-task bimanual manipulation environments. Our open-source project page can be found at https://j3k7.github.io/MoE-ACT/.
Chinese Translation
机器人在统一策略下处理多任务的能力对于在现实世界的家庭和工业应用中部署具身智能至关重要。然而,任务之间的分布外变异常常导致严重的任务干扰和负迁移,从而在训练通用机器人策略时造成困难。为了解决这一挑战,我们提出了一种轻量级的多任务模仿学习框架,用于双手操作,称为混合专家增强动作分块变换器(Mixture-of-Experts-Enhanced Action Chunking Transformer,MoE-ACT),该框架将稀疏混合专家(Mixture-of-Experts,MoE)模块集成到ACT的变换器编码器中。MoE层将统一的任务策略分解为独立调用的专家组件。通过自适应激活,它自然地在潜在空间中解耦多任务动作分布。在解码过程中,特征线性调制(Feature-wise Linear Modulation,FiLM)动态调节动作标记,以提高动作生成与任务指令之间的一致性。同时,多尺度交叉注意力使策略能够同时关注低级和高级语义特征,为机器人操作提供丰富的视觉信息。我们进一步结合文本信息,将框架从纯视觉模型过渡到以视觉为中心的语言条件动作生成系统。在模拟和真实双臂设置中的实验验证表明,MoE-ACT显著提高了多任务性能。具体而言,MoE-ACT的成功率平均比普通ACT提高了33%。这些结果表明,MoE-ACT在复杂的多任务双手操作环境中提供了更强的鲁棒性和泛化能力。我们的开源项目页面可以在 https://j3k7.github.io/MoE-ACT/ 找到。
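The two mechanisms the abstract names, sparse top-k expert routing and FiLM, are compact enough to sketch; the expert count, dimensions, and k below are illustrative rather than MoE-ACT's configuration.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparse_moe(x, experts, router_w, k=2):
    # Route token x to its top-k experts; the remaining experts stay inactive.
    logits = router_w @ x
    top = np.argsort(logits)[-k:]
    gates = softmax(logits[top])
    return sum(g * experts[i](x) for g, i in zip(gates, top))

def film(tokens, gamma, beta):
    # Feature-wise linear modulation by (language-derived) gamma and beta.
    return gamma * tokens + beta

rng = np.random.default_rng(0)
d, n_experts = 16, 4
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
y = film(sparse_moe(rng.normal(size=d), experts, router_w),
         gamma=rng.normal(size=d), beta=rng.normal(size=d))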
cs.RO / 100 / 2603.15281

GNIO: Gated Neural Inertial Odometry

GNIO:门控神经惯性里程计
Feng, Dapeng, Yin, Yizhen, Chen, Zhiqiang, Qi, Yuhua, Chen, Hongbo
Abstract
Inertial navigation using low-cost MEMS sensors is plagued by rapid drift due to sensor noise and bias instability. While recent data-driven approaches have made significant strides, they often struggle with micro-drifts during stationarity and mode fusion during complex motion transitions due to their reliance on fixed-window regression. In this work, we introduce Gated Neural Inertial Odometry (GNIO), a novel learning-based framework that explicitly models motion validity and context. We propose two key architectural innovations: (1) a learnable Motion Bank that queries a global dictionary of motion patterns to provide semantic context beyond the local receptive field, and (2) a Gated Prediction Head that decomposes displacement into magnitude and direction. This gating mechanism acts as a soft, differentiable Zero-Velocity Update (ZUPT), dynamically suppressing sensor noise during stationary periods while scaling predictions during dynamic motion. Extensive experiments across four public benchmarks demonstrate that GNIO significantly reduces position drift compared to state-of-the-art CNN and Transformer-based baselines. Notably, GNIO achieves a 60.21% reduction in trajectory error on the OxIOD dataset and exhibits superior generalization in challenging scenarios involving frequent stops and irregular motion speeds.
Chinese Translation
使用低成本MEMS传感器进行惯性导航受到传感器噪声和偏差不稳定性导致的快速漂移困扰。尽管最近的数据驱动方法取得了显著进展,但由于依赖固定窗口回归,它们在静止状态下常常难以处理微漂移,并且在复杂运动过渡中难以进行模式融合。在本研究中,我们提出了门控神经惯性里程计(GNIO),这是一个新颖的基于学习的框架,明确建模运动有效性和上下文。我们提出了两个关键的架构创新:(1) 一个可学习的运动库(Motion Bank),它查询全球运动模式字典,以提供超越局部感受野的语义上下文;(2) 一个门控预测头(Gated Prediction Head),将位移分解为大小和方向。该门控机制充当一种软的、可微分的零速度更新(Zero-Velocity Update, ZUPT),在静止期间动态抑制传感器噪声,同时在动态运动期间缩放预测。通过在四个公共基准上的广泛实验表明,GNIO显著减少了与最先进的CNN和基于Transformer的基线相比的位置漂移。值得注意的是,GNIO在OxIOD数据集上实现了60.21%的轨迹误差减少,并在涉及频繁停顿和不规则运动速度的挑战性场景中表现出优越的泛化能力。
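The gated magnitude/direction decomposition is simple to sketch; the linear layers and the sigmoid/softplus choices below are assumptions consistent with the abstract, not GNIO's exact head.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

def gated_displacement(feat, w_dir, w_mag, w_gate):
    # Displacement = gate * magnitude * unit direction; the gate in [0, 1]
    # plays the role of a soft, differentiable zero-velocity update (ZUPT).
    v = w_dir @ feat
    direction = v / (np.linalg.norm(v) + 1e-8)
    magnitude = softplus(w_mag @ feat)      # non-negative step length
    gate = sigmoid(w_gate @ feat)           # -> 0 when the sensor is stationary
    return gate * magnitude * direction

rng = np.random.default_rng(0)
d = 32
dx = gated_displacement(rng.normal(size=d),
                        w_dir=rng.normal(size=(3, d)),
                        w_mag=rng.normal(size=d),
                        w_gate=rng.normal(size=d))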
cs.RO / 101 / 2603.15329

User-Tailored Learning to Forecast Walking Modes for Exosuits

用户定制学习以预测外骨骼的行走模式
Abbate, Gabriele, Tricomi, Enrica, Gierden, Nathalie, Giusti, Alessandro, Masia, Lorenzo, Paolillo, Antonio
Abstract
Assistive robotic devices, like soft lower-limb exoskeletons or exosuits, are becoming widespread with the promise of helping people in everyday life. To make such systems adaptive to the variety of users wearing them, it is desirable to endow exosuits with advanced perception systems. However, exosuits have little sensory equipment because they need to be light and easy to wear. This paper presents a perception module based on machine learning that aims at estimating three walking modes (i.e., ascending or descending stairs and walking on level ground) of users wearing an exosuit. We tackle this perception problem using only inertial data from two sensors. Our approach provides an estimate for both future and past timesteps that supports control and enables a self-labeling procedure for online model adaptation. Indeed, we show that our estimate can label data acquired online and refine the model for new users. A thorough analysis carried out on real-life datasets shows the effectiveness of our user-tailored perception module. Finally, we integrate our system with the exosuit in a closed-loop controller, validating its performance in an online single-subject experiment.
Chinese Translation
辅助机器人设备,如软性下肢外骨骼或外骨骼服,正在广泛传播,承诺帮助人们改善日常生活。为了使这些系统能够适应穿戴者的多样性,赋予外骨骼服先进的感知系统是非常必要的。然而,外骨骼服的传感器设备较少,因为它们需要轻便且易于穿戴。本文提出了一种基于机器学习的感知模块,旨在估计穿戴外骨骼服的用户的三种行走模式(即,上楼梯、下楼梯和在平地行走)。我们仅使用来自两个传感器的惯性数据来解决这一感知问题。我们的方法为未来和过去的时间步提供估计,支持控制并实现在线模型自标记程序。实际上,我们展示了我们的估计能够标记在线获取的数据,并为新用户优化模型。对真实数据集进行的全面分析显示了我们用户定制感知模块的有效性。最后,我们将我们的系统与外骨骼服集成在一个闭环控制器中,验证了其在在线单一受试者实验中的性能。
cs.RO / 102 / 2603.15359

NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation

NavThinker:用于社会导航中的耦合预测和规划的行动条件世界模型
Hu, Tianshuai, Gong, Zeying, Kong, Lingdong, Mei, XiaoDong, Ding, Yiyi, Zeng, Qi, Liang, Ao, Li, Rong, Zhong, Yangyi, Liang, Junwei
Abstract
Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction-planning challenge, where robot actions and human motion mutually influence each other. To address this challenge, we propose NavThinker, a future-aware framework that couples an action-conditioned world model with on-policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi-head decoders then produce future depth maps and human trajectories, yielding a future-aware state aligned with traversability and interaction risk. Crucially, we train the policy with DD-PPO while injecting world-model think-ahead signals via: (i) action-conditioned future features fused into the current observation embedding and (ii) social reward shaping from predicted human trajectories. Experiments on single- and multi-robot Social-HM3D show state-of-the-art navigation success, with zero-shot transfer to Social-MP3D and real-world deployment on a Unitree Go2, validating generalization and practical applicability. Webpage: https://github.com/hutslib/NavThinker.
Chinese Translation
社会导航要求机器人在动态人类环境中安全地行动。有效的行为需要提前思考:推理在不同机器人动作下场景和行人的演变,而不仅仅依赖当前观察进行反应。这就带来了一个耦合的预测-规划挑战,其中机器人的动作与人类的运动相互影响。为了解决这个挑战,我们提出了NavThinker,一个未来意识框架,将行动条件的世界模型与在线强化学习相结合。世界模型在Depth Anything V2补丁特征空间中运行,并进行未来场景几何和人类运动的自回归预测;多头解码器随后生成未来深度图和人类轨迹,从而产生与可通行性和互动风险相一致的未来意识状态。关键是,我们通过DD-PPO训练策略,同时通过以下方式注入世界模型的前瞻信号:(i) 将条件于动作的未来特征融合到当前观察嵌入中,以及 (ii) 从预测的人类轨迹中进行社会奖励塑造。在单机器人和多机器人Social-HM3D上的实验显示了最先进的导航成功,且在Social-MP3D上实现了零样本迁移,并在Unitree Go2上进行了实际部署,验证了模型的一般化和实用性。网页链接: https://github.com/hutslib/NavThinker。
cs.RO / 103 / 2603.15410

End-to-End Dexterous Grasp Learning from Single-View Point Clouds via a Multi-Object Scene Dataset

基于多物体场景数据集的单视角点云端到端灵巧抓取学习
Geng, Tao, Yang, Dapeng, Liu, Ziwei, Zhang, Le, Qi, Le, Li, WangYang, Ren, Yi, Luo, Shan, Ni, Fenglei
Abstract
Dexterous grasping in multi-object scenes constitutes a fundamental challenge in robotic manipulation. Current mainstream grasping datasets predominantly focus on single-object scenarios and predefined grasp configurations, often neglecting environmental interference and the modeling of dexterous pre-grasp gestures, thereby limiting their generalizability in real-world applications. To address this, we propose DGS-Net, an end-to-end grasp prediction network capable of learning dense grasp configurations from single-view point clouds in multi-object scenes. Furthermore, we propose a two-stage grasp data generation strategy that progresses from dense single-object grasp synthesis to dense scene-level grasp generation. Our dataset comprises 307 objects, 240 multi-object scenes, and over 350k validated grasps. By explicitly modeling grasp offsets and pre-grasp configurations, the dataset provides more robust and accurate supervision for dexterous grasp learning. Experimental results show that DGS-Net achieves grasp success rates of 88.63% in simulation and 78.98% on a real robotic platform, while exhibiting lower penetration with a mean penetration depth of 0.375 mm and penetration volume of 559.45 mm³, outperforming existing methods and demonstrating strong effectiveness and generalization capability. Our dataset is available at https://github.com/4taotao8/DGS-Net.
Chinese Translation
在多物体场景中进行灵巧抓取是机器人操作中的一个基本挑战。目前主流的抓取数据集主要集中于单物体场景和预定义的抓取配置,往往忽视了环境干扰和灵巧预抓取姿势的建模,从而限制了其在实际应用中的普适性。为了解决这一问题,我们提出了DGS-Net,一种能够从多物体场景中的单视角点云学习密集抓取配置的端到端抓取预测网络。此外,我们提出了一种两阶段的抓取数据生成策略,该策略从密集单物体抓取合成进展到密集场景级抓取生成。我们的数据集包含307个物体、240个多物体场景以及超过35万个经过验证的抓取。通过明确建模抓取偏移和预抓取配置,该数据集为灵巧抓取学习提供了更稳健和准确的监督。实验结果表明,DGS-Net在仿真中的抓取成功率达到88.63%,在真实机器人平台上的成功率为78.98%,同时表现出较低的穿透率,平均穿透深度为0.375毫米,穿透体积为559.45立方毫米,超越了现有方法,展示了强大的有效性和泛化能力。我们的数据集可在https://github.com/4taotao8/DGS-Net获取。
cs.RO / 104 / 2603.15418

MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

MA-VLCM:用于多智能体团队环境中策略价值估计的视觉语言评论模型
Shaik, Shahil, Parameshwaran, Aditya, Nayak, Anshul, Smereka, Jonathon M., Wang, Yue
Abstract
Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation across models with differing VLM backbones in both in-distribution and out-of-distribution scenarios in multi-agent team settings.
Chinese Translation
多智能体强化学习(MARL)通常依赖于集中式评论者来估计价值函数。然而,从头学习这样的评论者在样本效率上非常低下,并且往往缺乏跨环境的泛化能力。同时,在互联网规模数据上训练的大型视觉-语言-动作模型(VLA)展现出强大的多模态推理和零样本泛化能力,但直接将其用于机器人执行仍然在计算上不可行,特别是在具有多样化体现和资源限制的异构多机器人系统中。为了解决这些挑战,我们提出了多智能体视觉-语言-评论模型(MA-VLCM),该框架用经过微调的预训练视觉-语言模型替代MARL中的学习集中式评论者,以评估多智能体行为。MA-VLCM作为一个集中式评论者,基于自然语言任务描述、视觉轨迹观察和结构化的多智能体状态信息进行条件化。通过在策略优化过程中消除评论者学习,我们的方法显著提高了样本效率,同时生成适合在资源受限的机器人上部署的紧凑执行策略。结果显示,在多智能体团队环境中,不同VLM骨干的模型在分布内和分布外场景下均能良好地进行零样本回报估计。
cs.RO / 105 / 2603.15445

Zero-Shot Generalization from Motion Demonstrations to New Tasks

从运动示范到新任务的零样本泛化
Freitag, Kilian, Combrink, Alvin, Figueroa, Nadia
Abstract
Learning motion policies from expert demonstrations is an essential paradigm in modern robotics. While end-to-end models aim for broad generalization, they require large datasets and computationally heavy inference. Conversely, learning dynamical systems (DS) provides fast, reactive, and provably stable control from very few demonstrations. However, existing DS learning methods typically model isolated tasks and struggle to reuse demonstrations for novel behaviors. In this work, we formalize the problem of combining isolated demonstrations within a shared workspace to enable generalization to unseen tasks. The Gaussian Graph is introduced, which reinterprets spatial components of learned motion primitives as discrete vertices with connections to one another. This formulation allows us to bridge continuous control with discrete graph search. We propose two frameworks leveraging this graph: Stitching, for constructing time-invariant DSs, and Chaining, giving a sequence-based DS for complex motions while retaining convergence guarantees. Simulations and real-robot experiments show that these methods successfully generalize to new tasks where baseline methods fail.
Chinese Translation
从专家示范中学习运动策略是现代机器人技术中的一个重要范式。尽管端到端模型旨在实现广泛的泛化,但它们需要大量的数据集和计算密集型的推理。相反,学习动态系统(DS)能够从极少的示范中提供快速、反应灵敏且可证明稳定的控制。然而,现有的DS学习方法通常建模孤立任务,难以重用示范以实现新行为。在本研究中,我们将孤立示范结合在共享工作空间中的问题进行了形式化,以实现对未见任务的泛化。我们引入了高斯图(Gaussian Graph),将学习到的运动原语的空间组件重新解释为相互连接的离散顶点。这种表述使我们能够将连续控制与离散图搜索相结合。我们提出了两种利用该图的框架:拼接(Stitching),用于构建时间不变的DS,以及链式(Chaining),为复杂运动提供基于序列的DS,同时保持收敛保证。仿真和真实机器人实验表明,这些方法成功地泛化到基线方法失败的新任务。
cs.RO / 106 / 2603.15469

RoCo Challenge at AAAI 2026: Benchmarking Robotic Collaborative Manipulation for Assembly Towards Industrial Automation

2026年AAAI的RoCo挑战赛:面向工业自动化的机器人协作操作组装基准测试
Liu, Haichao, Zhou, Yuheng, Wu, Zhenyu, Ji, Ziheng, Shan, Ziyu, Wang, Qianzhun, Liu, Ruixuan, Yang, Zhiyuan, Gu, Yejun, Khan, Shalman, Yan, Shijun, Liu, Jun, Zhu, Haiyue, Liu, Changliu, Yang, Jianfei, Zhang, Jingbing, Wang, Ziwei
Abstract
Embodied Artificial Intelligence (EAI) is rapidly developing, gradually subverting previous autonomous systems' paradigms from isolated perception to integrated, continuous action. This transition is highly significant for industrial robotic manipulation, promising to free human workers from repetitive, dangerous daily labor. To benchmark and advance this capability, we introduce the Robotic Collaborative Assembly Assistance (RoCo) Challenge with a dataset towards simulation and real-world assembly manipulation. Set against the backdrop of human-centered manufacturing, this challenge focuses on a high-precision planetary gearbox assembly task, a demanding yet highly representative operation in modern industry. Built upon a self-developed data collection, training, and evaluation system in Isaac Sim, and utilizing a dual-arm robot for real-world deployment, the challenge operates in two phases. The Simulation Round defines fine-grained task phases for step-wise scoring to handle the long-horizon nature of the assembly. The Real-World Round mirrors this evaluation with physical gearbox components and high-quality teleoperated datasets. The core tasks require assembling an epicyclic gearbox from scratch, including mounting three planet gears, a sun gear, and a ring gear. Attracting over 60 teams and 170+ participants from more than 10 countries, the challenge yielded highly effective solutions, most notably ARC-VLA and RoboCola. Results demonstrate that a dual-model framework for long-horizon multi-task learning is highly effective, and the strategic utilization of recovery-from-failure curriculum data is a critical insight for successful deployment. This report outlines the competition setup, evaluation approach, key findings, and future directions for industrial EAI. Our dataset, CAD files, code, and evaluation results can be found at: https://rocochallenge.github.io/RoCo2026/.
Chinese Translation
具身人工智能(EAI)正在快速发展,逐渐颠覆以往孤立感知的自主系统范式,转向集成的、连续的行动。这一转变对工业机器人操作具有重要意义,承诺将人类工人从重复且危险的日常劳动中解放出来。为了基准测试和推动这一能力的发展,我们引入了机器人协作组装辅助(RoCo)挑战赛,并提供了一个用于仿真和现实世界组装操作的数据集。在以人为本的制造背景下,该挑战聚焦于高精度行星齿轮箱组装任务,这是一项在现代工业中既具有挑战性又高度代表性的操作。该挑战基于在Isaac Sim中自开发的数据收集、训练和评估系统,并利用双臂机器人进行现实世界的部署,分为两个阶段进行。仿真轮次定义了细粒度的任务阶段,以逐步评分的方式处理组装的长期特性。现实世界轮次则使用物理齿轮箱组件和高质量的远程操作数据集来反映这一评估。核心任务要求从零开始组装一个行星齿轮箱,包括安装三个行星齿轮、一个太阳齿轮和一个环齿轮。该挑战吸引了来自10多个国家的60多个团队和170多名参与者,产生了高效的解决方案,尤其是ARC-VLA和RoboCola。结果表明,针对长期多任务学习的双模型框架非常有效,而从失败中恢复的课程数据的战略性利用是成功部署的关键洞察。本报告概述了比赛设置、评估方法、主要发现以及工业EAI的未来方向。我们的数据集、CAD文件、代码和评估结果可在以下网址找到:https://rocochallenge.github.io/RoCo2026/。
cs.RO / 107 / 2603.15471

On the Derivation of Tightly-Coupled LiDAR-Inertial Odometry with VoxelMap

基于体素地图的紧耦合激光雷达-惯性测程的推导
Zhan, Zhihao
Abstract
This note presents a concise mathematical formulation of tightly-coupled LiDAR-Inertial Odometry within an iterated error-state Kalman filter framework using a VoxelMap representation. Rather than proposing a new algorithm, it provides a clear and self-contained derivation that unifies the geometric modeling and probabilistic state estimation through consistent notation and explicit formulations. The document is intended to serve both as a technical reference and as an accessible entry point for a foundational understanding of the system architecture and estimation principles.
Chinese Translation
本文在使用体素地图表示的迭代误差状态卡尔曼滤波框架内,提出了一种简洁的紧耦合激光雷达-惯性测程的数学公式推导。本文不是提出一种新的算法,而是提供了一种清晰且自包含的推导,统一了几何建模和概率状态估计,采用一致的符号和明确的公式。该文档旨在作为技术参考,同时为系统架构和估计原理的基础理解提供一个易于接触的入门点。
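Schematically, the iterated error-state update such a derivation arrives at is a Gauss-Newton iteration on the maximum-a-posteriori cost (generic notation, not necessarily the note's own):

\delta x^{j} = \arg\min_{\delta x}\; \big\lVert \delta x + (\hat{x}^{j} \boxminus \bar{x}) \big\rVert_{P^{-1}}^{2} + \big\lVert r(\hat{x}^{j}) + H_{j}\,\delta x \big\rVert_{R^{-1}}^{2}, \qquad \hat{x}^{j+1} = \hat{x}^{j} \boxplus \delta x^{j}

where \bar{x} is the IMU-propagated prior, r collects the point-to-plane residuals against the VoxelMap planes, and H_j is re-linearized at each iterate; upon convergence the posterior covariance is updated as in a standard Kalman step.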
cs.RO / 108 / 2603.15528

Optimal control of differentially flat underactuated planar robots in the perspective of oscillation mitigation

从振荡抑制的角度看差分平坦欠驱动平面机器人的最优控制
Lovato, Stefano, Tonan, Michele, Bottin, Matteo, Massaro, Matteo, Doria, Alberto, Rosati, Giulio
Abstract
Underactuated robots are characterized by a larger number of degrees of freedom than actuators, and if they are designed with a specific mass distribution, they can be controlled by means of differential flatness theory. This structural property enables the development of lightweight and cost-effective robotic systems with enhanced dexterity. However, a key challenge lies in managing the passive joints, whose control demands precise and comprehensive dynamic modeling of the system. To simplify dynamic models, particularly for low-speed trajectories, friction is often neglected. While this assumption simplifies analysis and control design, it introduces residual oscillations of the end-effector about the target position. In this paper, the possibility of using optimal control along with differential flatness control is investigated to improve the tracking of the planned trajectories. The study was first carried out through formal analysis and then validated by means of numerical simulations. Results highlight that optimal control can be used to plan the flat variables considering different (quadratic) performance indices: control effort, i.e. motor torque, and potential energy of the considered underactuated joint. Moreover, the minimization of potential energy can be used to design motion laws that are robust against variation of the stiffness and damping of the underactuated joint, thus reducing oscillations in the case of stiffness/damping mismatch.
Chinese Translation
欠驱动机器人具有自由度大于驱动器的特征,如果它们的质量分布设计得当,可以通过差分平坦性理论进行控制。这一结构特性使得开发轻量且经济高效的机器人系统成为可能,从而增强了灵活性。然而,一个主要挑战在于管理被动关节,其控制需要对系统进行精确和全面的动态建模。为了简化动态模型,特别是在低速轨迹下,通常会忽略摩擦。虽然这一假设简化了分析和控制设计,但它引入了末端执行器围绕目标位置的残余振荡。本文探讨了结合最优控制与差分平坦控制的可能性,以改善规划轨迹的跟踪。首先,通过形式分析进行了研究,随后通过数值仿真进行了验证。结果表明,最优控制可以用于规划平坦变量,考虑不同的(平方)性能指标:控制努力,即电机扭矩,以及所考虑的欠驱动关节的势能。此外,势能的最小化可以用于设计对欠驱动关节的刚度和阻尼变化具有鲁棒性的运动规律,从而在刚度/阻尼不匹配的情况下减少振荡。
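In generic form, the two (quadratic) performance indices named above read (the paper's exact definitions may differ):

J_{\tau} = \int_{0}^{T} \tau(t)^{2}\, dt, \qquad J_{V} = \int_{0}^{T} V_{u}(t)\, dt \approx \int_{0}^{T} \tfrac{1}{2}\, k_{u}\, \delta_{u}(t)^{2}\, dt

where \tau is the motor torque and \delta_{u} the elastic deflection of the underactuated joint with stiffness k_{u}; minimizing J_{V} is consistent with the reported robustness, since motions that store little potential energy excite the passive joint less under stiffness/damping mismatch.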
cs.RO / 109 / 2603.15600

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

从被动观察者到主动批评者:强化学习引导机器人操作的过程推理
Liu, Yibin, Lyu, Yaxing, Gao, Daqi, Liang, Zhixuan, Tang, Weiliang, Mu, Shilong, Yang, Xiaokang, Mu, Yao
Abstract
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
Chinese Translation
准确的过程监督仍然是长时间跨度机器人操作的一项关键挑战。一个主要瓶颈是,目前的视频多模态学习模型(MLLMs)主要在监督微调(SFT)范式下训练,作为被动的“观察者”来识别正在发生的事件,而不是评估当前状态与最终任务目标的关系。本文介绍了PRIMO R1(过程推理引导监控),这是一个7B框架,将视频MLLMs转变为主动的“批评者”。我们利用基于结果的强化学习来激励显式的思维链生成,以进行进度估计。此外,我们的架构通过在初始状态和当前状态图像之间显式锚定视频序列,构建了一个结构化的时间输入。得益于所提出的PRIMO数据集和基准,在各种领域内环境和领域外真实人形场景中的广泛实验表明,PRIMO R1实现了最先进的性能。从定量上看,我们的7B模型在专门推理基线的平均绝对误差上减少了50%,显示出相对于72B规模的通用MLLMs显著的相对准确性提升。此外,PRIMO R1在困难的故障检测任务上表现出强大的零样本泛化能力。我们在RoboFail基准上建立了67.0%的准确率,超越了如OpenAI o1等闭源模型6.0%。
cs.RO / 110 / 2603.15604

EAAE: Energy-Aware Autonomous Exploration for UAVs in Unknown 3D Environments

EAAE:面向未知三维环境的能量感知自主探索无人机
Elskamp, Jacob, Shi, Moji, Bauersfeld, Leonard, Scaramuzza, Davide, Popović, Marija
Abstract
Battery-powered multirotor unmanned aerial vehicles (UAVs) can rapidly map unknown environments, but mission performance is often limited by energy rather than geometry alone. Standard exploration policies that optimise for coverage or time can therefore waste energy through manoeuvre-heavy trajectories. In this paper, we address energy-aware autonomous 3D exploration for multirotor UAVs in initially unknown environments. We propose Energy-Aware Autonomous Exploration (EAAE), a modular frontier-based framework that makes energy an explicit decision variable during frontier selection. EAAE clusters frontiers into view-consistent regions, plans dynamically feasible candidate trajectories to the most informative clusters, and predicts their execution energy using an offline power estimation loop. The next target is then selected by minimising predicted trajectory energy while preserving exploration progress through a dual-layer planning architecture for safe execution. We evaluate EAAE in a full exploration pipeline with a rotor-speed-based power model across simulated 3D environments of increasing complexity. Compared to representative distance-based and information gain-based frontier baselines, EAAE consistently reduces total energy consumption while maintaining competitive exploration time and comparable map quality, providing a practical drop-in energy-aware layer for frontier exploration.
Chinese Translation
电池供电的多旋翼无人机(UAV)能够快速绘制未知环境的地图,但任务性能往往受到能量而非几何形状的限制。因此,优化覆盖率或时间的标准探索策略可能会通过复杂的机动轨迹浪费能量。本文针对最初未知环境中的多旋翼无人机进行能量感知的自主三维探索。我们提出了能量感知自主探索(EAAE),这是一种模块化的基于前沿的框架,在前沿选择过程中将能量作为显式决策变量。EAAE将前沿聚类为视图一致的区域,规划动态可行的候选轨迹以到达信息量最大的聚类,并使用离线功率估计循环预测其执行能量。然后,通过双层规划架构在保持探索进展的同时最小化预测轨迹能量,从而选择下一个目标,以确保安全执行。我们在一个完整的探索管道中评估EAAE,使用基于转子速度的功率模型,针对复杂度逐渐增加的模拟三维环境进行测试。与代表性的基于距离和信息增益的前沿基线相比,EAAE始终减少总能量消耗,同时保持竞争性的探索时间和可比的地图质量,为前沿探索提供了一个实用的能量感知层。
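The selection rule reduces to a constrained minimization: among frontier clusters that still promise enough information gain, take the candidate trajectory with the lowest predicted energy. A toy sketch follows; the gain threshold and the path-length energy stand-in are placeholders for the paper's power-model rollout.

def select_target(clusters, predict_energy, min_gain=0.2):
    # Prefer low predicted energy among sufficiently informative clusters;
    # fall back to pure information gain so exploration keeps progressing.
    viable = [c for c in clusters if c["info_gain"] >= min_gain]
    if not viable:
        return max(clusters, key=lambda c: c["info_gain"])
    return min(viable, key=lambda c: predict_energy(c["trajectory"]))

def predict_energy(traj):
    # Stand-in for the offline power-model rollout: here, just path length.
    return sum(abs(b - a) for a, b in zip(traj, traj[1:]))

clusters = [
    {"info_gain": 0.5, "trajectory": [0.0, 1.0, 2.5]},   # informative but long
    {"info_gain": 0.3, "trajectory": [0.0, 0.5, 1.0]},   # viable and cheap
]
best = select_target(clusters, predict_energy)            # picks the cheap one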
cs.RO / 111 / 2603.15605

Perception-Aware Autonomous Exploration in Feature-Limited Environments

感知意识的自主探索在特征有限环境中的应用
Shi, Moji, de Silva, Rajitha, Yu, Hang, Polvara, Riccardo, Popović, Marija
Abstract
Autonomous exploration in unknown environments typically relies on onboard state estimation for localisation and mapping. Existing exploration methods primarily maximise coverage efficiency, but often overlook that visual-inertial odometry (VIO) performance strongly depends on the availability of robust visual features. As a result, exploration policies can drive a robot into feature-sparse regions where tracking degrades, leading to odometry drift, corrupted maps, and mission failure. We propose a hierarchical perception-aware exploration framework for a stereo-equipped unmanned aerial vehicle (UAV) that explicitly couples exploration progress with feature observability. Our approach (i) associates each candidate frontier with an expected feature quality using a global feature map, and prioritises visually informative subgoals, and (ii) optimises a continuous yaw trajectory along the planned motion to maintain stable feature tracks. We evaluate our method in simulation across environments with varying texture levels and in real-world indoor experiments with largely textureless walls. Compared to baselines that ignore feature quality and/or do not optimise continuous yaw, our method maintains more reliable feature tracking, reduces odometry drift, and achieves on average 30% higher coverage before the odometry error exceeds specified thresholds.
Chinese Translation
在未知环境中的自主探索通常依赖于机载状态估计进行定位和地图构建。现有的探索方法主要最大化覆盖效率,但往往忽视了视觉惯性测距(VIO)性能在很大程度上依赖于稳健视觉特征的可用性。因此,探索策略可能会将机器人引导至特征稀疏区域,在这些区域跟踪性能下降,导致里程计漂移、地图损坏和任务失败。我们提出了一种层次化的感知意识探索框架,适用于配备立体视觉的无人机(UAV),该框架明确将探索进展与特征可观测性相结合。我们的方法(i)使用全局特征图将每个候选前沿与预期特征质量关联,并优先考虑视觉信息丰富的子目标;(ii)优化沿计划运动的连续偏航轨迹,以保持稳定的特征跟踪。我们在不同纹理水平的环境中进行了仿真评估,并在实际室内实验中测试了大部分无纹理墙壁的情况。与忽略特征质量和/或不优化连续偏航的基线方法相比,我们的方法保持了更可靠的特征跟踪,减少了里程计漂移,并在里程计误差超过指定阈值之前实现了平均30%的更高覆盖率。
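The frontier-scoring idea lends itself to a short sketch: discount each frontier's exploration gain by the expected feature quality looked up in a global feature map, so feature-poor subgoals are deprioritised. The map layout, the scoring form, and the parameter names below are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

# Illustrative perception-aware frontier scoring: exploration gain is
# discounted by the feature density looked up in a global 2D feature map.
def frontier_utility(gain: float, frontier_xy: tuple, feature_map: np.ndarray,
                     cell: float = 0.5, alpha: float = 1.0) -> float:
    i = int(np.clip(int(frontier_xy[0] / cell), 0, feature_map.shape[0] - 1))
    j = int(np.clip(int(frontier_xy[1] / cell), 0, feature_map.shape[1] - 1))
    quality = feature_map[i, j]          # e.g., tracked-feature density in [0, 1]
    return gain * (quality ** alpha)     # feature-poor frontiers are down-weighted

fmap = np.random.rand(40, 40)            # stand-in for the global feature map
print(frontier_utility(gain=8.0, frontier_xy=(3.2, 7.9), feature_map=fmap))
```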
计算机视觉 (Computer Vision)
300
cs.CV / 1 / 2603.13238

KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

KazakhOCR:用于评估低资源哈萨克文字符识别的多模态模型的合成基准
Gagnier, Henry, Gagnier, Sophie, Kirubakaran, Ashwin
Abstract
Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.
Chinese Translation
哈萨克语是一种使用阿拉伯文、西里尔文和拉丁文的突厥语言,这使其在光学字符识别(OCR)方面具有独特性。针对低资源哈萨克文的OCR研究非常稀缺,且目前尚无阿拉伯文和拉丁文的OCR基准或图像。我们构建了一个包含7,219张图像的合成OCR数据集,涵盖三种文字,具有字体、颜色和噪声的变化,以模拟真实的OCR任务。我们在OCR和语言识别的基准子集上评估了三种多模态大型语言模型(MLLMs):Gemma-3-12B-it、Qwen2.5-VL-7B-Instruct和Llama-3.2-11B-Vision-Instruct。所有模型在拉丁文和阿拉伯文的OCR任务中均未成功,且未能将阿拉伯文识别为哈萨克文本,而是错误地将其分类为阿拉伯文、波斯文和库尔德文。我们进一步将MLLMs与传统OCR基线进行了比较,发现尽管传统OCR的字符错误率较低,但MLLMs未能达到这一性能。这些发现显示了当前MLLM在处理低资源阿布贾德(Abjad)基础文字方面的显著差距,并证明了需要支持低资源文字和语言的包容性模型和基准。
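For reference, the character error rate (CER) used to compare the MLLMs against the classical OCR baseline is the Levenshtein edit distance between prediction and ground truth, normalised by reference length. A plain dynamic-programming implementation, independent of the paper's evaluation code:

```python
# Character error rate (CER): Levenshtein distance / reference length.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / max(m, 1)

# Kazakh-specific Cyrillic letters replaced by Russian lookalikes: CER ~ 0.29
print(cer("қазақша", "казакша"))
```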
cs.CV / 2 / 2603.13240

Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field

无注释手语翻译:对该领域进展的无偏评估
Sincan, Ozge Mercanoglu, Low, Jian He, Asasi, Sobhan, Bowden, Richard
Abstract
Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here (https://github.com/ozgemercanoglu/sltbaselines) to support transparency and reproducibility in SLT research.
Chinese Translation
手语翻译(Sign Language Translation, SLT)旨在自动将视觉手语视频转换为口语文本,反之亦然。尽管近年来取得了快速进展,但性能提升的真实来源往往不明确。报告的性能提升是源于方法的新颖性,还是选择了不同的主干网络、训练优化、超参数调优,甚至是评估指标计算的差异?本文通过在统一的代码库中重新实现关键贡献,全面研究了最近的无注释SLT模型。我们通过标准化预处理、视频编码器和训练设置,确保所有方法之间的公平比较。我们的分析表明,文献中报告的许多性能提升在模型在一致条件下评估时往往会减弱,这表明实现细节和评估设置在确定结果中起着重要作用。我们在此公开代码库(https://github.com/ozgemercanoglu/sltbaselines),以支持SLT研究的透明性和可重复性。
cs.CV / 3 / 2603.13300

Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation

安全引导流(SGF):安全生成中负引导的统一框架
Kim, Mingyu, Kim, Young-Heon, Park, Mijung
Abstract
Safety mechanisms for diffusion and flow models have recently been developed along two distinct paths. In robot planning, control barrier functions are employed to guide generative trajectories away from obstacles at every denoising step by explicitly imposing geometric constraints. In parallel, recent data-driven, negative guidance approaches have been shown to suppress harmful content and promote diversity in generated samples. However, they rely on heuristics without clearly stating when safety guidance is actually necessary. In this paper, we first introduce a unified probabilistic framework using a Maximum Mean Discrepancy (MMD) potential for image generation tasks that recasts both Shielded Diffusion and Safe Denoiser as instances of our energy-based negative guidance against unsafe data samples. Furthermore, we leverage control barrier function analysis to justify the existence of a critical time window in which negative guidance must be strong; outside of this window, the guidance should decay to zero to ensure safe and high-quality generation. We evaluate our unified framework on several realistic safe generation scenarios, confirming that negative guidance should be applied in the early stages of the denoising process for successful safe generation.
Chinese Translation
近年来,扩散和流模型的安全机制沿着两条不同的路径发展。在机器人规划中,控制屏障函数被用来在每个去噪步骤中通过明确施加几何约束,引导生成轨迹远离障碍物。同时,最近的数据驱动负引导方法已被证明能够抑制有害内容并促进生成样本的多样性。然而,这些方法依赖于启发式规则,并未明确说明何时实际上需要安全引导。在本文中,我们首先引入一个统一的概率框架,针对图像生成任务使用最大均值差异(Maximum Mean Discrepancy, MMD)势函数,将屏蔽扩散(Shielded Diffusion)和安全去噪器(Safe Denoiser)重新表述为我们针对不安全数据样本的基于能量的负引导的实例。此外,我们利用控制屏障函数分析来证明在一个关键时间窗口内负引导必须保持强度;在这个窗口之外,引导应衰减至零,以确保安全和高质量的生成。我们在多个现实的安全生成场景中评估了我们的统一框架,确认负引导应在去噪过程的早期阶段应用,以实现成功的安全生成。
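The critical-time-window claim suggests a simple schedule: keep the negative-guidance weight at full strength during the early, high-noise denoising steps and decay it to zero afterwards. The smooth-step schedule below is a hedged sketch of that behaviour, not SGF's exact formulation.

```python
import math

# Hedged sketch: strong negative guidance inside the early critical window,
# cosine decay to zero outside it. t_critical and w_max are assumptions.
def negative_guidance_weight(t: int, num_steps: int, t_critical: float = 0.6,
                             w_max: float = 5.0) -> float:
    """t counts down from num_steps-1 (noisiest) to 0 (clean)."""
    progress = t / (num_steps - 1)          # 1.0 at the start, 0.0 at the end
    if progress >= t_critical:
        return w_max                         # inside the critical window
    return w_max * 0.5 * (1 - math.cos(math.pi * progress / t_critical))

for t in (49, 30, 15, 0):
    print(t, round(negative_guidance_weight(t, 50), 3))  # 5.0, 5.0, ~2.6, 0.0
```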
cs.CV / 4 / 2603.13306

Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision

针对弱监督下的剪辑级监控异常检测的紧凑型视觉语言模型基准测试
Borodin, Kirill, Kondrashov, Kirill, Vasiliev, Nikita, Gladkova, Ksenia, Larina, Inna, Gorodnichev, Mikhail, Mkrtchian, Grach
Abstract
CCTV safety monitoring demands anomaly detectors combine reliable clip-level accuracy with predictable per-clip latency despite weak supervision. This work investigates compact vision-language models (VLMs) as practical detectors for this regime. A unified evaluation protocol standardizes preprocessing, prompting, dataset splits, metrics, and runtime settings to compare parameter-efficiently adapted compact VLMs against training-free VLM pipelines and weakly supervised baselines. Evaluation spans accuracy, precision, recall, F1, ROC-AUC, and average per-clip latency to jointly quantify detection quality and efficiency. With parameter-efficient adaptation, compact VLMs achieve performance on par with, and in several cases exceeding, established approaches while retaining competitive per-clip latency. Adaptation further reduces prompt sensitivity, producing more consistent behavior across prompt regimes under the shared protocol. These results show that parameter-efficient fine-tuning enables compact VLMs to serve as dependable clip-level anomaly detectors, yielding a favorable accuracy-efficiency trade-off within a transparent and consistent experimental setup.
Chinese Translation
闭路电视(CCTV)安全监控要求异常检测器在弱监督的情况下,结合可靠的剪辑级准确性和可预测的每剪辑延迟。本研究探讨了紧凑型视觉语言模型(VLMs)作为该领域实用检测器的可行性。统一的评估协议标准化了预处理、提示、数据集划分、指标和运行时设置,以便将参数高效适配的紧凑型 VLMs 与无训练的 VLM 流水线和弱监督基线进行比较。评估涵盖准确性、精确率、召回率、F1 值、ROC-AUC 和平均每剪辑延迟,以共同量化检测质量和效率。通过参数高效的适配,紧凑型 VLMs 在性能上与已建立的方法相当,并在多个案例中超越了这些方法,同时保持竞争力的每剪辑延迟。适配进一步降低了对提示的敏感性,在共享协议下产生了更一致的行为。这些结果表明,参数高效的微调使得紧凑型 VLMs 能够作为可靠的剪辑级异常检测器,在透明且一致的实验设置中实现了良好的准确性与效率的权衡。
cs.CV / 5 / 2603.13335

Information-Theoretic Constraints for Continual Vision-Language-Action Alignment

持续视觉-语言-动作对齐的信息论约束
Zhao, Libang, Zeng, Qixin, Zhang, Hongyin, Wang, Donglin
Abstract
When deployed in open-ended robotic environments, Vision-Language-Action (VLA) models need to continually acquire new skills, yet suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure, where dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. However, existing continual learning methods fail to preserve such cross-modal information dependencies. Thus, we propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Furthermore, experiments on the LIBERO benchmark show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.
Chinese Translation
在开放式机器人环境中部署时,视觉-语言-动作(VLA)模型需要不断获取新技能,但却遭受严重的灾难性遗忘。我们观察到,这种退化与跨模态信息结构的恶化有关,其中视觉观察、语言指令和动作之间的依赖关系在持续适应过程中逐渐扩散。然而,现有的持续学习方法未能保持这种跨模态信息依赖。因此,我们提出了Info-VLA,这是一种信息保持的持续学习框架,通过两个互补的约束来维护跨模态信息结构。重放锚点对比学习(Replay Anchor Contrastive Learning)从一个冻结的教师模型构建稳定的对齐锚点,保持表示空间中的跨模态对齐。跨模态互信息最大化(Cross-Modal Mutual Information Maximization)进一步通过互信息约束保持视觉和语言表示之间的依赖结构。通过共同保持历史对齐和跨模态依赖信息,Info-VLA在持续学习过程中平衡了稳定性和可塑性。此外,在LIBERO上的实验表明,Info-VLA在任务保留和适应性方面显著优于现有方法。
cs.CV / 6 / 2603.13337

MultiSolSegment: Multi-channel segmentation of overlapping features in electroluminescence images of photovoltaic cells

MultiSolSegment:光伏电池电致发光图像中重叠特征的多通道分割
Sanghi, Ojas, Jost, Norman, Pierce, Benjamin G., Cooper, Emma, Deane, Isaiah H., Braid, Jennifer L.
Abstract
Electroluminescence (EL) imaging is widely used to detect defects in photovoltaic (PV) modules, and machine learning methods have been applied to enable large-scale analysis of EL images. However, existing methods cannot assign multiple labels to the same pixel, limiting their ability to capture overlapping degradation features. We present a multi-channel U-Net architecture for pixel-level multi-label segmentation of EL images. The model outputs independent probability maps for cracks, busbars, dark areas, and non-cell regions, enabling accurate co-classification of interacting features such as cracks crossing busbars. The model achieved an accuracy of 98% and has been shown to generalize to unseen datasets. This framework offers a scalable, extensible tool for automated PV module inspection, improving defect quantification and lifetime prediction in large-scale PV systems.
Chinese Translation
电致发光(EL)成像广泛用于检测光伏(PV)模块中的缺陷,机器学习方法已被应用于实现对EL图像的大规模分析。然而,现有方法无法为同一像素分配多个标签,限制了其捕捉重叠退化特征的能力。我们提出了一种多通道U-Net架构,用于EL图像的像素级多标签分割。该模型为裂纹、母线、暗区和非电池区域输出独立的概率图,从而实现了对交互特征(如穿过母线的裂纹)的准确共同分类。该模型达到了98%的准确率,并已证明能够推广到未见数据集。该框架提供了一种可扩展、易于拓展的自动化PV模块检测工具,提高了大规模PV系统中缺陷量化和寿命预测的能力。
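The architectural point is that each output channel carries an independent sigmoid probability, so a single pixel can belong to several classes at once. A minimal PyTorch sketch of such a multi-label head; the class list follows the abstract, while the tiny 1x1 head and channel widths are illustrative stand-ins for the full U-Net.

```python
import torch
import torch.nn as nn

# Per-channel sigmoid outputs allow overlapping labels, e.g. a crack pixel
# that also lies on a busbar. BCE replaces softmax cross-entropy on purpose.
CLASSES = ["crack", "busbar", "dark_area", "non_cell"]

head = nn.Conv2d(in_channels=64, out_channels=len(CLASSES), kernel_size=1)
criterion = nn.BCEWithLogitsLoss()   # independent per-channel supervision

features = torch.randn(2, 64, 128, 128)                       # decoder features
targets = torch.randint(0, 2, (2, len(CLASSES), 128, 128)).float()

logits = head(features)                                        # (B, 4, H, W)
loss = criterion(logits, targets)
probs = torch.sigmoid(logits)                                  # probability maps
overlap = (probs[:, 0] > 0.5) & (probs[:, 1] > 0.5)            # crack AND busbar
print(loss.item(), overlap.float().mean().item())
```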
cs.CV / 7 / 2603.13340

Complementarity-Supervised Spectral-Band Routing for Multimodal Emotion Recognition

互补性监督的光谱带路由用于多模态情感识别
Huang, Zhexian, Zhao, Bo, Ma, Hui, Liu, Zhishu, Zhang, Jie, Zhang, Ruixin, Ding, Shouhong, Yu, Zitong
Abstract
Multimodal emotion recognition fuses cues such as text, video, and audio to understand individual emotional states. Prior methods face two main limitations: mechanically relying on independent unimodal performance, thereby missing genuine complementary contributions, and coarse-grained fusion conflicting with the fine-grained representations required by emotion tasks. As inconsistent information density across heterogeneous modalities hinders inter-modal feature mining, we propose the Complementarity-Supervised Multi-Band Expert Network, named Atsuko, to model fine-grained complementary features via multi-scale band decomposition and expert collaboration. Specifically, we orthogonally decompose each modality's features into high, mid, and low-frequency components. Building upon this band-level routing, we design a modality-level router with a dual-path mechanism for fine-grained cross-band selection and cross-modal fusion. To mitigate shortcut learning from dominant modalities, we propose the Marginal Complementarity Module (MCM) to quantify performance loss when removing each modality via bi-modal comparison. The resulting complementarity distribution provides soft supervision, guiding the router to focus on modalities contributing unique information gains. Extensive experiments show our method achieves superior performance on the CMU-MOSI, CMU-MOSEI, CH-SIMS, CH-SIMSv2, and MIntRec benchmarks.
Chinese Translation
多模态情感识别融合文本、视频和音频等线索,以理解个体的情感状态。现有方法面临两个主要限制:机械地依赖独立的单模态性能,从而错过真正的互补贡献,以及粗粒度融合与情感任务所需的细粒度表示相冲突。由于异构模态间信息密度的不一致妨碍了跨模态特征挖掘,我们提出了互补性监督的多带专家网络(Complementarity-Supervised Multi-Band Expert Network),命名为Atsuko,以通过多尺度带分解和专家协作建模细粒度互补特征。具体而言,我们将每种模态的特征正交分解为高、中和低频成分。在此带级路由的基础上,我们设计了一个模态级路由器,采用双路径机制进行细粒度的跨带选择和跨模态融合。为了减轻来自主导模态的捷径学习,我们提出了边际互补性模块(Marginal Complementarity Module,MCM),通过双模态比较量化去除每种模态时的性能损失。由此产生的互补性分布提供了软监督,指导路由器关注那些贡献独特信息增益的模态。大量实验表明,我们的方法在CMU-MOSI、CMU-MOSEI、CH-SIMS、CH-SIMSv2和MIntRec基准测试中表现优越。
cs.CV / 8 / 2603.13341

Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

注意源无关跨域少样本学习中的可辨别性陷阱
Zhang, Zhenyu, Zou, Yixiong, Li, Yuhua, Li, Ruixuan, Chen, Guangyao
Abstract
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we aim to delve into this phenomenon for an interpretation and a solution. By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
Chinese Translation
源无关跨域少样本学习(Source-Free Cross-Domain Few-Shot Learning, SF-CDFSL)关注于使用来自目标域(例如医学或卫星图像)的有限训练数据进行微调,其中视觉-语言模型(Vision-Language Models, VLMs)如CLIP和SigLIP已显示出良好的效果。当前传统视觉模型的研究表明,提高视觉可辨别性可以增强性能。然而,在基于VLM的SF-CDFSL任务中,我们发现,增强视觉模态的可辨别性实际上抑制了VLM的性能。本文旨在深入探讨这一现象,以提供解释和解决方案。通过理论和实验的证明,我们的研究揭示了使用典型的交叉熵损失($\mathcal{L}_{\mathrm{vlm}}$)进行微调本质上包括了视觉学习部分和跨模态学习部分,其中跨模态部分对于纠正SF-CDFSL中严重干扰的模态不对齐至关重要。然而,我们发现视觉学习本质上充当了一种捷径,促使模型在不考虑跨模态部分的情况下减少$\mathcal{L}_{\mathrm{vlm}}$,从而阻碍了跨模态对齐并损害了性能。基于这一解释,我们进一步提出了一种解决该问题的方法:首先,我们扰动视觉学习,以引导模型关注跨模态对齐。然后,我们利用视觉-文本语义关系在微调过程中逐步对齐视觉和文本模态。在各种设置、主干网络(CLIP、SigLIP、PE-Core)和任务(4个CDFSL数据集和11个FSL数据集)上的大量实验表明,我们始终设定了新的最先进结果。代码可在https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap获取。
cs.CV / 9 / 2603.13345

DDS-UDA: Dual-Domain Synergy for Unsupervised Domain Adaptation in Joint Segmentation of Optic Disc and Optic Cup

DDS-UDA:用于视盘和视杯联合分割的无监督领域适应的双领域协同
Xiao, Yusong, Wu, Yuxuan, Xiao, Li, Qu, Gang, Huo, Haiye, Wang, Yu-Ping
Abstract
Convolutional neural networks (CNNs) have achieved exciting performance in joint segmentation of optic disc and optic cup on single-institution datasets. However, their clinical translation is hindered by two major challenges: limited availability of large-scale, high-quality annotations and performance degradation caused by domain shift during deployment across heterogeneous imaging protocols and acquisition platforms. While unsupervised domain adaptation (UDA) provides a way to mitigate these limitations, most existing approaches do not address cross-domain interference and intra-domain generalization within a unified framework. In this paper, we present the Dual-Domain Synergy UDA (DDS-UDA), a novel UDA framework that comprises two key modules. First, a bi-directional cross-domain consistency regularization module is enforced to mitigate cross-domain interference through feature-level semantic information exchange guided by a coarse-to-fine dynamic mask generator, suppressing noise propagation while preserving structural coherence. Second, a frequency-driven intra-domain pseudo label learning module is used to enhance intra-domain generalization by synthesizing spectral amplitude-mixed supervision signals, which ensures high-fidelity feature alignment across domains. Implemented within a teacher-student architecture, DDS-UDA disentangles domain-specific biases from domain-invariant feature-level representations, thereby achieving robust adaptation to heterogeneous imaging environments. We conduct a comprehensive evaluation of our proposed method on two multi-domain fundus image datasets, demonstrating that it outperforms several existing UDA based methods and therefore providing an effective way for optic disc and optic cup segmentation.
Chinese Translation
卷积神经网络(CNN)在单一机构数据集上实现了视盘和视杯联合分割的令人振奋的性能。然而,其临床转化受到两个主要挑战的阻碍:大规模高质量标注的有限可用性以及在异构成像协议和采集平台上部署时因领域转移导致的性能下降。虽然无监督领域适应(UDA)提供了一种缓解这些限制的方法,但大多数现有方法未能在统一框架内解决跨领域干扰和领域内泛化的问题。本文提出了双领域协同无监督领域适应(DDS-UDA),这是一种新颖的UDA框架,包含两个关键模块。首先,强制执行双向跨领域一致性正则化模块,通过粗到细的动态掩模生成器引导的特征级语义信息交换来减轻跨领域干扰,抑制噪声传播,同时保持结构一致性。其次,使用频率驱动的领域内伪标签学习模块,通过合成光谱幅度混合的监督信号来增强领域内泛化,确保跨领域的高保真特征对齐。在教师-学生架构中实现,DDS-UDA将领域特定偏差与领域不变特征级表示解耦,从而实现对异构成像环境的稳健适应。我们在两个多领域眼底图像数据集上对所提方法进行了全面评估,结果表明其优于几种现有的基于UDA的方法,因此为视盘和视杯分割提供了一种有效的解决方案。
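The frequency-driven module's "spectral amplitude-mixed supervision signals" echo a well-known recipe: blend the low-frequency FFT amplitudes of a source and a target image while keeping the source phase. Whether DDS-UDA follows this exact Fourier-domain form is an assumption; the sketch below shows the generic technique.

```python
import numpy as np

# Generic spectral amplitude mixing (in the style of Fourier Domain
# Adaptation): blend low-frequency amplitudes, keep the source phase.
def amplitude_mix(src: np.ndarray, tgt: np.ndarray, lam: float = 0.5,
                  beta: float = 0.1) -> np.ndarray:
    fs = np.fft.fftshift(np.fft.fft2(src))
    ft = np.fft.fftshift(np.fft.fft2(tgt))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)
    h, w = src.shape
    bh, bw = int(h * beta), int(w * beta)          # low-frequency window size
    ch, cw = h // 2, w // 2
    amp_mixed = amp_s.copy()
    amp_mixed[ch-bh:ch+bh, cw-bw:cw+bw] = (
        (1 - lam) * amp_s[ch-bh:ch+bh, cw-bw:cw+bw]
        + lam * amp_t[ch-bh:ch+bh, cw-bw:cw+bw])
    mixed = amp_mixed * np.exp(1j * pha_s)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed)))

out = amplitude_mix(np.random.rand(256, 256), np.random.rand(256, 256))
print(out.shape, out.dtype)
```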
cs.CV / 10 / 2603.13346

Post Training Quantization for Efficient Dataset Condensation

高效数据集浓缩的后训练量化
Tran, Linh-Tam, Bae, Sung-Ho
Abstract
Dataset Condensation (DC) distills knowledge from large datasets into smaller ones, accelerating training and reducing storage requirements. However, despite notable progress, prior methods have largely overlooked the potential of quantization for further reducing storage costs. In this paper, we take the first step to explore post-training quantization in dataset condensation, demonstrating its effectiveness in reducing storage size while maintaining representation quality without requiring expensive training cost. However, we find that at extremely low bit-widths (e.g., 2-bit), conventional quantization leads to substantial degradation in representation quality, negatively impacting the networks trained on these data. To address this, we propose a novel patch-based post-training quantization approach that ensures localized quantization with minimal loss of information. To reduce the overhead of quantization parameters, especially for small patch sizes, we employ quantization-aware clustering to identify similar patches and subsequently aggregate them for efficient quantization. Furthermore, we introduce a refinement module to align the distribution between original images and their dequantized counterparts, compensating for quantization errors. Our method is a plug-and-play framework that can be applied to synthetic images generated by various DC methods. Extensive experiments across diverse benchmarks including CIFAR-10/100, Tiny ImageNet, and ImageNet subsets demonstrate that our method consistently outperforms prior works under the same storage constraints. Notably, our method nearly doubles the test accuracy of existing methods at extreme compression regimes (e.g., 26.0% → 54.1% for DM at IPC=1), while operating directly on 2-bit images without additional distillation.
Chinese Translation
数据集浓缩(Dataset Condensation, DC)将大数据集中的知识提炼到较小的数据集中,从而加速训练并减少存储需求。然而,尽管取得了显著进展,之前的方法在进一步降低存储成本方面大多忽视了量化的潜力。本文首次探索了数据集浓缩中的后训练量化,展示了其在保持表示质量的同时有效减少存储大小的能力,而无需昂贵的训练成本。然而,我们发现,在极低的比特宽度(例如,2位)下,传统量化会导致表示质量显著下降,负面影响在这些数据上训练的网络。为了解决这个问题,我们提出了一种新颖的基于补丁的后训练量化方法,确保局部量化时信息损失最小。为了减少量化参数的开销,特别是对于小补丁尺寸,我们采用量化感知聚类来识别相似的补丁,并随后将其聚合以实现高效量化。此外,我们引入了一个精细化模块,以对齐原始图像与其去量化对应物之间的分布,补偿量化误差。我们的方法是一个即插即用的框架,可以应用于各种DC方法生成的合成图像。在包括CIFAR-10/100、Tiny ImageNet和ImageNet子集在内的多种基准测试中进行的广泛实验表明,我们的方法在相同存储限制下始终优于之前的工作。值得注意的是,我们的方法在极端压缩情况下(例如,DM在IPC=1时从26.0%提升至54.1%)几乎使现有方法的测试准确率翻倍,同时直接在2位图像上操作,无需额外的蒸馏。
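The core of patch-based post-training quantization fits in a few lines: each patch gets its own scale and zero point so quantization error stays local even at 2 bits. The quantization-aware clustering and refinement stages of the paper are omitted here; patch size and parameter layout are illustrative.

```python
import numpy as np

# Per-patch uniform quantization at 2 bits: one (scale, offset) pair per patch.
def quantize_patches(img: np.ndarray, patch: int = 8, bits: int = 2):
    levels = 2 ** bits - 1
    q = np.zeros_like(img, dtype=np.uint8)
    params = {}
    for i in range(0, img.shape[0], patch):
        for j in range(0, img.shape[1], patch):
            block = img[i:i+patch, j:j+patch]
            lo, hi = block.min(), block.max()
            scale = (hi - lo) / levels if hi > lo else 1.0
            q[i:i+patch, j:j+patch] = np.round((block - lo) / scale)
            params[(i, j)] = (scale, lo)
    return q, params

def dequantize_patches(q, params, patch=8):
    out = np.zeros(q.shape, dtype=np.float32)
    for (i, j), (scale, lo) in params.items():
        out[i:i+patch, j:j+patch] = q[i:i+patch, j:j+patch] * scale + lo
    return out

img = np.random.rand(32, 32).astype(np.float32)
q, params = quantize_patches(img)
rec = dequantize_patches(q, params)
print(float(np.abs(img - rec).max()))   # error bounded by roughly scale/2 per patch
```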
cs.CV / 11 / 2603.13349

MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval

MURE:通过视觉-语言模型进行视觉文档检索的分层多分辨率编码
Zhu, Fengbin, Cai, Zijing, Wang, Yuzhe, Shao, Pengyang, Wang, Wenjie, Feng, Fuli, Hong, Richang, Chua, Tat-Seng
Abstract
Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on the insights, we develop MURE, a novel framework that employs VLMs as a hierarchical multi-resolution encoder, integrates resolution-level Matryoshka representation learning (RMRL) for effective feature fusion, and applies a semantic-aware hierarchical clustering mechanism for visual token compression. Experiments on two widely used VDR benchmarks show that our MURE framework consistently beats strong baselines. Furthermore, it significantly outperforms ColPali with only 50% of its visual token budget.
Chinese Translation
视觉文档检索(VDR)需要能够捕捉细粒度视觉细节和全局文档结构的表示,以确保检索效率,同时保持计算效率。现有的 VDR 模型在处理高分辨率文档时难以平衡效果与效率:它们往往要么丢失细粒度信息,要么生成过多的视觉标记,从而导致显著的索引开销和高检索延迟。在本研究中,我们重新思考视觉编码机制,提出了一种新的 X-VisEmb 范式,该范式从多分辨率采样和编码,经过跨粒度特征融合,到自适应表示蒸馏。初步研究验证了其在不同尺度下捕捉互补视觉线索的可行性和有效性。在此基础上,我们开发了 MURE,一个新颖的框架,利用 VLMs 作为分层多分辨率编码器,整合分辨率级 Matryoshka 表示学习(RMRL)以实现有效的特征融合,并应用语义感知的分层聚类机制进行视觉标记压缩。在两个广泛使用的 VDR 基准上的实验表明,我们的 MURE 框架始终优于强基线。此外,它在视觉标记预算仅为 ColPali 的 50% 时,显著超越了 ColPali 的性能。
cs.CV / 12 / 2603.13352

Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

局部精确细化:一种双门控专家混合模型用于增强基础模型在光谱变化下的泛化能力
Chen, Xi, Zhang, Maojun, Liu, Yu, Yan, Shen
Abstract
Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
Chinese Translation
光谱遥感中的领域泛化语义分割(DGSS)面临着来自多样化获取条件的光谱变化的严重挑战,这导致在未见领域中部署的模型性能显著下降。尽管对基础模型进行参数高效微调(PEFT)是一种有前景的方向,但现有方法采用全局、同质的调整。这种“一刀切”的调优方法难以应对土地覆盖的空间异质性,造成语义混淆。我们认为,鲁棒的DGSS的关键不在于单一的全局适应,而在于对基础模型特征进行细粒度、空间自适应的细化。为此,我们提出了SpectralMoE,一种用于DGSS的新型PEFT框架。它通过利用专家混合(MoE)架构对基础模型的特征进行局部精确细化,结合从选定的光谱遥感图像RGB波段估计的深度特征来指导微调过程。具体而言,SpectralMoE采用双门控MoE架构,独立地将视觉和深度特征路由到前k个选定的专家进行专业化细化,从而实现模态特定的调整。随后,交叉注意机制明智地将细化后的结构线索融合到视觉流中,减轻光谱变化引起的语义模糊。大量实验表明,SpectralMoE在多个DGSS基准测试中设立了新的最先进水平,涵盖超光谱、多光谱和RGB遥感图像。
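A hedged sketch of the dual-gated routing idea: visual and depth tokens pass through separate gates that each select top-k experts from a shared pool, giving modality-specific refinement. Expert width, k, and the omission of the subsequent cross-attention fusion step are simplifications of the described architecture, not its actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dual-gated top-k MoE sketch: separate routers for visual and depth tokens
# over one shared expert pool. Sizes are illustrative assumptions.
class DualGatedMoE(nn.Module):
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.gate_visual = nn.Linear(dim, num_experts)
        self.gate_depth = nn.Linear(dim, num_experts)
        self.k = k

    def _route(self, x, gate):
        weights, idx = torch.topk(F.softmax(gate(x), dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

    def forward(self, visual, depth):
        return self._route(visual, self.gate_visual), self._route(depth, self.gate_depth)

moe = DualGatedMoE()
v, d = moe(torch.randn(16, 256), torch.randn(16, 256))
print(v.shape, d.shape)
```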
cs.CV / 13 / 2603.13354

AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

AgriPath:作物疾病分类架构权衡的系统探索
Mooraj, Hamza, Pantazopoulos, George, Suglia, Alessandro
Abstract
Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.
Chinese Translation
可靠的作物疾病检测需要在多样的采集条件下表现一致的模型,但现有评估往往集中于单一架构家族或实验室生成的数据集。本研究提出了三种模型范式在细粒度作物疾病分类中的系统实证比较:卷积神经网络(CNNs)、对比视觉-语言模型(VLMs)和生成性VLMs。为了实现对领域效应的控制分析,我们引入了AgriPath-LF16,这是一个包含111k张图像的基准数据集,涵盖16种作物和41种疾病,并明确区分实验室和田间图像,同时提供一个平衡的30k子集用于标准化训练和评估。所有模型在统一的协议下进行训练和评估,涵盖完整、仅实验室和仅田间的训练模式,使用宏F1和解析成功率(PSR)来考虑生成可靠性。结果显示出不同的性能特征。CNNs在实验室图像上达到最高准确率,但在领域转移时表现下降。对比VLMs提供了一种稳健且参数高效的替代方案,具有竞争力的跨领域性能。生成性VLMs在分布变化中表现出最强的韧性,尽管由于自由文本生成而出现额外的失败模式。这些发现强调,架构选择应根据部署背景而非仅仅依赖于整体准确率来指导。
cs.CV / 14 / 2603.13355

Int3DNet: Scene-Motion Cross Attention Network for 3D Intention Prediction in Mixed Reality

Int3DNet:用于混合现实中3D意图预测的场景运动交叉注意力网络
Ha, Taewook, Cho, Woojin, Kim, Dooyoung, Woo, Woontack
Abstract
We propose Int3DNet, a scene-aware network that predicts 3D intention areas directly from scene geometry and head-hand motion cues, enabling robust human intention prediction without explicit object-level perception. In Mixed Reality (MR), intention prediction is critical as it enables the system to anticipate user actions and respond proactively, reducing interaction delays and ensuring seamless user experiences. Our method employs a cross attention fusion of sparse motion cues and scene point clouds, offering a novel approach that directly interprets the user's spatial intention within the scene. We evaluated Int3DNet on MoGaze and CIRCLE datasets, which are public datasets for full-body human-scene interactions, showing consistent performance across time horizons of up to 1500 ms and outperforming the baselines, even in diverse and unseen scenes. Moreover, we demonstrate the usability of proposed method through a demonstration of efficient visual question answering (VQA) based on intention areas. Int3DNet provides reliable 3D intention areas derived from head-hand motion and scene geometry, thus enabling seamless interaction between humans and MR systems through proactive processing of intention areas.
Chinese Translation
我们提出了Int3DNet,这是一种场景感知网络,能够直接从场景几何和头手运动线索中预测3D意图区域,从而实现稳健的人类意图预测,而无需显式的对象级感知。在混合现实(MR)中,意图预测至关重要,因为它使系统能够预见用户的行为并主动响应,从而减少交互延迟,确保无缝的用户体验。我们的方法采用稀疏运动线索和场景点云的交叉注意力融合,提供了一种新颖的方法,能够直接解释用户在场景中的空间意图。我们在MoGaze和CIRCLE数据集上评估了Int3DNet,这些数据集是用于全身人类与场景交互的公共数据集,显示出在高达1500毫秒的时间范围内的一致性能,并且在多样化和未见过的场景中超越了基线。此外,我们通过基于意图区域的高效视觉问答(VQA)演示展示了所提方法的可用性。Int3DNet提供了可靠的3D意图区域,这些区域源自头手运动和场景几何,从而通过主动处理意图区域,实现人类与MR系统之间的无缝交互。
cs.CV / 15 / 2603.13357

Bi-CamoDiffusion: A Boundary-informed Diffusion Approach for Camouflaged Object Detection

Bi-CamoDiffusion:一种边界信息驱动的伪装物体检测扩散方法
Suarez, Patricia L., Ramos, Leo Thomas, Sappa, Angel D.
Abstract
Bi-CamoDiffusion is introduced, an evolution of the CamoDiffusion framework for camouflaged object detection. It integrates edge priors into early-stage embeddings via a parameter-free injection process, which enhances boundary sharpness and prevents structural ambiguity. This is governed by a unified optimization objective that balances spatial accuracy, structural constraints, and uncertainty supervision, allowing the model to capture both the object's global context and its intricate boundary transitions. Evaluations across the CAMO, COD10K, and NC4K benchmarks show that Bi-CamoDiffusion surpasses the baseline, delivering sharper delineation of thin structures and protrusions while also minimizing false positives. Moreover, our model consistently outperforms existing state-of-the-art methods across all evaluated metrics, including $S_m$, $F_{\beta}^{w}$, $E_m$, and $MAE$, demonstrating a more precise object-background separation and sharper boundary recovery.
Chinese Translation
本文介绍了Bi-CamoDiffusion,这是CamoDiffusion框架在伪装物体检测中的演变。它通过无参数的注入过程将边缘先验整合到早期嵌入中,从而增强边界清晰度并防止结构模糊。这一过程由统一的优化目标驱动,平衡空间精度、结构约束和不确定性监督,使模型能够捕捉对象的全局上下文及其复杂的边界过渡。针对CAMO、COD10K和NC4K基准的评估显示,Bi-CamoDiffusion超过了基线,实现了对薄结构和突起的更清晰描绘,同时最小化假阳性。此外,我们的模型在所有评估指标上,包括$S_m$、$F_{\beta}^{w}$、$E_m$和$MAE$,均持续优于现有的最新技术方法,展现了更精确的物体与背景分离及更清晰的边界恢复。
cs.CV / 16 / 2603.13360

Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Graph2Video:利用视频模型建模动态图演化
Liu, Hua, Wei, Yanbin, Xing, Fei, Derr, Tyler, Han, Haoyu, Zhang, Yu
Abstract
Dynamic graphs are common in real-world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the complexity of temporal evolution. They tend to overlook fine-grained variations in temporal interaction order, struggle with dependencies that span long time horizons, and offer limited capability to model pair-specific relational dynamics. To address these challenges, we propose Graph2Video, a video-inspired framework that views the temporal neighborhood of a target link as a sequence of "graph frames". By stacking temporally ordered subgraph frames into a "graph video", Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight and plug-and-play link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state-of-the-art baselines on the link prediction task in most cases. The results highlight the potential of borrowing spatio-temporal modeling techniques from computer vision as a promising and effective approach for advancing dynamic graph learning.
Chinese Translation
动态图在社交媒体、推荐系统和交通网络等现实世界系统中非常常见。现有的用于链接预测的动态图模型往往无法有效捕捉时间演化的复杂性。它们倾向于忽略时间交互顺序中的细微变化,难以处理跨越长时间范围的依赖关系,并且在建模特定对之间的关系动态方面能力有限。为了解决这些挑战,我们提出了Graph2Video,一个受视频启发的框架,它将目标链接的时间邻域视为一系列“图帧”。通过将按时间顺序排列的子图帧堆叠成一个“图视频”,Graph2Video 利用视频基础模型的归纳偏置来捕捉细微的局部变化和长程的时间动态。它生成一个链接级嵌入,作为轻量级且即插即用的以链接为中心的记忆单元。该嵌入可以无缝集成到现有的动态图编码器中,有效地应对了先前方法的局限性。在基准数据集上的大量实验表明,在大多数情况下,Graph2Video 在链接预测任务上优于现有的最先进基线。结果突显了借用计算机视觉中的时空建模技术作为推进动态图学习的有前景和有效的方法的潜力。
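The "graph video" construction can be pictured as binning a link's temporal neighborhood into ordered frames and rasterising each frame's interactions into an adjacency slice, yielding a (T, N, N) tensor a video backbone can consume. The binning and node indexing below are assumptions about the general recipe, not Graph2Video's exact pipeline.

```python
import numpy as np

# Bin timestamped interactions into ordered "graph frames".
def graph_video(events, node_ids, t_start, t_end, num_frames=8):
    """events: iterable of (u, v, timestamp) interactions."""
    index = {n: i for i, n in enumerate(node_ids)}
    n = len(node_ids)
    frames = np.zeros((num_frames, n, n), dtype=np.float32)
    span = (t_end - t_start) / num_frames
    for u, v, ts in events:
        if t_start <= ts < t_end and u in index and v in index:
            f = min(int((ts - t_start) / span), num_frames - 1)
            frames[f, index[u], index[v]] = 1.0
            frames[f, index[v], index[u]] = 1.0   # undirected interaction
    return frames

ev = [(0, 1, 0.1), (1, 2, 0.35), (0, 2, 0.8)]
print(graph_video(ev, node_ids=[0, 1, 2], t_start=0.0, t_end=1.0).shape)  # (8, 3, 3)
```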
cs.CV / 17 / 2603.13361

BrainCast: A Spatio-Temporal Forecasting Model for Whole-Brain fMRI Time Series Prediction

BrainCast:一个用于全脑fMRI时间序列预测的时空预测模型
Gao, Yunlong, Yang, Jinbo, Xiao, Li, Huo, Haiye, Ji, Yang, Wang, Hao, Zhang, Aiying, Wang, Yu-Ping
Abstract
Functional magnetic resonance imaging (fMRI) enables noninvasive investigation of brain function, while short clinical scan durations, arising from human and non-human factors, usually lead to reduced data quality and limited statistical power for neuroimaging research. In this paper, we propose BrainCast, a novel spatio-temporal forecasting framework specifically tailored for whole-brain fMRI time series forecasting, to extend informative fMRI time series without additional data acquisition. It formulates fMRI time series forecasting as a multivariate time series prediction task and jointly models temporal dynamics within regions of interest (ROIs) and spatial interactions across ROIs. Specifically, BrainCast integrates a Spatial Interaction Awareness module to characterize inter-ROI dependencies via embedding every ROI time series as a token, a Temporal Feature Refinement module to capture intrinsic neural dynamics within each ROI by enhancing both low- and high-energy temporal components of fMRI time series at the ROI level, and a Spatio-temporal Pattern Alignment module to combine spatial and temporal representations for producing informative whole-brain features. Experimental results on resting-state and task fMRI datasets from the Human Connectome Project demonstrate the superiority of BrainCast over state-of-the-art time series forecasting baselines. Moreover, fMRI time series extended by BrainCast improve downstream cognitive ability prediction, highlighting the clinical and neuroscientific impact brought by whole-brain fMRI time series forecasting in scenarios with restricted scan durations.
Chinese Translation
功能性磁共振成像(fMRI)使得对大脑功能的非侵入性研究成为可能,然而,由于人类和非人类因素导致的短时间临床扫描通常会降低数据质量,并限制神经影像研究的统计效能。本文提出了BrainCast,一种专门为全脑fMRI时间序列预测量身定制的新型时空预测框架,旨在在不增加数据采集的情况下扩展有意义的fMRI时间序列。它将fMRI时间序列预测形式化为一个多变量时间序列预测任务,并共同建模感兴趣区域(ROIs)内的时间动态和ROIs之间的空间互动。具体而言,BrainCast集成了一个空间互动意识模块,通过将每个ROI时间序列嵌入为一个标记来表征ROIs之间的依赖关系,一个时间特征精炼模块,通过增强ROI级别的fMRI时间序列的低能量和高能量时间组件来捕捉每个ROI内的内在神经动态,以及一个时空模式对齐模块,以结合空间和时间表示,生成有意义的全脑特征。基于人类连接组计划的静息态和任务fMRI数据集的实验结果展示了BrainCast相较于最先进的时间序列预测基线的优越性。此外,BrainCast扩展的fMRI时间序列提升了下游认知能力预测,突显了全脑fMRI时间序列预测在扫描时间受限场景中所带来的临床和神经科学影响。
cs.CV / 18 / 2603.13363

IAML: Illumination-Aware Mirror Loss for Progressive Learning in Low-Light Image Enhancement Auto-encoders

IAML:用于低光照图像增强自编码器的光照感知镜像损失的渐进学习
Mohsen, Farida, Zaim, Tala, Al-Zawqari, Ali, Safa, Ali, Belhaouari, Samir
Abstract
This letter presents a novel training approach and loss function for learning low-light image enhancement auto-encoders. Our approach revolves around the use of a teacher-student auto-encoder setup coupled with a progressive learning approach where multi-scale information from clean image decoder feature maps is distilled into each layer of the student decoder in a mirrored fashion using a newly-proposed loss function termed Illumination-Aware Mirror Loss (IAML). IAML helps align the feature maps within the student decoder network with clean feature maps originating from the teacher side while taking into account the effect of lighting variations within the input images. Extensive benchmarking of our proposed approach on three popular low-light image enhancement datasets demonstrates that our model achieves state-of-the-art performance in terms of average SSIM, PSNR and LPIPS reconstruction accuracy metrics. Finally, ablation studies are performed to clearly demonstrate the effect of IAML on the image reconstruction accuracy.
Chinese Translation
本文提出了一种新颖的训练方法和损失函数,用于学习低光照图像增强自编码器。我们的方法围绕教师-学生自编码器设置展开,并结合渐进学习方法,将来自干净图像解码器特征图的多尺度信息以镜像方式提炼到学生解码器的每一层,使用一种新提出的损失函数,称为光照感知镜像损失(Illumination-Aware Mirror Loss, IAML)。IAML有助于将学生解码器网络中的特征图与来自教师端的干净特征图对齐,同时考虑输入图像中光照变化的影响。我们在三个流行的低光照图像增强数据集上对所提出的方法进行了广泛的基准测试,结果表明我们的模型在平均结构相似性指数(SSIM)、峰值信噪比(PSNR)和感知图像相似性(LPIPS)重建精度指标方面达到了最先进的性能。最后,进行了消融研究,以清晰展示IAML对图像重建精度的影响。
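A hedged sketch of a mirrored, illumination-weighted distillation loss: each student decoder feature map is matched against the same-depth teacher map, with darker input regions up-weighted. The abstract does not give IAML's exact weighting, so the brightness-based weight below is an assumption.

```python
import torch
import torch.nn.functional as F

# Mirrored multi-scale distillation with an assumed illumination weighting:
# darker input regions (low luminance) get larger distillation weight.
def mirror_loss(student_feats, teacher_feats, image, gamma: float = 1.0):
    """student_feats/teacher_feats: lists of (B, C, H, W), shallow to deep."""
    luminance = image.mean(dim=1, keepdim=True)          # (B, 1, H, W) in [0, 1]
    total = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        w = 1.0 + gamma * (1.0 - F.interpolate(
            luminance, size=fs.shape[-2:], mode="bilinear", align_corners=False))
        total = total + (w * (fs - ft.detach()).abs()).mean()
    return total / len(student_feats)

img = torch.rand(2, 3, 64, 64)
s = [torch.randn(2, 32, 32, 32), torch.randn(2, 16, 64, 64)]
t = [torch.randn(2, 32, 32, 32), torch.randn(2, 16, 64, 64)]
print(mirror_loss(s, t, img).item())
```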
cs.CV / 19 / 2603.13364

FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

FineRMoE:通过其升级方法实现更细粒度专家的维度扩展
Liao, Ning, Wang, Xiaoxing, Qin, Xiaohan, Yan, Junchi
Abstract
As revealed by the scaling law of fine-grained MoE, model performance ceases to improve once the granularity of the intermediate dimension exceeds the optimal threshold, limiting further gains from single-dimension fine-grained design. To address this bottleneck, we propose FineRMoE (FineR-Grained MoE), an architecture that extends fine-grained expert design to both intermediate and output dimensions, aiming to enhance expert specialization beyond the single-dimension limit. We further introduce a bi-level sparse forward computation paradigm and a specialized routing mechanism to govern the activation. In addition, to obviate the prohibitive cost of training FineRMoE from scratch, we devise a generalized upcycling method to build FineRMoE in a cost-effective manner. Extensive experiments demonstrate the superior performance achieved by FineRMoE across ten standard benchmarks. Compared with the strongest baseline, FineRMoE achieves 6 times higher parameter efficiency, 281 times lower prefill latency, and 136 times higher decoding throughput during inference.
Chinese Translation
根据细粒度混合专家(MoE)的规模法则,当中间维度的粒度超过最佳阈值时,模型性能将不再提升,这限制了单维细粒度设计带来的进一步收益。为了解决这一瓶颈,我们提出了FineRMoE(FineR-Grained MoE),一种将细粒度专家设计扩展到中间维度和输出维度的架构,旨在超越单维限制增强专家专业化。我们进一步引入了一种双层稀疏前向计算范式和一种专门的路由机制来管理激活。此外,为了避免从头训练FineRMoE的高昂成本,我们设计了一种通用的升级方法,以经济高效的方式构建FineRMoE。大量实验表明,FineRMoE在十个标准基准测试中实现了卓越的性能。与最强基线相比,FineRMoE在参数效率上提高了6倍,预填充延迟降低了281倍,推理过程中的解码吞吐量提高了136倍。
cs.CV / 20 / 2603.13365

WaveComm: Lightweight Communication for Collaborative Perception via Wavelet Feature Distillation

WaveComm:通过小波特征蒸馏实现协同感知的轻量级通信
Bao, Erdemt, Yang, Jin
Abstract
In multi-agent collaborative sensing systems, substantial communication overhead from information exchange significantly limits scalability and real-time performance, especially in bandwidth-constrained environments. This often results in degraded performance and reduced reliability. To address this challenge, we propose WaveComm, a wavelet-based communication framework that drastically reduces transmission loads while preserving sensing performance in low-bandwidth scenarios. The core innovation of WaveComm lies in decomposing feature maps using Discrete Wavelet Transform (DWT), transmitting only compact low-frequency components to minimize communication overhead. High-frequency details are omitted, and their effects are reconstructed at the receiver side using a lightweight generator. A Multi-Scale Distillation (MSD) Loss is employed to optimize the reconstruction quality across pixel, structural, semantic, and distributional levels. Experiments on the OPV2V and DAIR-V2X datasets for LiDAR-based and camera-based perception tasks demonstrate that WaveComm maintains state-of-the-art performance even when the communication volume is reduced to 86.3% and 87.0% of the original, respectively. Compared to existing approaches, WaveComm achieves competitive improvements in both communication efficiency and perception accuracy. Ablation studies further validate the effectiveness of its key components.
Chinese Translation
在多智能体协同感知系统中,信息交换带来的大量通信开销显著限制了系统的可扩展性和实时性能,尤其是在带宽受限的环境中。这通常导致性能下降和可靠性降低。为了解决这一挑战,我们提出了WaveComm,一个基于小波的通信框架,能够在低带宽场景下显著减少传输负载,同时保持感知性能。WaveComm的核心创新在于使用离散小波变换(Discrete Wavelet Transform, DWT)对特征图进行分解,仅传输紧凑的低频成分,以最小化通信开销。高频细节被省略,其影响在接收端通过轻量级生成器进行重构。采用多尺度蒸馏(Multi-Scale Distillation, MSD)损失来优化像素、结构、语义和分布层面的重构质量。在基于LiDAR和相机的感知任务中,对OPV2V和DAIR-V2X数据集的实验表明,即使通信量减少到原始的86.3%和87.0%,WaveComm仍能保持最先进的性能。与现有方法相比,WaveComm在通信效率和感知准确性方面均取得了竞争性的提升。消融研究进一步验证了其关键组件的有效性。
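The transmission-side idea reduces to one DWT call: decompose the feature map and send only the low-frequency (LL) band, roughly a quarter of the values for a single-level transform. In the paper, the receiver regenerates the detail bands with a learned generator; the sketch below simply zero-fills them to show the mechanics, using PyWavelets.

```python
import numpy as np
import pywt

# Decompose with a 2D DWT and keep only the LL band for transmission.
feat = np.random.rand(64, 64).astype(np.float32)

cA, (cH, cV, cD) = pywt.dwt2(feat, "haar")     # LL plus three detail bands
payload = cA                                    # ~1/4 of the original values
print(feat.size, payload.size)                  # 4096 -> 1024

# Receiver-side placeholder: zero-filled detail bands (the paper uses a
# lightweight generator here instead).
zeros = np.zeros_like(cH)
approx = pywt.idwt2((payload, (zeros, zeros, zeros)), "haar")
print(approx.shape, float(np.abs(approx - feat).mean()))
```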
cs.CV / 21 / 2603.13366

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

在不确定性中思考:利用潜在熵感知解码减轻多模态大型推理模型中的幻觉
Xu, Zhongxing, Wang, Zhonghua, Qian, Zhe, Shi, Dachuan, Tang, Feilong, Hu, Ming, Su, Shiyan, Zou, Xiaocheng, Feng, Wei, Mahapatra, Dwarikanath, Peng, Yifan, Lin, Mingquan, Ge, Zongyuan
Abstract
Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
Chinese Translation
近期多模态大型推理模型(MLRMs)在视觉问答领域的进展显著提升了性能。然而,我们观察到过渡词(如 because、however 和 wait)与幻觉密切相关,并且往往表现出高熵状态。我们认为,适当的上下文推理信息可以直接从词元概率分布中提取。受叠加表征理论的启发,我们提出利用潜在的叠加推理来整合多个候选语义,并维持潜在推理轨迹。我们的假设是,依赖离散文本输入可能会使模型走向顺序显式推理,在高熵推理阶段未能充分利用密集的上下文线索。因此,我们提出构建丰富的语义表征,从词元概率分布中增强上下文推理。为此,我们提出了一种潜在熵感知解码(LEAD)的方法,这是一种高效的即插即用解码策略,利用语义上下文实现可靠推理。我们方法的核心在于熵感知推理模式切换。模型在高熵状态下采用概率加权的连续嵌入,并在熵降低时切换回离散词元嵌入。此外,我们还提出了一种先验引导的视觉锚点注入策略,鼓励模型关注视觉信息。大量实验证明,LEAD 在多个基准测试中有效减轻了多种 MLRMs 的幻觉现象。
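The mode switch can be sketched directly: compute the entropy of the next-token distribution and, when it exceeds a threshold, feed back a probability-weighted mixture of token embeddings instead of the argmax token's embedding. Threshold and mixing rule are assumptions drawn from the abstract's description, not LEAD's exact procedure.

```python
import torch

# Entropy-aware choice between a "superposed" soft embedding and the
# discrete argmax token embedding. tau is an assumed threshold.
def next_input_embedding(logits: torch.Tensor, embedding: torch.nn.Embedding,
                         tau: float = 2.0) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)                      # (V,)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    if entropy > tau:                       # uncertain: keep candidates mixed
        return probs @ embedding.weight     # (D,) probability-weighted embedding
    return embedding(probs.argmax())        # confident: discrete token embedding

vocab, dim = 1000, 64
emb = torch.nn.Embedding(vocab, dim)
print(next_input_embedding(torch.randn(vocab), emb).shape)     # torch.Size([64])
```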
cs.CV / 22 / 2603.13367

Multimodal Deep Learning for Dynamic and Static Neuroimaging: Integrating MRI and fMRI for Alzheimer Disease Analysis

动态与静态神经影像的多模态深度学习:整合MRI与fMRI进行阿尔茨海默病分析
Kujur, Anima, Monfared, Zahra
Abstract
Magnetic Resonance Imaging (MRI) provides detailed structural information, while functional MRI (fMRI) captures temporal brain activity. In this work, we present a multimodal deep learning framework that integrates MRI and fMRI for multi-class classification of Alzheimer Disease (AD), Mild Cognitive Impairment, and Normal Cognitive State. Structural features are extracted from MRI using 3D convolutional neural networks, while temporal features are learned from fMRI sequences using recurrent architectures. These representations are fused to enable joint spatial-temporal learning. Experiments were conducted on a small paired MRI-fMRI dataset (29 subjects), both with and without data augmentation. Results show that data augmentation substantially improves classification stability and generalization, particularly for the multimodal 3DCNN-LSTM model. In contrast, augmentation was found to be ineffective for a large-scale single-modality MRI dataset. These findings highlight the importance of dataset size and modality when designing augmentation strategies for neuroimaging-based AD classification.
Chinese Translation
磁共振成像(MRI)提供了详细的结构信息,而功能性磁共振成像(fMRI)则捕捉了大脑的时间活动。在本研究中,我们提出了一种多模态深度学习框架,整合MRI与fMRI用于阿尔茨海默病(AD)、轻度认知障碍和正常认知状态的多类别分类。结构特征通过3D卷积神经网络从MRI中提取,而时间特征则通过递归架构从fMRI序列中学习。这些表示被融合以实现联合的时空学习。我们在一个小型配对的MRI-fMRI数据集(29个受试者)上进行了实验,包括有和没有数据增强的情况。结果表明,数据增强显著提高了分类的稳定性和泛化能力,尤其是对于多模态3DCNN-LSTM模型。相反,对于大规模单模态MRI数据集,发现数据增强效果不佳。这些发现突显了在设计基于神经影像的阿尔茨海默病分类的增强策略时,数据集大小和模态的重要性。
cs.CV / 23 / 2603.13368

Real-Time Monocular Scene Analysis for UAV in Outdoor Environments

户外环境下无人机的实时单目场景分析
AlaaEldin, Yara
Abstract
In this thesis, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture, named Co-SemDepth, that can perform the two tasks accurately and rapidly, and validate its effectiveness on a variety of datasets. The training of neural networks requires an abundance of annotated data, and in the UAV field, the availability of such data is limited. We introduce a new synthetic dataset in this thesis, TopAir, which contains images captured with a nadir view in outdoor environments at different altitudes, helping to fill the gap. While using synthetic data for the training is convenient, it raises issues when shifting to the real domain for testing. We conduct an extensive analytical study to assess the effect of several factors on the synthetic-to-real generalization. Co-SemDepth and TaskPrompter models are used for comparison in this study. The results reveal a superior generalization performance for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. Also, our analysis allows us to determine which training datasets lead to better generalization. Moreover, to help attenuate the gap between the synthetic and real domains, image style transfer techniques are explored on aerial images to convert from the synthetic to the realistic style. Cycle-GAN and Diffusion models are employed. The results reveal that diffusion models are better in the synthetic to real style transfer. In the end, we focus on the marine domain and address its challenges. Co-SemDepth is trained on a collected synthetic marine dataset, called MidSea, and tested on both synthetic and real data. The results reveal good generalization performance of Co-SemDepth when tested on real data from the SMD dataset while further enhancement is needed on the MIT dataset.
Chinese Translation
在本论文中,我们利用空中机器人上的单目相机预测低空非结构化环境中的深度和语义图。我们提出了一种名为 Co-SemDepth 的联合深度学习架构,能够快速准确地执行这两项任务,并在多种数据集上验证其有效性。神经网络的训练需要大量标注数据,而在无人机领域,这类数据的可用性有限。为此,我们在本论文中引入了一个新的合成数据集 TopAir,该数据集包含在不同高度的户外环境中以天顶视角捕获的图像,帮助填补这一空白。尽管使用合成数据进行训练很方便,但在转向真实领域进行测试时会出现问题。我们进行了广泛的分析研究,以评估多个因素对合成到真实泛化的影响。Co-SemDepth 和 TaskPrompter 模型在本研究中用于比较。结果显示,Co-SemDepth 在深度估计方面具有优越的泛化性能,而 TaskPrompter 在语义分割方面表现更佳。此外,我们的分析使我们能够确定哪些训练数据集能够导致更好的泛化。此外,为了帮助减小合成域和真实域之间的差距,我们探索了图像风格迁移技术,旨在将合成风格转换为真实风格。我们采用了 Cycle-GAN 和扩散模型。结果表明,扩散模型在合成到真实的风格迁移中表现更佳。最后,我们聚焦于海洋领域并解决其挑战。Co-SemDepth 在收集的合成海洋数据集 MidSea 上进行训练,并在合成和真实数据上进行测试。结果显示,Co-SemDepth 在 SMD 数据集的真实数据测试中表现出良好的泛化性能,而在 MIT 数据集上则需要进一步增强。
cs.CV / 24 / 2603.13369

Disentangling Prompt Dependence to Evaluate Segmentation Reliability in Gynecological MRI

解构提示依赖性以评估妇科MRI中的分割可靠性
Germani, Elodie, Nyangoh-Timoh, Krystel, Jannin, Pierre, Baxter, John S H
Abstract
Promptable segmentation models (e.g., the Segment Anything Models) enable generalizable, zero-shot segmentation across diverse domains. Although predictions are deterministic for a fixed image-prompt pair, the robustness of these models to variations in user prompts, referred to as prompt dependence, remains underexplored. In safety-critical workflows with substantial inter-user variability, interpretable and informative frameworks are needed to evaluate prompt dependence. In this work, we assess the reliability of promptable segmentation by analyzing and measuring its sensitivity to prompt variability. We introduce the first formulation of prompt dependence that explicitly disentangles prompt ambiguity (inter-user variability) from local sensitivity (interaction imprecision), offering an interpretable view of segmentation robustness. Experiments on two female pelvic MRI datasets for uterus and bladder segmentation reveal a strong negative correlation between both metrics and segmentation performance, highlighting the value of our framework for assessing robustness. The two metrics have low mutual correlation, supporting the disentangled design of our formulation, and provide meaningful indicators of prompt-related failure modes.
Chinese Translation
可提示的分割模型(例如,Segment Anything Models)能够在不同领域实现可泛化的零样本分割。尽管对于固定的图像-提示对,预测是确定性的,但这些模型对用户提示变化的鲁棒性,即提示依赖性,仍然未被充分探讨。在具有显著用户间变异性的安全关键工作流程中,需要可解释且信息丰富的框架来评估提示依赖性。在本研究中,我们通过分析和测量其对提示变异性的敏感性来评估可提示分割的可靠性。我们首次提出了提示依赖性的公式,明确将提示模糊性(用户间变异性)与局部敏感性(交互不精确性)解构开来,从而提供了分割鲁棒性的可解释视角。在两个女性盆腔MRI数据集上进行的子宫和膀胱分割实验显示,这两个指标与分割性能之间存在强负相关,突显了我们框架在评估鲁棒性方面的价值。这两个指标之间的互相关联性较低,支持了我们公式的解构设计,并提供了与提示相关的失败模式的有意义指标。
cs.CV / 25 / 2603.13370

GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

GraphVLM:多模态图学习的视觉语言模型基准测试
Liu, Jiajin, Fan, Dongzhe, Ji, Chuanhao, Zha, Daochen, Tan, Qiaoyu
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision-language models as a new foundation for multimodal graph learning. The benchmark code is publicly available at https://github.com/oamyjin/GraphVLM.
Chinese Translation
视觉语言模型(VLMs)在对齐和理解多模态信号方面表现出色,但它们在结构化数据上的推理能力仍然未得到充分探索,尤其是在多模态实体通过显式关系图连接的情况下。释放这一能力对于社交网络、推荐系统和科学发现等现实世界应用至关重要,因为多模态信息本质上是结构化的。为了解决这一问题,我们提出了GraphVLM,这是一个系统性的基准测试,旨在评估和利用VLMs在多模态图学习(MMGL)中的能力。GraphVLM探讨了三种互补的范式,将VLMs与图推理相结合:(1)VLM作为编码器,通过多模态特征融合增强图神经网络;(2)VLM作为对齐器,在潜在空间或语言空间中桥接模态,以促进基于LLM的结构化推理;(3)VLM作为预测器,直接将VLMs作为图学习任务的多模态骨干。针对来自不同领域的六个数据集进行的广泛实验表明,VLMs通过这三种角色增强了多模态图学习。在这些范式中,VLM作为预测器实现了最显著和一致的性能提升,揭示了视觉语言模型作为多模态图学习新基础的潜力。基准代码已公开发布在 https://github.com/oamyjin/GraphVLM。
cs.CV / 26 / 2603.13371

Agentic LLM Workflow for MR Spectroscopy Volume-of-Interest Placements in Brain Tumors

用于脑肿瘤磁共振波谱体积兴趣区放置的自主大型语言模型工作流程
Lee, Sangyoon, Branzoli, Francesca, Marjańska, Małgorzata, Bolan, Patrick
Abstract
Magnetic resonance spectroscopy (MRS) provides clinically valuable metabolic characterization of brain tumors, but its utility depends on accurate placement of the spectroscopy volume-of-interest (VOI). However, VOI placement typically has a broad operating window: for a given tumor there are multiple possible VOIs that would lead to high-quality MRS measurements. Thus, a VOI placement can be tuned for clinician preference, case-specific anatomy, and clinical priorities, which leads to high inter-operator variability, especially for heterogeneous tumors. We propose an agentic large language model (LLM) workflow that decomposes VOI placement into generation of diverse candidate VOIs, from which the LLM selects an optimal one based on quantitative metrics. Candidate VOIs are generated by vision transformer-based placement models trained with different objective function preferences, which allows selection from acceptable alternatives rather than a single deterministic placement. On 110 clinical brain tumor cases, the agentic workflow achieves improved solid tumor coverage and necrosis avoidance depending on the user preferences compared to the general-purpose expert placements. Overall, the proposed workflow provides a strategy to adapt VOI placement to different clinical objectives without retraining task-specific models.
Chinese Translation
磁共振波谱(MRS)为脑肿瘤提供了临床上有价值的代谢特征,但其效用依赖于对波谱体积兴趣区(VOI)的准确放置。然而,VOI放置通常具有较宽的操作窗口:对于给定的肿瘤,有多个可能的VOI可以导致高质量的MRS测量。因此,VOI的放置可以根据临床医生的偏好、特定病例的解剖结构和临床优先事项进行调整,这导致了较高的操作员间变异性,尤其是在异质性肿瘤中。我们提出了一种自主大型语言模型(LLM)工作流程,将VOI放置分解为生成多样的候选VOI,LLM根据定量指标从中选择最佳的一个。候选VOI由基于视觉变换器的放置模型生成,这些模型经过不同目标函数偏好的训练,使得可以从可接受的替代方案中进行选择,而不是单一的确定性放置。在110个临床脑肿瘤案例中,该自主工作流程在用户偏好的基础上,相较于通用专家放置,实现了更好的实体肿瘤覆盖和坏死避免。总体而言,所提出的工作流程提供了一种将VOI放置适应于不同临床目标的策略,而无需重新训练特定任务的模型。
cs.CV / 27 / 2603.13374

Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection

基于几何的语义推理用于无训练的视频异常检测
Zia, Ali, Ali, Usman, Ramzan, Muhammad Umer, Abid, Hamza, Rehman, Abdul, Xiang, Wei
Abstract
Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
Chinese Translation
无训练的视频异常检测(VAD)最近作为一种可扩展的替代方案出现,以取代监督学习方法。然而,现有方法在很大程度上依赖于静态提示和与几何无关的特征融合。因此,异常推断往往被简化为对欧几里得嵌入的浅层相似性匹配,这导致预测不稳定且可解释性有限,尤其是在复杂或分层结构的场景中。我们提出了MM-VAD,一种基于几何的语义推理框架,用于无训练的VAD,将异常检测重新构建为自适应的测试时推断,而非固定的特征比较。我们的方法将基于标题的场景表示投影到双曲空间,以更好地保留层次结构,并通过对冻结的大型语言模型进行自适应问答过程来进行异常评估。在测试时,使用无监督的置信度稀疏目标优化轻量级、可学习的提示,实现上下文特定的校准,而无需更新任何主干参数。为了进一步将语义预测与视觉证据相结合,我们纳入了一种考虑协方差的马哈拉诺比斯细化方法,以稳定跨模态对齐。在四个基准测试中,MM-VAD始终优于之前的无训练方法,在XD-Violence上达到了90.03%的AUC,在UCF-Crime、ShanghaiTech和UCSD Ped2上分别达到了83.24%、96.95%和98.81%。我们的结果表明,基于几何的表示和自适应语义校准为无训练VAD中的静态欧几里得匹配提供了一个有原则且有效的替代方案。
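The covariance-aware refinement rests on the standard Mahalanobis distance: score a test embedding by its distance from a reference (normal) distribution under a shrinkage-regularised covariance. How MM-VAD folds this into cross-modal alignment is richer than this, but the core computation named in the abstract is:

```python
import numpy as np

# Mahalanobis score of a test embedding against a reference distribution.
def mahalanobis_score(x: np.ndarray, reference: np.ndarray, eps: float = 1e-3):
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    cov += eps * np.eye(cov.shape[0])          # shrinkage for numerical stability
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

normal_embeddings = np.random.randn(500, 16)
print(mahalanobis_score(np.random.randn(16), normal_embeddings))
print(mahalanobis_score(np.random.randn(16) * 5 + 3, normal_embeddings))  # larger
```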
cs.CV / 28 / 2603.13375

InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization

InfiniteDance:可扩展的3D舞蹈生成以实现野外泛化
Li, Ronghui, Hu, Zhongyuan, Siyao, Li, Zhang, Youliang, Xie, Haozhe, Zhang, Mingyuan, Guo, Jie, Li, Xiu, Liu, Ziwei
Abstract
Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dance, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. (1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints that enforce physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset totaling 100.69 hours. (2) On model design, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythms across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation. Code, models, and data will be released.
Chinese Translation
尽管现有的3D舞蹈生成方法在受控场景中表现良好,但它们在野外环境中的泛化能力往往较弱。当面对未见过的音乐时,现有方法通常会生成无结构或在物理上不合理的舞蹈,这主要是由于音乐到舞蹈的数据有限以及模型能力受限。本研究旨在通过扩大数据和模型设计的规模,推动可泛化3D舞蹈生成的前沿。(1) 在数据方面,我们开发了一个全自动化的管道,从单目视频中重建高保真度的3D舞蹈动作。为了消除现有重建方法中普遍存在的物理伪影,我们引入了一种脚部恢复扩散模型(Foot Restoration Diffusion Model,FRDM),该模型通过脚接触和几何约束引导,强制执行物理合理性,同时保持运动的平滑性和表现力,最终生成一个多样化的高质量多模态3D舞蹈数据集,总计100.69小时。(2) 在模型设计方面,我们提出了编舞LLaMA(Choreographic LLaMA,ChoreoLLaMA),这是一种可扩展的基于LLaMA的架构。为了增强在不熟悉音乐条件下的鲁棒性,我们集成了一个检索增强生成(Retrieval-Augmented Generation,RAG)模块,该模块将参考舞蹈作为提示注入。此外,我们设计了一个慢/快节奏的专家混合(Mixture-of-Experts,MoE)模块,使ChoreoLLaMA能够在不同音乐节奏之间平滑地调整运动节奏。针对不同舞蹈风格的广泛实验表明,我们的方法在定性和定量评估中均优于现有方法,标志着可扩展的现实世界3D舞蹈生成迈出了重要一步。代码、模型和数据将会发布。
cs.CV / 29 / 2603.13376

A Computer-aided Framework for Detecting Osteosarcoma in Computed Tomography Scans

一种计算机辅助框架用于检测计算机断层扫描中的骨肉瘤
Rodriguez-Herrero, Maximo, Sanchez-Gallegos, Dante D., Núñez-Gaona, Marco Antonio, Aguirre-Meneses, Heriberto, Gutiérrez, Luis Alberto Villalvazo, Velasco, Mario Ibrahin Gutiérrez, Gonzalez-Compean, J. L., Carretero, Jesus
Abstract
Osteosarcoma is the most common primary bone cancer, mainly affecting the youngest and oldest populations. Its detection at early stages is crucial to reduce the probability of developing bone metastasis. In this context, accurate and fast diagnosis is essential to help physicians during the prognosis process. The research goal is to automate the diagnosis of osteosarcoma through a pipeline that includes the preprocessing, detection, postprocessing, and visualization of computed tomography (CT) scans. Thus, this paper presents a machine learning and visualization framework for classifying CT scans using different convolutional neural network (CNN) models. Preprocessing includes data augmentation and identification of the region of interest in scans. Post-processing includes data visualization to render a 3D bone model that highlights the affected area. An evaluation on 12 patients revealed the effectiveness of our framework, obtaining an area under the curve (AUC) of 94.8% and a specificity of 94.6%.
Chinese Translation
骨肉瘤是最常见的原发性骨癌,主要影响年轻和年长人群。早期检测对于降低骨转移的发生概率至关重要。在此背景下,准确快速的诊断对于帮助医生进行预后过程至关重要。本研究的目标是通过一个包含预处理、检测、后处理和计算机断层扫描(CT)可视化的流程,自动化骨肉瘤的诊断。因此,本文提出了一种机器学习和可视化框架,使用不同的卷积神经网络(CNN)模型对CT扫描进行分类。预处理包括数据增强和扫描中感兴趣区域的识别。后处理包括数据可视化,以渲染出突出受影响区域的3D骨模型。对12名患者的评估显示了我们框架的有效性,获得了94.8%的曲线下面积(AUC)和94.6%的特异性。
cs.CV / 30 / 2603.13377

Deep Learning for BioImaging: What Are We Learning?

深度学习在生物成像中的应用:我们学到了什么?
Svatko, Ivan, Sanchez, Maxime, Bendidi, Ihab, Cottrell, Gilles, Genovesio, Auguste
Abstract
Representation learning has driven major advances in natural image analysis by enabling models to acquire high-level semantic features. In microscopy imaging, however, it remains unclear what current representation learning methods actually learn. In this work, we conduct a systematic study of representation learning for the two most widely used and broadly available microscopy data types, representing critical scales in biology: cell culture and tissue imaging. To this end, we introduce a set of simple yet revealing baselines on curated benchmarks, including untrained models and simple structural representations of cellular tissue. Our results show that, surprisingly, state-of-the-art methods perform comparably to these baselines. We further show that, in contrast to natural images, existing models fail to consistently acquire high-level, biologically meaningful features. Moreover, we demonstrate that commonly used benchmark metrics are insufficient to assess representation quality and often mask this limitation. In addition, we investigate how detailed comparisons with these benchmarks provide ways to interpret the strengths and weaknesses of models for further improvements. Together, our results suggest that progress in microscopy image representation learning requires not only stronger models, but also more diagnostic benchmarks that measure what is actually learned.
Chinese Translation
表示学习通过使模型能够获取高层次的语义特征,推动了自然图像分析的重大进展。然而,在显微镜成像中,目前的表示学习方法实际学到了什么仍然不清楚。在本研究中,我们对两种最广泛使用且可广泛获取的显微镜数据类型进行了系统研究,这些数据类型代表了生物学中的关键尺度:细胞培养和组织成像。为此,我们在经过精心挑选的基准上引入了一组简单但具有启发性的基线,包括未训练的模型和细胞组织的简单结构表示。我们的结果表明,令人惊讶的是,最先进的方法与这些基线的表现相当。我们进一步展示,与自然图像相比,现有模型未能持续获取高层次的、生物学上有意义的特征。此外,我们证明,常用的基准指标不足以评估表示质量,并且常常掩盖了这一局限性。此外,我们研究了如何通过与这些基准的详细比较来解释模型的优缺点,以便进行进一步改进。综合来看,我们的结果表明,显微镜图像表示学习的进展不仅需要更强的模型,还需要更具诊断性的基准,以衡量实际学到的内容。
cs.CV / 31 / 2603.13382

DINOv3 with Test-Time Calibration for Automated Carotid Intima-Media Thickness Measurement on CUBS v1

基于DINOv3的测试时校准在CUBS v1上自动化测量颈动脉内中膜厚度
Zhang, Zhenpeng, Lu, Jinwei, Dong, Yurui, Yuan, Bo
Abstract
Carotid intima-media thickness (CIMT) measured from B-mode ultrasound is an established vascular biomarker for atherosclerosis and cardiovascular risk stratification. Although a wide range of computerized methods have been proposed for carotid boundary delineation and CIMT estimation, robust and transferable deep models that jointly address segmentation and measurement remain underexplored, particularly in the era of vision foundation models. Motivated by recent advances in adapting DINOv3 to medical segmentation and exploiting DINOv3 in test-time optimization pipelines, we investigate a DINOv3-based framework for carotid intima-media complex segmentation and subsequent CIMT measurement on the Carotid Ultrasound Boundary Study (CUBS) v1 dataset. Our pipeline predicts the intima-media band at a fixed image resolution, extracts upper and lower boundaries column-wise, corrects for image resizing using the per-image calibration factor provided by CUBS, and reports CIMT in physical units. Across three patient-level test splits, our method achieved a mean test Dice of 0.7739 ± 0.0037 and IoU of 0.6384 ± 0.0044. The mean CIMT absolute error was 181.16 ± 11.57 µm, with a mean Pearson correlation of 0.480 ± 0.259. In a held-out validation subset (n=28), test-time threshold calibration reduced the mean absolute CIMT error from 141.0 µm at the default threshold to 101.1 µm at the measurement-optimized threshold, while simultaneously reducing systematic bias toward zero. Relative to the error ranges reported in the original CUBS benchmark for classical computerized methods, these results place a DINOv3-based approach within the clinically relevant ~0.1 mm measurement regime. Together, our findings support the feasibility of using vision foundation models for interpretable, calibration-aware CIMT measurement.
Chinese Translation
颈动脉内中膜厚度(CIMT)通过B超测量,是一种已建立的动脉粥样硬化和心血管风险分层的血管生物标志物。尽管已经提出了多种计算机化方法用于颈动脉边界的划分和CIMT的估计,但在分割和测量方面同时解决这些问题的稳健且可转移的深度模型仍然未得到充分探索,尤其是在视觉基础模型的时代。受到最近将DINOv3应用于医学分割和利用DINOv3进行测试时优化流程的进展的启发,我们研究了一种基于DINOv3的框架,用于颈动脉内中膜复合体的分割及随后在颈动脉超声边界研究(CUBS)v1数据集上进行CIMT测量。我们的流程在固定的图像分辨率下预测内中膜带,逐列提取上下边界,使用CUBS提供的每幅图像校准因子校正图像缩放,并以物理单位报告CIMT。在三个患者级别的测试分割中,我们的方法达到了平均测试Dice系数为0.7739 ± 0.0037,IoU为0.6384 ± 0.0044。平均CIMT绝对误差为181.16 ± 11.57 μm,平均Pearson相关系数为0.480 ± 0.259。在一个保留的验证子集(n=28)中,测试时阈值校准将平均绝对CIMT误差从默认阈值下的141.0 μm降低到测量优化阈值下的101.1 μm,同时系统性偏差也减少至接近零。相较于原始CUBS基准中经典计算方法报告的误差范围,这些结果将基于DINOv3的方法置于临床相关的约0.1 mm测量范围内。总的来说,我们的研究结果支持使用视觉基础模型进行可解释的、关注校准的CIMT测量的可行性。
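The column-wise boundary extraction and physical-unit conversion described above can be illustrated with a short sketch. The mask format and the calibration factor's units (micrometers per pixel) are assumptions for illustration; the paper's pipeline additionally predicts the band with a DINOv3-based segmenter, which this sketch does not include.

```python
# A minimal sketch of column-wise CIMT measurement from a band mask;
# mask layout and calibration units are illustrative assumptions.
import numpy as np

def cimt_from_mask(mask: np.ndarray, um_per_px: float) -> float:
    """Mean intima-media thickness in micrometers.

    mask: (H, W) boolean segmentation of the intima-media band.
    """
    thickness_px = []
    for col in range(mask.shape[1]):
        rows = np.flatnonzero(mask[:, col])
        if rows.size == 0:
            continue                     # column misses the band entirely
        upper, lower = rows.min(), rows.max()
        thickness_px.append(lower - upper + 1)
    if not thickness_px:
        raise ValueError("empty mask")
    return float(np.mean(thickness_px)) * um_per_px

# toy usage: a 10-px-thick band with a 60 um/px calibration factor
mask = np.zeros((128, 64), dtype=bool)
mask[40:50, :] = True
print(cimt_from_mask(mask, um_per_px=60.0))  # -> 600.0
```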
cs.CV / 32 / 2603.13383

Taming Vision Priors for Data Efficient mmWave Channel Modeling

驯化视觉先验以实现数据高效的毫米波信道建模
An, Zhenlin, Shangguan, Longfei, Kaewell, John, Pietraski, Philip, Senic, Jelena, Gentile, Camillo, Golmie, Nada, Jamieson, Kyle
Abstract
Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still faces deployment challenges due to its over-reliance on exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent with only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios, including office interiors, urban canyons, and dynamic public spaces show that VisRFTwin reduces channel measurement needs by up to 10× while achieving a 59% lower median delay spread error than pure data-driven deep learning methods.
Chinese Translation
准确建模毫米波(mmWave)传播对于实时增强现实(AR)和自主系统至关重要。可微分光线追踪提供了一种基于物理的解决方案,但由于过度依赖全面的信道测量或脆弱的、手动调整的场景模型来获取材料特性,仍面临部署挑战。我们提出了VisRFTwin,这是一种可扩展且数据高效的数字双胞胎框架,集成了视觉衍生的材料先验与可微分光线追踪。来自普通相机的多视角图像通过冻结的视觉-语言模型处理,以提取密集的语义嵌入,这些嵌入被转换为场景表面的介电常数和导电率的初始估计。这些先验初始化了基于Sionna的可微分光线追踪器,该追踪器通过梯度下降快速校准材料参数,仅需几十个稀疏的信道探测。一旦校准,视觉特征与材料参数之间的关联得以保留,从而实现快速转移到新的场景而无需重复校准。在三个真实场景(包括办公室内部、城市峡谷和动态公共空间)中的评估表明,VisRFTwin将信道测量需求减少了多达10倍,同时实现了比纯数据驱动的深度学习方法低59%的中位延迟扩展误差。
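The calibration step, in which vision-derived priors initialize per-surface material parameters that are then refined by gradient descent against sparse channel measurements, can be sketched as follows. The forward model below is a linear stand-in for the differentiable ray tracer (the paper uses Sionna), and every name and value is an assumption for illustration.

```python
# A minimal sketch of prior-initialized material calibration; the
# forward model is a stand-in, not the paper's Sionna ray tracer.
import torch

def calibrate(prior_perm, prior_cond, forward_model, measurements,
              steps: int = 200, lr: float = 1e-2):
    perm = prior_perm.clone().requires_grad_(True)   # relative permittivity
    cond = prior_cond.clone().requires_grad_(True)   # conductivity (S/m)
    opt = torch.optim.Adam([perm, cond], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = forward_model(perm, cond)             # simulated channel response
        loss = torch.nn.functional.mse_loss(pred, measurements)
        loss.backward()
        opt.step()
        with torch.no_grad():                        # keep parameters physical
            perm.clamp_(min=1.0)
            cond.clamp_(min=0.0)
    return perm.detach(), cond.detach()

# toy usage with a linear stand-in forward model over 4 surfaces
A = torch.randn(32, 8)
true = A @ torch.tensor([5.0] * 4 + [0.1] * 4)       # "measurements"
model = lambda p, c: A @ torch.cat([p, c])
p, c = calibrate(torch.full((4,), 4.0), torch.full((4,), 0.05), model, true)
```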
cs.CV / 33 / 2603.13385

VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering

VisualLeakBench:审计大型视觉-语言模型对个人身份信息泄露和社会工程攻击的脆弱性
Wang, Youting, Tang, Yuan, Qian, Yitian, Zhao, Chen
Abstract
As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment-relevant settings, their robustness against semantic visual attacks remains under-evaluated -- alignment is typically tested on explicit harmful content rather than privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite to audit LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images with 8 PII types, validated on 50 in-the-wild (IRL) real-world screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude 4, Gemini-3 Flash, Grok-4) with Wilson 95% confidence intervals. Claude 4 achieves the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern -- where verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models, reduces Claude 4's leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent. We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
Chinese Translation
随着大型视觉-语言模型(LVLMs)在集成代理工作流程和其他相关部署环境中的日益广泛应用,它们对语义视觉攻击的鲁棒性仍然未得到充分评估——对齐通常是在明确的有害内容上进行测试,而不是在隐私关键的多模态场景中。我们提出了VisualLeakBench,一个评估套件,用于审计LVLMs在OCR注入和上下文个人身份信息(PII)泄露方面的表现,使用1,000张合成生成的对抗性图像,涵盖8种PII类型,并在50个现实世界(IRL)真实截图上进行了验证,涵盖多样的视觉上下文。我们评估了四个前沿系统(GPT-5.2、Claude 4、Gemini-3 Flash、Grok-4),并使用Wilson 95%置信区间进行分析。Claude 4的OCR ASR(14.2%)最低,但PII ASR(74.4%)最高,表现出一种先合规后警告的模式——逐字数据泄露发生在任何安全导向语言之前。Grok-4的PII ASR最低(20.4%)。一个防御性系统提示消除了两个模型的PII泄露,将Claude 4的泄露从74.4%降低到2.2%,但对Gemini-3 Flash在合成数据上没有影响。值得注意的是,IRL验证显示Gemini-3 Flash确实对现实世界图像的缓解措施做出了反应(从50%降至0%),表明缓解的鲁棒性对模板敏感,而不是普遍缺失。我们发布了我们的数据集和代码,以便对与部署相关的视觉-语言系统进行可重复的鲁棒性和安全性评估。
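The Wilson 95% confidence intervals used for reporting attack success rates follow a standard closed form; a minimal sketch (k successes out of n trials):

```python
# A minimal sketch of the Wilson score interval for a binomial rate.
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96):
    """95% Wilson interval (z = 1.96) for a proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g., 744 leaks in 1000 prompts (the 74.4% ASR quoted above)
print(wilson_ci(744, 1000))  # roughly (0.716, 0.770)
```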
cs.CV / 34 / 2603.13386

Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

基于布局引导的可控病理图像生成与上下文扩散变换器
Shou, Yuntao, Cao, Xiangyong, Zhao, Qian, Meng, Deyu
Abstract
Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.
Chinese Translation
可控的病理图像合成需要对空间布局、组织形态和语义细节进行可靠的调节。然而,现有的文本引导扩散模型仅提供粗略的全局控制,缺乏强制执行细粒度结构约束的能力。由于缺乏将补丁级空间布局与详细诊断描述相结合的大型数据集,进展进一步受限,因为为千兆像素全切片图像生成此类注释对人类专家来说耗时过长。为了解决这些挑战,我们首先开发了一个可扩展的多代理LVLM注释框架,该框架将图像描述、诊断步骤提取和自动质量判断整合到一个协调的流程中,并通过人工验证过程评估系统的可靠性。该框架能够高效地构建细粒度和临床对齐的监督。基于整理的数据,我们提出了上下文扩散变换器(In-Context Diffusion Transformer,IC-DiT),这是一种布局感知的生成模型,将空间布局、文本描述和视觉嵌入整合到一个统一的扩散变换器中。通过层次化的多模态注意机制,IC-DiT在保持全局语义一致性的同时,准确保留结构和形态细节。在五个组织病理学数据集上的广泛实验表明,IC-DiT在保真度、空间可控性和诊断一致性方面优于现有方法。此外,生成的图像作为有效的数据增强资源,可用于癌症分类和生存分析等下游任务。
cs.CV / 35 / 2603.13387

Cylindrical Mechanical Projector for Omnidirectional Fringe Projection Profilometry

用于全方向条纹投影轮廓测量的圆柱形机械投影仪
Choi, Mincheol, Kim, Gaeun, Hyun, Jae-Sang
Abstract
The demand for 360-degree 3D reconstruction has significantly increased in recent years across various domains such as the metaverse and 3D telecommunication. Accordingly, the importance of precise and wide-area 3D sensing technology has become emphasized. While the digital fringe projection method has been widely used due to its high accuracy and implementation flexibility, it suffers from fundamental limitations such as unidirectional projection and a restricted available light spectrum. To address these issues, this paper proposes a novel 3D reconstruction method based on a cylindrical mechanical projector. The proposed method consists of a rotational stage and a cylindrical pattern generator with ON/OFF slots at two distinct intervals, enabling omnidirectional projection of multi-frequency phase-shifted fringe patterns. By applying a multi-wavelength unwrapping algorithm and a quasi-calibration technique, the system achieves high-accuracy 3D reconstruction using only a single camera. Experimental results, supported by repeatability and reproducibility analyses together with a measurement uncertainty evaluation, confirm reliable measurement performance and practical feasibility for omnidirectional 3D reconstruction. The expanded uncertainty of the reconstructed depth was evaluated as 0.215 mm.
Chinese Translation
近年来,360度3D重建的需求在元宇宙和3D通信等多个领域显著增加。因此,精确和广域3D传感技术的重要性愈加突出。尽管数字条纹投影方法因其高精度和实施灵活性而被广泛应用,但仍存在单向投影和可用光谱范围受限等基本局限性。为了解决这些问题,本文提出了一种基于圆柱形机械投影仪的新型3D重建方法。该方法由一个旋转平台和一个具有两个不同间隔的开关槽的圆柱形图案生成器组成,能够实现多频率相位偏移条纹图案的全方向投影。通过应用多波长展开算法和准校准技术,该系统仅使用一台相机即可实现高精度的3D重建。实验结果通过重复性和再现性分析以及测量不确定性评估得到了支持,确认了可靠的测量性能和全方向3D重建的实际可行性。重建深度的扩展不确定性评估为0.215毫米。
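The multi-wavelength unwrapping algorithm the abstract refers to belongs to a standard family of heterodyne methods; below is a minimal two-wavelength sketch (the paper's exact variant is not specified, so this is illustrative only). It is valid while the equivalent phase stays within one period.

```python
# A minimal sketch of standard two-wavelength (heterodyne) phase
# unwrapping; fringe pitches and the test signal are assumptions.
import numpy as np

def two_wavelength_unwrap(phi1, phi2, lam1, lam2):
    """Unwrap phi1 (fringe pitch lam1) using phi2 (pitch lam2 > lam1)."""
    phi_eq = np.mod(phi1 - phi2, 2 * np.pi)      # beat ("equivalent") phase
    lam_eq = lam1 * lam2 / (lam2 - lam1)         # much longer equivalent pitch
    # fringe order from the coarse phase, then the absolute phase
    k = np.round((phi_eq * lam_eq / lam1 - phi1) / (2 * np.pi))
    return phi1 + 2 * np.pi * k

# toy check: recover a linear absolute phase from two wrapped maps
x = np.linspace(0, 40, 500)                  # absolute phase in radians
phi1 = np.mod(x, 2 * np.pi)                  # pitch lam1 = 18 (arbitrary units)
phi2 = np.mod(x * 18 / 21, 2 * np.pi)        # pitch lam2 = 21
print(np.allclose(two_wavelength_unwrap(phi1, phi2, 18, 21), x))  # True
```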
cs.CV / 36 / 2603.13388

VeloEdit: Training-Free Consistent and Continuous Instruction-Based Image Editing via Velocity Field Decomposition

VeloEdit:基于速度场分解的无训练一致且连续的指令驱动图像编辑
Li, Zongqing, Liu, Zhihui, Xie, Yujie, Wu, Shansiyuan, Lv, Hongshen, Su, Songzhi
Abstract
Instruction-based image editing aims to modify source content according to textual instructions. However, existing methods built upon flow matching often struggle to maintain consistency in non-edited regions due to denoising-induced reconstruction errors that cause drift in preserved content. Moreover, they typically lack fine-grained control over edit strength. To address these limitations, we propose VeloEdit, a training-free method that enables highly consistent and continuously controllable editing. VeloEdit dynamically identifies editing regions by quantifying the discrepancy between the velocity fields responsible for preserving source content and those driving the desired edits. Based on this partition, we enforce consistency in preservation regions by substituting the editing velocity with the source-restoring velocity, while enabling continuous modulation of edit intensity in target regions via velocity interpolation. Unlike prior works that rely on complex attention manipulation or auxiliary trainable modules, VeloEdit operates directly on the velocity fields. Extensive experiments on Flux.1 Kontext and Qwen-Image-Edit demonstrate that VeloEdit improves visual consistency and editing continuity with negligible additional computational cost. Code is available at https://github.com/xmulzq/VeloEdit.
Chinese Translation
基于指令的图像编辑旨在根据文本指令修改源内容。然而,现有基于流匹配的方法往往难以保持未编辑区域的一致性,因为去噪引起的重建误差会导致保留内容的漂移。此外,它们通常缺乏对编辑强度的细粒度控制。为了解决这些局限性,我们提出了VeloEdit,这是一种无训练的方法,能够实现高度一致且可连续控制的编辑。VeloEdit通过量化负责保留源内容的速度场与驱动所需编辑的速度场之间的差异,动态识别编辑区域。基于这一划分,我们通过用源恢复速度替代编辑速度来强制保持保留区域的一致性,同时通过速度插值在目标区域实现编辑强度的连续调节。与依赖复杂注意力操作或辅助可训练模块的先前工作不同,VeloEdit直接在速度场上进行操作。在Flux.1 Kontext和Qwen-Image-Edit上的大量实验表明,VeloEdit在几乎不增加额外计算成本的情况下提高了视觉一致性和编辑连续性。代码可在 https://github.com/xmulzq/VeloEdit 获取。
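The core composition step, identifying edit regions from the discrepancy between the source-restoring and editing velocity fields and interpolating only there, can be sketched as follows; the threshold and strength values are illustrative assumptions rather than the paper's settings.

```python
# A minimal sketch of velocity-field composition for region-wise
# preservation and strength-controlled editing; values are assumptions.
import torch

def compose_velocity(v_src, v_edit, tau: float = 0.5, strength: float = 1.0):
    """v_src, v_edit: (C, H, W) velocity fields at the current step."""
    # per-location discrepancy between the two fields
    disc = (v_edit - v_src).norm(dim=0, keepdim=True)   # (1, H, W)
    edit_mask = (disc > tau).float()                     # 1 = edit region
    # preservation regions follow v_src exactly; edit regions are a
    # continuous interpolation controlled by `strength` in [0, 1]
    v_interp = v_src + strength * (v_edit - v_src)
    return edit_mask * v_interp + (1 - edit_mask) * v_src

# toy usage
v_s, v_e = torch.randn(4, 32, 32), torch.randn(4, 32, 32)
v = compose_velocity(v_s, v_e, tau=1.0, strength=0.5)
```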
cs.CV / 37 / 2603.13389

High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

通过分布条件扩散解码从预训练视觉-语言模型生成高保真文本到图像
Hong, Ji Woo, Yoon, Hee Suk, Koo, Gwanhyeong, Yoon, Eunseop, Eom, SooHwan, Dai, Qi, Luo, Chong, Yoo, Chang D.
Abstract
Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLM models to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby preserving the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Achieved solely through short training on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
Chinese Translation
近期的大规模视觉-语言模型(VLMs)展现了卓越的文本到图像生成能力,但其视觉保真度仍受到离散图像标记化的限制,这带来了重大挑战。尽管一些研究探讨了连续表示建模以增强视觉质量,但将预训练的VLM模型适配为这种表示需要相当于原始预训练的大规模数据和训练成本。为了解决这一限制,我们提出了一种基于扩散的解码框架,通过仅在预训练VLM的输出图像标记logits上训练一个扩散解码器来提高图像的保真度,从而保持原始模型的完整性。该框架的核心是Logit-to-Code分布映射,它将VLM的图像标记logits转换为具有不确定性特征的连续分布加权代码向量,为扩散解码提供了有效的条件信号。一种轻量级的Logit标定将来自VQ-VAE编码器的训练时间代理logits与VLM生成的logits对齐,从而减轻了训练与推理之间的差距。在这些表示的条件下,分布条件扩散解码器生成高保真的图像。我们的研究表明,仅通过在ImageNet-1K上的短期训练,方法在VQ-VAE重建和从VLM预测标记生成的文本到图像中均持续改善了视觉保真度。
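The Logit-to-Code Distributional Mapping can be illustrated as a distribution-weighted codebook lookup with a simple uncertainty feature appended; the shapes and the entropy-based uncertainty below are assumptions for illustration.

```python
# A minimal sketch of mapping image-token logits to continuous,
# distribution-weighted code vectors; shapes are assumptions.
import torch
import torch.nn.functional as F

def logits_to_codes(logits: torch.Tensor, codebook: torch.Tensor):
    """logits: (n_tokens, vocab); codebook: (vocab, code_dim)."""
    probs = F.softmax(logits, dim=-1)                    # (n, vocab)
    codes = probs @ codebook                             # expected code vectors
    # per-token entropy as a simple uncertainty feature
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1, keepdim=True)
    return torch.cat([codes, entropy], dim=-1)           # (n, code_dim + 1)

# toy usage: 256 tokens over a 1024-entry, 16-dim codebook
cond = logits_to_codes(torch.randn(256, 1024), torch.randn(1024, 16))
print(cond.shape)  # torch.Size([256, 17])
```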
cs.CV / 38 / 2603.13391

WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

WebVR:基于人类对齐视觉标准评估多模态大语言模型从视频重建网页的基准
Dai, Yuhong, Lai, Yanlin, Huang, Mitt, Guo, Hangyu, Li, Dingming, Peng, Hongbo, Li, Haodong, Zhao, Yingxiu, Lyu, Haoran, Ge, Zheng, Zhang, Xiangyu, Jiang, Daxin
Abstract
Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
Chinese Translation
现有的网页生成基准依赖于文本提示或静态截图作为输入。然而,视频自然传达了更丰富的信号,如交互流程、过渡时机和运动连续性,这些对于忠实的网页重建至关重要。尽管具有这种潜力,基于视频的网页生成仍然基本未被探索,且没有专门针对该任务的基准。为填补这一空白,我们引入了WebVR,一个评估多模态大语言模型(MLLMs)能否忠实重建网页的基准。WebVR包含175个来自不同类别的网页,所有网页均通过受控合成流程构建,而非网络爬虫,确保展示的多样性和真实性,并且与现有在线页面没有重叠。我们还设计了一种细致的、与人类对齐的视觉标准,从多个维度评估生成的网页。在对19个模型的实验中,发现重建细致风格和运动质量存在显著差距,而基于标准的自动评估与人类偏好的达成率达到96%。我们发布了数据集、评估工具包和基线结果,以支持未来在视频到网页生成方面的研究。
cs.CV / 39 / 2603.13392

Comparative Analysis of Deep Learning Architectures for Multi-Disease Classification of Single-Label Chest X-rays

多疾病分类单标签胸部X光片的深度学习架构比较分析
Bahram, Ali M., Omer, Saman Muhammad, Mohammed, Hardi M.
Abstract
Chest X-ray imaging remains the primary diagnostic tool for pulmonary and cardiac disorders worldwide, yet its accuracy is hampered by radiologist shortages and inter-observer variability. This study presents a systematic comparative evaluation of seven deep learning architectures for multi-class chest disease classification: ConvNeXt-Tiny, DenseNet121, DenseNet201, ResNet50, ViT-B/16, EfficientNetV2-M, and MobileNetV2. A balanced dataset of 18,080 chest X-ray images spanning five disease categories (Cardiomegaly, COVID-19, Normal, Pneumonia, and Tuberculosis) was constructed from three public repositories and partitioned at the patient level to prevent data leakage. All models were trained under identical conditions using ImageNet-pretrained weights, standardized preprocessing, and consistent hyperparameters. All seven architectures exceeded 90% test accuracy. ConvNeXt-Tiny achieved the highest performance (92.31% accuracy, 95.70% AUROC), while MobileNetV2 emerged as the most parameter-efficient model (3.5M parameters, 90.42% accuracy, 94.10% AUROC), completing training in 48 minutes. Tuberculosis and COVID-19 classification was near-perfect (AUROC >= 99.97%) across all architectures, while Normal, Cardiomegaly, and Pneumonia presented greater challenges due to overlapping radiographic features. Grad-CAM visualizations confirmed clinically consistent attention patterns across disease categories. These findings demonstrate that high-accuracy multi-disease chest X-ray classification is achievable without excessive computational resources, with important implications for AI-assisted diagnosis in both resource-rich and resource-constrained healthcare settings.
Chinese Translation
胸部X光成像仍然是全球肺部和心脏疾病的主要诊断工具,但其准确性受到放射科医生短缺和观察者间变异性的影响。本研究对七种深度学习架构进行了系统的比较评估,以进行多类别胸部疾病分类:ConvNeXt-Tiny、DenseNet121、DenseNet201、ResNet50、ViT-B/16、EfficientNetV2-M和MobileNetV2。构建了一个包含18,080张胸部X光图像的平衡数据集,涵盖五种疾病类别(心脏肥大、COVID-19、正常、肺炎和结核病),数据集来自三个公共库,并在患者级别进行划分以防止数据泄漏。所有模型在相同条件下进行训练,使用ImageNet预训练权重、标准化预处理和一致的超参数。所有七种架构的测试准确率均超过90%。ConvNeXt-Tiny达到了最高性能(92.31%准确率,95.70% AUROC),而MobileNetV2则成为参数效率最高的模型(3.5M参数,90.42%准确率,94.10% AUROC),训练时间为48分钟。结核病和COVID-19的分类在所有架构中几乎完美(AUROC >= 99.97%),而正常、心脏肥大和肺炎由于重叠的放射特征面临更大挑战。Grad-CAM可视化确认了疾病类别间临床一致的注意模式。这些发现表明,在不需要过多计算资源的情况下,实现高准确率的多疾病胸部X光分类是可行的,这对资源丰富和资源匮乏的医疗环境中的人工智能辅助诊断具有重要意义。
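The patient-level partitioning used above to prevent data leakage corresponds to a standard grouped split; a minimal scikit-learn sketch (variable names are illustrative):

```python
# A minimal sketch of a patient-level split: no patient contributes
# images to both train and test partitions.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(X, y, patient_ids, test_size=0.2, seed=42):
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(gss.split(X, y, groups=patient_ids))
    # sanity check: the two partitions share no patient
    assert not set(patient_ids[i] for i in train_idx) & \
               set(patient_ids[i] for i in test_idx)
    return train_idx, test_idx

ids = np.repeat(np.arange(100), 5)     # 100 patients, 5 images each
tr, te = patient_level_split(np.arange(500), np.zeros(500), ids)
```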
cs.CV / 40 / 2603.13393

Colony Grounded SAM2: Zero-shot detection and segmentation of bacterial colonies using foundation models

基于Colony Grounded SAM2的细菌菌落零样本检测与分割方法
Korporaal, Daan, de Kruijf, Patrick, Litjens, Ralph H. G. M., van der Velden, Bas H. M.
Abstract
The detection and classification of bacterial colonies in images of agar-plates is important in microbiology, but is hindered by the lack of labeled datasets. Therefore, we propose Colony Grounded SAM2, a zero-shot inference pipeline to detect and segment bacterial colonies in multiple settings without any further training. By utilizing the pre-trained foundation models Grounding DINO and Segment Anything Model 2, fine-tuned to the microbiological domain, we developed a model that is robust to data changes. Results showed a mean Average Precision of 93.1% and a Dice@detection score of 0.85, showing excellent detection and segmentation capabilities on out-of-distribution datasets. The entire pipeline with model weights is shared open access to aid annotation and classification in microbiology.
Chinese Translation
在琼脂平板图像中检测和分类细菌菌落对于微生物学至关重要,但由于缺乏标注数据集而受到限制。因此,我们提出了Colony Grounded SAM2,这是一种零样本推理流程,能够在多种环境中检测和分割细菌菌落,而无需进一步训练。通过利用经过微生物学领域微调的预训练基础模型Grounding DINO和Segment Anything Model 2,我们开发了一种对数据变化具有鲁棒性的模型。结果显示,平均精度均值为93.1\%,$Dice@detection$得分为0.85,表明在分布外数据集上具有出色的检测和分割能力。整个流程及模型权重已开放共享,以帮助微生物学中的标注和分类工作。
cs.CV / 41 / 2603.13394

Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models

基于语言引导的强化学习在大型视觉语言模型中的标记压缩
Cao, Sihan, Zhang, Jianwei, Zheng, Pengcheng, Yan, Jiaxin, Qin, Caiyan, Ye, Yalan, Dong, Wei, Wang, Peng, Yang, Yang, Zhang, Chaoning
Abstract
Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at https://github.com/MagicVicCoder/TPRL.
Chinese Translation
大型视觉语言模型(LVLMs)由于处理大量视觉标记而产生了可观的推理成本。现有方法通常难以将渐进的视觉标记减少建模为具有序列依赖性的多步骤决策过程,并且往往依赖于缺乏适应性优化的手工设计评分规则,这对于复杂推理轨迹而言效果不佳。为了解决这些局限性,我们提出了TPRL,一个通过语言引导的序列优化直接与最终任务性能相关联的强化学习框架,旨在学习自适应的剪枝轨迹。我们将视觉标记剪枝形式化为具有明确状态转移的序列决策过程,并采用自监督自编码器将视觉标记压缩为紧凑的状态表示,以便于高效的策略学习。剪枝策略通过学习示范进行初始化,随后使用近端策略优化(Proximal Policy Optimization, PPO)进行微调,以共同优化任务准确性和计算效率。我们的实验结果表明,TPRL能够去除高达66.7%的视觉标记,并在推理过程中实现高达54.2%的FLOPs减少,同时保持平均准确率仅下降0.7%,几乎无损。代码已发布在 https://github.com/MagicVicCoder/TPRL。
cs.CV / 42 / 2603.13395

COT-FM: Cluster-wise Optimal Transport Flow Matching

COT-FM:聚类优化传输流匹配
Chiang, Chiensheng, Tu, Kuan-Hsun, Liao, Jia-Wei, Chou, Cheng-Fu, Ke, Tsung-Wei
Abstract
We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. FM models often produce curved trajectories due to random or batchwise couplings, which increase discretization error and reduce sample quality. COT-FM fixes this by clustering target samples and assigning each cluster a dedicated source distribution obtained by reversing pretrained FM models. This divide-and-conquer strategy yields more accurate local transport and significantly straighter vector fields, all without changing the model architecture. As a plug-and-play approach, COT-FM consistently accelerates sampling and improves generation quality across 2D datasets, image generation benchmarks, and robotic manipulation tasks.
Chinese Translation
我们提出了COT-FM,这是一个通用框架,通过重塑流匹配(Flow Matching, FM)中的概率路径,以实现更快且更可靠的生成。FM模型通常由于随机或批量耦合而产生弯曲轨迹,这增加了离散化误差并降低了样本质量。COT-FM通过对目标样本进行聚类,并为每个聚类分配一个通过反转预训练FM模型获得的专用源分布来解决这个问题。这种分而治之的策略产生了更准确的局部传输,并显著使向量场更为笔直,且无需改变模型架构。作为一种即插即用的方法,COT-FM在2D数据集、图像生成基准和机器人操作任务中始终加速采样并提高生成质量。
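The cluster-wise coupling idea can be sketched in a few lines: cluster the targets, then pair each cluster only with its own source distribution. The per-cluster sources below are simple shifted Gaussians; the paper instead obtains them by reversing a pretrained FM model, which this sketch does not reproduce.

```python
# A minimal sketch of cluster-wise coupling for flow matching; the
# per-cluster source distributions are stand-in assumptions.
import torch
from sklearn.cluster import KMeans

def clusterwise_pairs(x1: torch.Tensor, n_clusters: int = 4, seed: int = 0):
    """x1: (n, d) target samples -> coupled (x0, x1) pairs for FM training."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(x1.numpy())
    labels = torch.as_tensor(labels, dtype=torch.long)
    # stand-in per-cluster source: a unit Gaussian around a fixed random
    # center per cluster (the paper derives these by model reversal)
    g = torch.Generator().manual_seed(seed)
    centers = torch.randn(n_clusters, x1.shape[1], generator=g)
    x0 = centers[labels] + torch.randn_like(x1)
    return x0, x1

# the usual FM regression target along the straight path:
# v_theta(x_t, t) should match x1 - x0, where x_t = (1 - t) * x0 + t * x1
x0, x1 = clusterwise_pairs(torch.randn(512, 2))
```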
cs.CV / 43 / 2603.13396

SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation

SERUM:一种简单、高效、稳健且统一的扩散基础图像生成标记方法
Kociszewski, Jan, Jastrzębski, Hubert, Stępkowski, Tymoteusz, Manijak, Filip, Rojek, Krzysztof, Boenisch, Franziska, Dziedzic, Adam
Abstract
We propose SERUM: an intriguingly simple yet highly effective method for marking images generated by diffusion models (DMs). We only add a unique watermark noise to the initial diffusion generation noise and train a lightweight detector to identify watermarked images, simplifying and unifying the strengths of prior approaches. SERUM provides robustness against any image augmentations or watermark removal attacks and is extremely efficient, all while maintaining negligible impact on image quality. In contrast to prior approaches, which are often only resilient to limited perturbations and incur significant training, injection, and detection costs, our SERUM achieves remarkable performance, with the highest true positive rate (TPR) at a 1% false positive rate (FPR) in most scenarios, along with fast injection and detection and low detector training overhead. Its decoupled architecture also seamlessly supports multiple users by embedding individualized watermarks with little interference between the marks. Overall, our method provides a practical solution to mark outputs from DMs and to reliably distinguish generated from natural images.
Chinese Translation
我们提出了SERUM:一种引人注目的简单但极为有效的扩散模型(DMs)生成图像标记方法。我们仅在初始扩散生成噪声中添加独特的水印噪声,并训练一个轻量级检测器来识别带水印的图像,从而简化并统一了先前方法的优点。SERUM对任何图像增强或水印去除攻击具有稳健性,并且极为高效,同时对图像质量的影响微乎其微。与先前的方法相比,这些方法通常仅对有限的扰动具有韧性,并且在训练、注入和检测方面成本高昂,而我们的SERUM在大多数场景中实现了显著的性能,在1%的假阳性率(FPR)下达到了最高的真正阳性率(TPR),同时具有快速的注入和检测以及低检测器训练开销。其解耦架构还无缝支持多个用户,通过嵌入个性化水印而相互之间干扰极小。总体而言,我们的方法为标记DMs输出提供了一种实用的解决方案,并能够可靠地区分生成图像与自然图像。
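The central mechanism, mixing a fixed watermark pattern into the initial generation noise and training a lightweight detector on the resulting images, can be sketched as follows; the mixing weight and detector architecture are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of watermarking the initial diffusion noise plus a
# lightweight detector; values and architecture are assumptions.
import torch
import torch.nn as nn

def watermarked_init_noise(shape, w: torch.Tensor, alpha: float = 0.1):
    """Blend a fixed unit-variance watermark pattern into N(0, I) noise,
    rescaling so the overall scale stays close to standard Gaussian."""
    eps = torch.randn(shape)
    return (eps + alpha * w) / (1 + alpha ** 2) ** 0.5

# a lightweight binary detector: does an image carry the watermark?
detector = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
)

w = torch.randn(4, 64, 64)
w = w / w.std()                               # one watermark per user
z0 = watermarked_init_noise((4, 64, 64), w)   # pass to the DM sampler
```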
cs.CV / 44 / 2603.13397

TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

TennisExpert:迈向专家级分析的体育视频理解
Liu, Zhaoyu, Weng, Xi, Hu, Lianyu, Hou, Zhe, Jiang, Kan, Dong, Jin Song, Liu, Yang
Abstract
Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
Chinese Translation
网球是最受欢迎的运动之一,产生了大量的广播视频,具有进行专业分析、自动化教练和实时评论的强大潜力。然而,由于两个主要挑战,自动网球理解仍然未得到充分探索:(1)缺乏具有细粒度注释和专家级评论的大规模基准数据集,以及(2)构建适合实时部署的准确且高效的多模态系统的困难。为了解决这些挑战,我们引入了TennisVL,这是一个大规模的网球基准,包含超过200场职业比赛(471.9小时)和40,000多个回合级别的片段。与现有的侧重于逐步叙述的评论数据集不同,TennisVL强调捕捉战术推理、球员决策和比赛动态的专家分析评论。此外,我们提出了TennisExpert,一个多模态网球理解框架,它将视频语义解析器与基于Qwen3-VL-8B构建的增强记忆模型集成在一起。解析器提取关键比赛元素(例如,得分、击球序列、球弹跳和球员位置),而层次记忆模块则捕捉短期和长期的时间上下文。实验表明,TennisExpert在捕捉战术背景和比赛动态的能力上,始终优于强大的专有基准,包括GPT-5、Gemini和Claude。
cs.CV / 45 / 2603.13398

Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Qianfan-OCR:一个统一的端到端文档智能模型
Dong, Daxiang, Zheng, Mingming, Xu, Dong, Luo, Chunhua, Zhuang, Bairong, Li, Yuxuan, He, Ruoyun, Wang, Haoran, Zhang, Wenyu, Wang, Wenbo, Wang, Yicheng, Xiong, Xue, Zheng, Ayong, Zuo, Xiaoying, Ou, Ziwei, Gu, Jingnan, Guo, Quanhao, Wu, Jianmin, Yin, Dawei, Shen, Dou
Abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
Chinese Translation
我们提出了Qianfan-OCR,这是一个具有40亿参数的端到端视觉-语言模型,统一了文档解析、布局分析和文档理解于单一架构中。该模型能够直接进行图像到Markdown的转换,并支持多种基于提示的任务,包括表格提取、图表理解、文档问答和关键信息提取。为了解决端到端OCR中显式布局分析的缺失,我们提出了Layout-as-Thought,这是一种由特殊思考标记触发的可选思考阶段,生成结构化的布局表示——边界框、元素类型和阅读顺序——在产生最终输出之前,恢复布局基础能力,同时提高复杂布局下的准确性。Qianfan-OCR在OmniDocBench v1.5(93.12)和OlmOCR Bench(79.8)中在端到端模型中排名第一,在OCRBench、CCOCR、DocVQA和ChartQA上与同规模的一般视觉语言模型相比,取得了竞争性结果,并在公共关键信息提取基准测试中获得最高平均分,超越了Gemini-3.1-Pro、Seed-2.0和Qwen3-VL-235B。该模型可通过百度AI云Qianfan平台公开访问。
cs.CV / 46 / 2603.13399

FlowAD: Ego-Scene Interactive Modeling for Autonomous Driving

FlowAD:用于自主驾驶的自我场景交互建模
Guo, Mingzhe, Yang, Yixiang, Han, Chuanrong, Zhang, Rufeng, Li, Shirui, Wan, Ji, Zhang, Zhipeng
Abstract
Effective environment modeling is the foundation for autonomous driving, underpinning tasks from perception to planning. However, current paradigms often inadequately consider the feedback of ego motion to the observation, which leads to an incomplete understanding of the driving process and consequently limits the planning capability. To address this issue, we introduce a novel ego-scene interactive modeling paradigm. Inspired by human recognition, the paradigm represents ego-scene interaction as the scene flow relative to the ego-vehicle. This conceptualization allows for modeling ego-motion feedback within a feature learning pattern, advantageously utilizing existing log-replay datasets rather than relying on scenario simulations. We specifically propose FlowAD, a general flow-based framework for autonomous driving. Within it, an ego-guided scene partition first constructs basic flow units to quantify scene flow. The ego-vehicle's forward direction and steering velocity directly shape the partition, which reflects ego motion. Then, based on flow units, spatial and temporal flow predictions are performed to model dynamics of scene flow, encompassing both spatial displacement and temporal variation. The final task-aware enhancement exploits learned spatio-temporal flow dynamics to benefit diverse tasks through object and region-level strategies. We also propose a novel Frames before Correct Planning (FCP) metric to assess the scene understanding capability. Experiments in both open and closed-loop evaluations demonstrate FlowAD's generality and effectiveness across perception, end-to-end planning, and VLM analysis. Notably, FlowAD reduces the collision rate by 19% relative to SparseDrive, with an FCP improvement of 1.39 frames (60%) on nuScenes, and achieves an impressive driving score of 51.77 on Bench2Drive, demonstrating its superiority. Code, model, and configurations will be released here.
Chinese Translation
有效的环境建模是自主驾驶的基础,支撑着从感知到规划的各项任务。然而,目前的范式往往未能充分考虑自我运动对观察的反馈,这导致对驾驶过程的理解不完整,从而限制了规划能力。为了解决这一问题,我们提出了一种新颖的自我场景交互建模范式。该范式受到人类识别的启发,将自我场景交互表示为相对于自我车辆的场景流。这种概念化允许在特征学习模式中建模自我运动反馈,利用现有的日志重放数据集,而不是依赖场景模拟。我们特别提出了FlowAD,这是一个用于自主驾驶的通用流基框架。在该框架中,自我引导的场景分区首先构建基本流单元,以量化场景流。自我车辆的前进方向和转向速度直接影响分区,反映自我运动。然后,基于流单元,进行空间和时间流预测,以建模场景流的动态,包括空间位移和时间变化。最终的任务感知增强利用学习到的时空流动态,通过物体和区域级策略来惠及多种任务。我们还提出了一种新颖的“修正规划前帧数”(Frames before Correct Planning,FCP)指标,以评估场景理解能力。在开放和闭环评估中的实验表明,FlowAD在感知、端到端规划和VLM分析方面的通用性和有效性。值得注意的是,FlowAD在nuScenes上将碰撞率降低了19%,FCP改进了1.39帧(60%),并在Bench2Drive上取得了51.77的出色驾驶评分,证明了其优越性。代码、模型和配置将在此发布。
cs.CV / 47 / 2603.13400

Combining Microscopy Data and Metadata for Reconstruction of Cellular Traction Forces Using a Hybrid Vision Transformer-U-Net

结合显微镜数据和元数据,通过混合视觉变换器-U-Net重建细胞牵引力
Huang, Yunfei, Van der Vorst, Elena, Richard, Alexander, Sabass, Benedikt
Abstract
Traction force microscopy (TFM) is a widely used technique for quantifying the forces that cells exert on their surrounding extracellular matrix. Although deep learning methods have recently been applied to TFM data analysis, several challenges remain, particularly achieving reliable inference across multiple spatial scales and integrating additional contextual information, such as cell type, to improve accuracy. In this study, we propose ViT+UNet, a robust deep learning architecture that integrates a U-Net with a Vision Transformer. Our results demonstrate that this hybrid model outperforms both standalone U-Net and Vision Transformer architectures in predicting traction force fields. Furthermore, ViT+UNet exhibits superior generalization across diverse spatial scales and varying noise levels, enabling its application to TFM datasets obtained from different experimental setups and imaging systems. By appropriately structuring the input data, our approach also allows the inclusion of metadata, in our case cell-type information, to enhance prediction specificity and accuracy.
Chinese Translation
牵引力显微镜(TFM)是一种广泛应用于量化细胞对其周围细胞外基质施加的力的技术。尽管最近深度学习方法已被应用于TFM数据分析,但仍存在一些挑战,特别是在多个空间尺度上实现可靠推断以及整合额外的上下文信息(如细胞类型)以提高准确性。在本研究中,我们提出了ViT+UNet,这是一种将U-Net与视觉变换器(Vision Transformer)相结合的强大深度学习架构。我们的结果表明,该混合模型在预测牵引力场方面优于独立的U-Net和视觉变换器架构。此外,ViT+UNet在不同空间尺度和变化噪声水平下表现出更好的泛化能力,使其能够应用于来自不同实验设置和成像系统的TFM数据集。通过适当构建输入数据,我们的方法还允许包含元数据,在我们的案例中是细胞类型信息,以增强预测的特异性和准确性。
cs.CV / 48 / 2603.13401

MAD: Microenvironment-Aware Distillation -- A Pretraining Strategy for Virtual Spatial Omics from Microscopy

MAD:微环境感知蒸馏——一种用于显微镜虚拟空间组学的预训练策略
Han, Jiashu, Liu, Kunzan, Kim, Yeojin, Sinha, Saurabh, You, Sixian
Abstract
Bridging microscopy and omics would allow us to read molecular states from images, at single-cell resolution and tissue scale, without the cost and throughput limits of omics technologies. Self-supervised pretraining offers a scalable approach with minimal labels, yet it remains unclear how to encode single-cell identity within tissue environments and how much biological information such models can capture. Here, we introduce MAD (microenvironment-aware distillation), a pretraining strategy that learns cell-centric embeddings by jointly self-distilling the morphology view and the microenvironment view of the same indexed cell into a unified embedding space. Across diverse tissues and imaging modalities, MAD achieves state-of-the-art prediction performance on downstream tasks including cell subtyping, transcriptomic prediction, and bioinformatic inference. MAD even outperforms foundation models with a similar number of model parameters that have been trained on substantially larger datasets. These results demonstrate that MAD's dual-view joint self-distillation effectively captures the complexity and diversity of cells within tissues. Together, this establishes MAD as a general tool for representation learning in microscopy, enabling virtual spatial omics and biological insights from vast microscopy datasets.
Chinese Translation
将显微镜技术与组学结合,将使我们能够在单细胞分辨率和组织尺度下,从图像中读取分子状态,而无需组学技术的成本和通量限制。自监督预训练提供了一种可扩展的方法,仅需最少的标签,但如何在组织环境中编码单细胞身份,以及此类模型能够捕获的生物信息的程度,仍然是一个未解之谜。在此,我们介绍了MAD(微环境感知蒸馏),这是一种通过将同一索引细胞的形态视图和微环境视图共同自蒸馏到统一嵌入空间中来学习以细胞为中心的嵌入的预训练策略。在多种组织和成像模式下,MAD在下游任务(包括细胞亚型分类、转录组预测和生物信息推断)上实现了最先进的预测性能。MAD甚至超越了在更大数据集上训练的具有相似参数数量的基础模型。这些结果表明,MAD的双视图联合自蒸馏有效地捕捉了组织内细胞的复杂性和多样性。总之,这确立了MAD作为显微镜中表征学习的一种通用工具,使得从庞大的显微镜数据集中获得虚拟空间组学和生物学见解成为可能。
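The dual-view joint self-distillation can be sketched with a DINO-style cross-entropy between the morphology view and the microenvironment view of the same cell; the encoders, temperatures, and loss symmetrization below are placeholder assumptions, not the paper's exact recipe.

```python
# A minimal sketch of dual-view self-distillation into a shared
# embedding space; encoders and temperatures are assumptions.
import torch
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, t_s=0.1, t_t=0.04):
    """Cross-entropy from a (detached) teacher view to a student view."""
    p_t = F.softmax(teacher_out.detach() / t_t, dim=-1)
    log_p_s = F.log_softmax(student_out / t_s, dim=-1)
    return -(p_t * log_p_s).sum(-1).mean()

def mad_step(student, teacher, morph_view, micro_view):
    s_m, s_u = student(morph_view), student(micro_view)
    with torch.no_grad():
        t_m, t_u = teacher(morph_view), teacher(micro_view)
    # each view's student output matches the other view's teacher output
    return 0.5 * (distill_loss(s_m, t_u) + distill_loss(s_u, t_m))

# toy usage with linear stand-in encoders
student = torch.nn.Linear(128, 64)
teacher = torch.nn.Linear(128, 64)   # in practice an EMA copy of the student
loss = mad_step(student, teacher, torch.randn(8, 128), torch.randn(8, 128))
```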
cs.CV / 49 / 2603.13402

Event-Driven Video Generation

事件驱动的视频生成
Maduabuchi, Chika
Abstract
State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
Chinese Translation
最先进的文本到视频模型往往在逐帧上看起来逼真,但在简单交互上却表现不佳:运动在接触之前开始,动作未能实现,物体在放置后漂移,支撑关系破裂。我们认为这源于帧优先去噪,这种方法在每一步更新潜在状态时并没有明确交互何时何地处于活动状态。我们提出了事件驱动的视频生成(Event-Driven Video Generation, EVD),这是一个与DiT兼容的最小框架,使得采样基于事件:一个轻量级的事件头预测与标记对齐的事件活动,事件驱动的损失在训练过程中将活动与状态变化相结合,而事件门控采样(结合滞后和早期步骤调度)抑制虚假更新,同时在交互期间集中更新。在EVD-Bench上,EVD始终改善人类偏好和VBench动态,显著减少状态持续性、空间准确性、支撑关系和接触稳定性方面的失败模式,而不牺牲外观。这些结果表明,明确的事件基础是减少视频生成中交互幻觉的实用抽象。
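The event-gated sampling with hysteresis can be sketched as a per-token state machine: a token switches on when its event score exceeds a high threshold and only switches off again below a lower one, suppressing flicker. The thresholds and update rule below are illustrative assumptions.

```python
# A minimal sketch of hysteresis gating of per-token updates; the
# scores would come from an event head, which is not modeled here.
import torch

def hysteresis_gate(scores, prev_active, hi=0.7, lo=0.3):
    """scores, prev_active: (n_tokens,) -> new boolean gate."""
    turn_on = scores > hi
    stay_on = prev_active & (scores > lo)
    return turn_on | stay_on

def gated_update(x, x_new, active):
    """Only tokens inside active events receive the denoising update."""
    return torch.where(active.unsqueeze(-1), x_new, x)

# toy usage over a few sampling steps
active = torch.zeros(16, dtype=torch.bool)
x = torch.randn(16, 8)
for _ in range(4):
    scores = torch.rand(16)          # stand-in for event-head predictions
    active = hysteresis_gate(scores, active)
    x = gated_update(x, x + 0.1 * torch.randn_like(x), active)
```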
cs.CV / 50 / 2603.13403

Diabetic Retinopathy Grading with CLIP-based Ranking-Aware Adaptation:A Comparative Study on Fundus Image

基于CLIP的排名感知适应的糖尿病视网膜病变分级:一项关于眼底图像的比较研究
Cho, Sungjun
Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness, and automated fundus image grading can play an important role in large-scale screening. In this work, we investigate three CLIP-based approaches for five-class DR severity grading: (1) a zero-shot baseline using prompt engineering, (2) a hybrid FCN-CLIP model augmented with CBAM attention, and (3) a ranking-aware prompting model that encodes the ordinal structure of DR progression. We train and evaluate on a combined dataset of APTOS 2019 and Messidor-2 (n=5,406), addressing class imbalance through resampling and class-specific optimal thresholding. Our experiments show that the ranking-aware model achieves the highest overall accuracy (93.42%, AUROC 0.9845) and strong recall on clinically critical severe cases, while the hybrid FCN-CLIP model (92.49%, AUROC 0.99) excels at detecting proliferative DR. Both substantially outperform the zero-shot baseline (55.17%, AUROC 0.75). We analyze the complementary strengths of each approach and discuss their practical implications for screening contexts.
Chinese Translation
糖尿病视网膜病变(DR)是可预防失明的主要原因,自动化眼底图像分级在大规模筛查中可以发挥重要作用。在本研究中,我们探讨了三种基于CLIP的方法用于五级DR严重程度分级:(1)使用提示工程的零样本基线,(2)增强了CBAM注意力的混合FCN-CLIP模型,以及(3)编码DR进展的序数结构的排名感知提示模型。我们在APTOS 2019和Messidor-2的结合数据集上进行训练和评估(n=5,406),通过重采样和类特定的最佳阈值处理类不平衡。实验结果表明,排名感知模型在整体准确率上达到最高(93.42%,AUROC 0.9845),并在临床关键的严重病例中表现出强大的召回率,而混合FCN-CLIP模型(92.49%,AUROC 0.99)在检测增殖性DR方面表现优异。两者均显著优于零样本基线(55.17%,AUROC 0.75)。我们分析了每种方法的互补优势,并讨论了它们在筛查环境中的实际应用意义。
cs.CV / 51 / 2603.13405

Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

锚点强制:用于交互式流媒体视频扩散的锚点记忆和三区域 RoPE
Yang, Yang, Zhang, Tianyi, Huang, Wei, Chen, Jinwei, Wu, Boxi, He, Xiaofei, Cai, Deng, Li, Bo, Jiang, Peng-Tao
Abstract
Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
Chinese Translation
交互式长视频生成需要提示切换以引入新主题或事件,同时在较长时间范围内保持感知保真度和连贯运动。最近的蒸馏流媒体视频扩散模型通过重用滚动的 KV 缓存实现长距离生成,使得在每次切换时通过重新缓存实现提示切换交互。然而,现有的流媒体方法仍然表现出逐渐的质量下降和减弱的运动动态。我们识别出交互式流媒体生成特有的两种失败模式:(i)在每次提示切换时,当前的缓存维护无法同时保留基于 KV 的语义上下文和最近的潜在线索,导致边界条件弱化和感知质量降低;(ii)在蒸馏过程中,无界时间索引导致与预训练主干的有界 RoPE 机制的位置信息分布偏移,削弱了预训练运动先验和长时间运动保留。为了解决这些问题,我们提出了锚点强制(Anchor Forcing),一个以缓存为中心的框架,具有两种设计。首先,锚点引导的重新缓存机制将 KV 状态存储在锚点缓存中,并在每次提示切换时从这些锚点热启动重新缓存,减少切换后的证据损失并稳定感知质量。其次,具有区域特定参考原点的三区域 RoPE,以及 RoPE 重新对齐蒸馏,将无界流媒体索引与预训练 RoPE 机制协调,以更好地保留运动先验。对长视频的实验表明,我们的方法在交互式设置中提高了感知质量和运动指标,相较于先前的流媒体基线表现更佳。项目页面: https://github.com/vivoCameraResearch/Anchor-Forcing
cs.CV / 52 / 2603.13406

Nuanced Emotion Recognition Based on a Segment-based MLLM Framework Leveraging Qwen3-Omni for AH Detection

基于分段 MLLM 框架的细腻情感识别:利用 Qwen3-Omni 进行情感矛盾检测
Tang, Liang, Li, Hongda, Zhang, Jiayu, Chen, Long, Li, Shuxian, Pei, Siqi, Duan, Tiaonan, Cheng, Yuhao
Abstract
Emotion recognition in videos is a pivotal task in affective computing, where identifying subtle psychological states such as Ambivalence and Hesitancy holds significant value for behavioral intervention and digital health. Ambivalence and Hesitancy states often manifest through cross-modal inconsistencies such as discrepancies between facial expressions, vocal tones, and textual semantics, posing a substantial challenge for automated recognition. This paper proposes a recognition framework that integrates temporal segment modeling with Multimodal Large Language Models. To address computational efficiency and token constraints in long video processing, we employ a segment-based strategy, partitioning videos into short clips with a maximum duration of 5 seconds. We leverage the Qwen3-Omni-30B-A3B model, fine-tuned on the BAH dataset using LoRA and full-parameter strategies via the MS-Swift framework, enabling the model to synergistically analyze visual and auditory signals. Experimental results demonstrate that the proposed method achieves an accuracy of 85.1% on the test set, significantly outperforming existing benchmarks and validating the superior capability of Multimodal Large Language Models in capturing complex and nuanced emotional conflicts. The code is released at https://github.com/dlnn123/A-H-Detection-with-Qwen-Omni.git.
Chinese Translation
视频中的情感识别是情感计算中的一项关键任务,识别诸如矛盾和犹豫等微妙心理状态对于行为干预和数字健康具有重要价值。矛盾和犹豫状态通常通过跨模态不一致性表现出来,例如面部表情、语音语调和文本语义之间的差异,这对自动识别提出了重大挑战。本文提出了一种识别框架,将时间段建模与多模态大型语言模型相结合。为了解决长视频处理中的计算效率和令牌限制问题,我们采用了一种基于分段的策略,将视频划分为最大持续时间为 5 秒的短片段。我们利用 Qwen3-Omni-30B-A3B 模型,该模型在 BAH 数据集上通过 LoRA 和全参数策略进行微调,并通过 MS-Swift 框架实现,使模型能够协同分析视觉和听觉信号。实验结果表明,所提出的方法在测试集上达到了 85.1% 的准确率,显著超越现有基准,验证了多模态大型语言模型在捕捉复杂和细腻情感冲突方面的优越能力。代码已发布在 https://github.com/dlnn123/A-H-Detection-with-Qwen-Omni.git。
cs.CV / 53 / 2603.13410

Bridging the Visual-to-Physical Gap: Physically Aligned Representations for Fall Risk Analysis

弥合视觉与物理之间的差距: 用于跌倒风险分析的物理对齐表示
Zhang, Xianqi
Abstract
Vision-based fall analysis has advanced rapidly, but a key bottleneck remains: visually similar motions can correspond to very different physical outcomes because small differences in contact mechanics and protective responses are hard to infer from appearance alone. Most existing approaches handle this by supervised injury prediction, which depends on reliable injury labels. In practice, such labels are difficult to obtain: video evidence is often ambiguous (occlusion, viewpoint limits), and true injury events are rare and cannot be safely staged, leading to noisy supervision. We address this problem with PHARL (PHysics-aware Alignment Representation Learning), which learns physically meaningful fall representations without requiring clinical outcome labels. PHARL regularizes motion embeddings with two complementary constraints: (1) trajectory-level temporal consistency for stable representation learning, and (2) multi-class physics alignment, where simulation-derived contact outcomes shape embedding geometry. By pairing video windows with temporally aligned simulation descriptors, PHARL captures local impact-relevant dynamics while keeping inference purely feed-forward. Experiments on four public datasets show that PHARL consistently improves risk-aligned representation quality over visual-only baselines while maintaining strong fall-detection performance. Notably, PHARL also exhibits zero-shot ordinality: an interpretable severity structure (Head > Trunk > Supported) emerges without explicit ordinal supervision.
Chinese Translation
基于视觉的跌倒分析发展迅速,但仍然存在一个关键瓶颈:视觉上相似的动作可能对应于非常不同的物理结果,因为接触力学和保护反应的小差异难以仅通过外观推断。现有的大多数方法通过监督性伤害预测来解决这一问题,而这依赖于可靠的伤害标签。在实践中,这种标签难以获得:视频证据通常存在模糊性(遮挡、视角限制),而真正的伤害事件稀少且无法安全地重现,导致监督噪声。我们通过PHARL(PHysics-aware Alignment Representation Learning)来解决这个问题,PHARL能够学习具有物理意义的跌倒表示,而无需临床结果标签。PHARL通过两个互补约束来规范运动嵌入:(1)轨迹级时间一致性以实现稳定的表示学习,以及(2)多类物理对齐,其中来自模拟的接触结果塑造嵌入几何。通过将视频窗口与时间上对齐的模拟描述符配对,PHARL捕捉局部影响相关的动态,同时保持推理的纯前馈特性。在四个公共数据集上的实验表明,PHARL在提高风险对齐表示质量方面始终优于仅基于视觉的基线,同时保持强大的跌倒检测性能。值得注意的是,PHARL还展现了零样本序列性:一个可解释的严重性结构(头部 > 核心 > 支撑)在没有显式序列监督的情况下出现。
cs.CV / 54 / 2603.13412

WAT: Online Video Understanding Needs Watching Before Thinking

WAT:在线视频理解需要在思考之前观看
Han, Zifan, Sun, Hongbo, Xu, Jinglin, Tang, Canhui, Lei, Yulong, Zhang, Xuchong, Sun, Hongbin, He, Zhongjiang, Sun, Hao
Abstract
Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
Chinese Translation
多模态大型语言模型(MLLMs)在图像理解方面展现出了强大的能力,这激励了近期将其扩展到视频推理的努力。然而,现有的视频大型语言模型在在线流媒体场景中表现不佳,因为在严格的内存限制下必须保持长时间的上下文。我们提出了WAT(Watching Before Thinking),一种用于在线视频推理的两阶段框架。WAT将处理分为一个与查询无关的观看阶段和一个由查询触发的思考阶段。观看阶段构建了一个层次化的记忆系统,其中短期记忆(STM)用于缓冲最近的帧,而固定容量的长期记忆(LTM)则使用关注冗余的驱逐策略维护历史内容的多样化摘要。在思考阶段,上下文感知检索机制将查询与当前的STM上下文结合,以从LTM中检索相关的历史帧进行跨时间推理。为了支持在线视频任务的训练,我们引入了WAT-85K,一个包含流媒体风格注释的数据集,强调实时感知、反向追踪和预测。实验表明,WAT在在线视频基准测试中实现了最先进的性能,包括在StreamingBench上达到77.7%的准确率,在OVO-Bench上达到55.2%,超越了现有的开源在线视频大型语言模型,同时以实时帧率运行。
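The fixed-capacity LTM with a redundancy-aware eviction policy can be sketched as follows: when the buffer is full, drop the entry most similar to the rest so the stored summary stays diverse. The capacity and similarity measure are assumptions for illustration.

```python
# A minimal sketch of a fixed-capacity long-term memory with
# redundancy-aware eviction; capacity and similarity are assumptions.
import torch
import torch.nn.functional as F

class LongTermMemory:
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.feats: list[torch.Tensor] = []

    def add(self, feat: torch.Tensor):
        self.feats.append(feat)
        if len(self.feats) > self.capacity:
            z = F.normalize(torch.stack(self.feats), dim=-1)
            sim = z @ z.T
            sim.fill_diagonal_(-1.0)
            # redundancy of an entry = its highest similarity to any other
            evict = sim.max(dim=1).values.argmax().item()
            self.feats.pop(evict)

ltm = LongTermMemory(capacity=8)
for _ in range(20):
    ltm.add(torch.randn(32))        # would be a frame feature from the STM
```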
cs.CV / 55 / 2603.13415

Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation

基于距离感知的软提示学习用于多模态情感-唤醒估计
Jung, Byeongjin, Park, Chanyeong, Lim, Sejoon
Abstract
Valence-arousal (VA) estimation is crucial for capturing the nuanced nature of human emotions in naturalistic environments. While pre-trained Vision-Language models like CLIP have shown remarkable semantic alignment capabilities, their application in continuous regression tasks is often limited by the discrete nature of text prompts. In this paper, we propose a novel multimodal framework for VA estimation that introduces Distance-aware Soft Prompt Learning to bridge the gap between semantic space and continuous dimensions. Specifically, we partition the VA space into a 3X3 grid, defining nine emotional regions, each associated with distinct textual descriptions. Rather than a hard categorization, we employ a Gaussian kernel to compute soft labels based on the Euclidean distance between the ground truth coordinates and the region centers, allowing the model to learn fine-grained emotional transitions. For multimodal integration, our architecture utilizes a CLIP image encoder and an Audio Spectrogram Transformer (AST) to extract robust spatial and acoustic features. These features are temporally modeled via Gated Recurrent Units (GRUs) and integrated through a hierarchical fusion scheme that sequentially combines cross-modal attention for alignment and gated fusion for adaptive refinement. Experimental results on the Aff-Wild2 dataset demonstrate that our proposed semantic-guided approach significantly enhances the accuracy of VA estimation, achieving competitive performance in unconstrained ``in-the-wild'' scenarios.
Chinese Translation
情感-唤醒(VA)估计对于捕捉自然环境中人类情感的细微特征至关重要。尽管预训练的视觉-语言模型如CLIP在语义对齐能力上表现出色,但其在连续回归任务中的应用常常受到文本提示离散性质的限制。在本文中,我们提出了一种新颖的多模态框架用于VA估计,该框架引入了基于距离感知的软提示学习,以弥合语义空间与连续维度之间的差距。具体而言,我们将VA空间划分为3X3的网格,定义九个情感区域,每个区域与特定的文本描述相关联。我们采用高斯核而非硬分类,根据真实坐标与区域中心之间的欧几里得距离计算软标签,从而使模型能够学习细致的情感过渡。为了实现多模态集成,我们的架构利用CLIP图像编码器和音频谱图变换器(AST)提取稳健的空间和声学特征。这些特征通过门控递归单元(GRUs)进行时间建模,并通过分层融合方案进行集成,该方案依次结合跨模态注意力进行对齐和门控融合进行自适应精细化。在Aff-Wild2数据集上的实验结果表明,我们提出的基于语义引导的方法显著提高了VA估计的准确性,在无约束的“野外”场景中达到了具有竞争力的表现。
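The Gaussian soft-label construction is fully specified by the abstract up to the kernel bandwidth; below is a minimal sketch over a 3x3 grid of the VA plane (the grid extent and sigma are assumptions).

```python
# A minimal sketch of distance-aware soft labels over a 3x3 VA grid;
# the [-1, 1]^2 extent and the bandwidth sigma are assumptions.
import numpy as np

# nine region centers of a 3x3 grid over [-1, 1]^2 (valence, arousal)
ticks = np.array([-2 / 3, 0.0, 2 / 3])
centers = np.array([(v, a) for a in ticks for v in ticks])   # (9, 2)

def soft_label(va: np.ndarray, sigma: float = 0.4) -> np.ndarray:
    """va: (2,) ground-truth (valence, arousal) -> (9,) soft label."""
    d2 = ((centers - va) ** 2).sum(axis=1)     # squared Euclidean distances
    w = np.exp(-d2 / (2 * sigma**2))           # Gaussian kernel weights
    return w / w.sum()                         # normalize to a distribution

print(soft_label(np.array([0.5, -0.2])).round(3))
```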
cs.CV / 56 / 2603.13427

MIBench: Evaluating LMMs on Multimodal Interaction

MIBench:评估大型多模态模型在多模态交互中的表现
Miao, Yu, Yang, Zequn, Wei, Yake, Chen, Ziheng, Ni, Haotian, Duan, Haodong, Chen, Kai, Hu, Di
Abstract
Different multimodal scenarios require integrating and utilizing information across modalities in specific ways, depending on the demands of the task. Different integration ways between modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs), which formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, necessitating that LMMs employ correct forms of multimodal interaction to effectively complete the task. MIBench assesses models from three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs show that: (1) LMMs' ability on multimodal interaction remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by textual modalities when processing vision information; (3) they mostly possess a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect that these observations can serve as a reference for developing LMMs with more enhanced multimodal ability in the future.
Chinese Translation
在不同的多模态场景中,需要根据任务的需求以特定方式整合和利用跨模态的信息。不同的模态间整合方式被称为“多模态交互”。模型处理各种多模态交互的能力在很大程度上表征了其多模态能力。在本文中,我们介绍了MIBench,这是一个综合基准,旨在评估大型多模态模型(LMMs)的多模态交互能力。该基准将每个实例表述为一个包含视觉和文本上下文的(con_v, con_t, task)三元组,要求LMMs采用正确的多模态交互形式以有效完成任务。MIBench从三个关键方面评估模型:从视觉中心或文本中心线索中获取信息的能力,以及从它们的联合协同中生成新信息的能力。每种交互能力在三个认知层次上进行分层评估:识别、理解和推理。MIBench包含超过10,000对视觉-文本上下文,涵盖32个不同的任务。对最先进的LMMs的评估表明:(1)尽管模型参数和训练数据的规模不断扩大,LMMs在多模态交互方面的能力仍然受到限制;(2)在处理视觉信息时,它们容易受到文本模态的干扰;(3)它们大多仅具备基本的多模态协同能力;(4)原生训练的多模态模型在基本交互能力上存在明显不足。我们期望这些观察结果能够为未来开发具有更强多模态能力的LMMs提供参考。
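As a rough illustration of the (con_v, con_t, task) formulation, a single benchmark instance might be represented as below; the field contents and names beyond the triplet itself are hypothetical, since the abstract only names the triplet.

```python
# A hypothetical sketch of one MIBench-style instance (not the released schema).
from dataclasses import dataclass

@dataclass
class Instance:
    con_v: str   # reference to the visual context, e.g. an image file
    con_t: str   # textual context accompanying the image
    task: str    # task the LMM must solve with the correct interaction form

ex = Instance(con_v="chart_042.png",
              con_t="The y-axis reports revenue in millions.",
              task="Which quarter had the highest revenue?")
```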
cs.CV / 57 / 2603.13429

A Deformable Attention-Based Detection Transformer with Cross-Scale Feature Fusion for Industrial Coil Spring Inspection

一种基于可变形注意力的检测变换器,结合跨尺度特征融合用于工业弹簧检查
Rossi, Matteo, Matt, Pony
Abstract
Automated visual inspection of locomotive coil springs presents significant challenges due to the morphological diversity of surface defects, substantial scale variations, and complex industrial backgrounds. This paper proposes MSD-DETR (Multi-Scale Deformable Detection Transformer), a novel detection framework that addresses these challenges through three key innovations: (1) a structural re-parameterization strategy that decouples training-time multi-branch topology from inference-time efficiency, enhancing feature extraction while maintaining real-time performance; (2) a deformable attention mechanism that enables content-adaptive spatial sampling, allowing dynamic focus on defect-relevant regions regardless of morphological irregularity; and (3) a cross-scale feature fusion architecture incorporating GSConv modules and VoVGSCSP blocks for effective multi-resolution information aggregation. Comprehensive experiments on a real-world locomotive coil spring dataset demonstrate that MSD-DETR achieves 92.4% mAP@0.5 at 98 FPS, outperforming state-of-the-art detectors including YOLOv8 (+3.1% mAP) and the baseline RT-DETR (+2.8% mAP) while maintaining comparable inference speed, establishing a new benchmark for industrial coil spring quality inspection.
Chinese Translation
机车弹簧的自动视觉检测面临着显著的挑战,主要由于表面缺陷的形态多样性、显著的尺度变化以及复杂的工业背景。本文提出了一种新的检测框架MSD-DETR(多尺度可变形检测变换器),通过三个关键创新来应对这些挑战:(1)一种结构重参数化策略,将训练时的多分支拓扑与推理时的效率解耦,增强特征提取的同时保持实时性能;(2)一种可变形注意力机制,使得内容自适应的空间采样成为可能,能够动态聚焦于与缺陷相关的区域,无论其形态不规则性如何;(3)一种跨尺度特征融合架构,结合GSConv模块和VoVGSCSP块,有效聚合多分辨率信息。在真实世界的机车弹簧数据集上的全面实验表明,MSD-DETR在98 FPS下达到了92.4%的mAP@0.5,超越了包括YOLOv8(+3.1% mAP)和基线RT-DETR(+2.8% mAP)在内的最先进检测器,同时保持了可比的推理速度,为工业弹簧质量检测建立了新的基准。
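The abstract does not detail the branch topology, but structural re-parameterization is commonly instantiated RepVGG-style: a training-time 3x3 + 1x1 + identity block is algebraically folded into a single 3x3 convolution at inference. A hedged PyTorch sketch of that folding (bias terms included, batch norm omitted for brevity; this is one standard realization, not necessarily MSD-DETR's):

```python
import torch
import torch.nn as nn

C = 8  # in == out channels so the identity branch is well defined
conv3 = nn.Conv2d(C, C, 3, padding=1, bias=True)
conv1 = nn.Conv2d(C, C, 1, bias=True)

# Fold all branches into one 3x3 conv: pad the 1x1 kernel to 3x3
# and express the identity branch as a centered delta kernel.
fused = nn.Conv2d(C, C, 3, padding=1, bias=True)
w = conv3.weight.data.clone()
w += nn.functional.pad(conv1.weight.data, [1, 1, 1, 1])  # 1x1 -> center of 3x3
ident = torch.zeros_like(w)
for i in range(C):
    ident[i, i, 1, 1] = 1.0
fused.weight.data = w + ident
fused.bias.data = conv3.bias.data + conv1.bias.data

x = torch.randn(1, C, 16, 16)
multi_branch = conv3(x) + conv1(x) + x
assert torch.allclose(multi_branch, fused(x), atol=1e-5)
```

The assertion checks that the fused kernel reproduces the multi-branch output, which is why the richer training topology costs nothing at deployment.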
cs.CV / 58 / 2603.13432

Spatial Transcriptomics as Images for Large-Scale Pretraining

空间转录组学的图像化用于大规模预训练
Zhu, Yishun, Qi, Jiaxin, Wang, Jian, Zheng, Yuhua, Huang, Jianqiang
Abstract
Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains ill-posed. Existing choices fall into two camps: (1) treating each spot as an independent sample, which discards spatial dependencies and collapses ST into single-cell transcriptomics; and (2) treating an entire slide as a single sample, which produces prohibitively large inputs and drastically fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics as croppable images. Specifically, we define a multi-channel image representation with fixed spatial size by cropping patches from raw slides, thereby preserving spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that the proposed image-like dataset construction for ST pretraining consistently improves downstream performance, outperforming conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.
Chinese Translation
空间转录组学(Spatial Transcriptomics, ST)在组织切片的离散位置以精确坐标描绘成千上万的基因表达值,保留了对临床和病理研究至关重要的空间上下文。随着测序通量的上升和平台的进步,数据量的扩大刺激了大规模ST预训练的需求。然而,用于预训练的基本单元——即构成单个训练样本的内容——尚未明确。目前的选择可分为两类:(1)将每个位置视为独立样本,这样忽略了空间依赖性,并将ST简化为单细胞转录组学;(2)将整个载玻片视为单个样本,这会生成过于庞大的输入数据,并显著减少训练样本,削弱有效的预训练。为了解决这一问题,我们建议将空间转录组学视作可裁剪的图像。具体而言,我们通过从原始载玻片裁剪补丁来定义具有固定空间尺寸的多通道图像表示,从而保留空间上下文,同时显著增加训练样本的数量。沿通道维度,我们定义基因子集选择规则以控制输入维度,并提高预训练的稳定性。大量实验表明,所提出的类似图像的数据集构建方法在ST预训练中始终提升了下游性能,超过了传统的预训练方案。消融研究验证了空间补丁和通道设计的必要性,为组织ST数据建立了统一且实用的范式,进而实现大规模预训练。
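A minimal sketch of the patch construction described here, assuming integer spot coordinates and a simple index-based gene-subset rule (both assumptions; the paper's exact rules are not given in the abstract):

```python
# A hedged sketch: rasterize ST spots into a fixed-size multi-channel patch.
import numpy as np

def st_to_patch(coords, expr, gene_idx, origin, size=64):
    """coords: (N, 2) integer grid coords; expr: (N, G); gene_idx: channels kept."""
    patch = np.zeros((size, size, len(gene_idx)), dtype=np.float32)
    local = coords - origin                          # crop window anchored at `origin`
    keep = np.all((local >= 0) & (local < size), axis=1)
    ys, xs = local[keep, 0], local[keep, 1]
    patch[ys, xs] = expr[keep][:, gene_idx]          # one spot per pixel
    return patch

coords = np.random.randint(0, 200, size=(1000, 2))            # toy spot locations
expr = np.random.rand(1000, 2000).astype(np.float32)          # toy expression matrix
patch = st_to_patch(coords, expr, gene_idx=np.arange(64), origin=(50, 50))
print(patch.shape)  # (64, 64, 64): fixed spatial size, gene subset as channels
```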
cs.CV / 59 / 2603.13435

CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models

CtrlAttack:一种针对扩散模型中世界模型控制的统一攻击
Xu, Shuhan, Liang, Siyuan, Zheng, Hongling, Luo, Yong, Hu, Han, Zhang, Lefei, Tao, Dacheng
Abstract
Diffusion-based image-to-video (I2V) models increasingly exhibit world-model-like properties by implicitly capturing temporal dynamics. However, existing studies have mainly focused on visual quality and controllability, and the robustness of the state transition learned by the model remains understudied. To fill this gap, we are the first to analyze the vulnerability of I2V models, find that temporal control mechanisms constitute a new attack surface, and reveal the challenge of modeling them uniformly under different attack settings. Based on this, we propose a trajectory-control attack, called CtrlAttack, to interfere with state evolution during the generation process. Specifically, we represent the perturbation as a low-dimensional velocity field and construct a continuous displacement field via temporal integration, thereby affecting the model's state transitions while maintaining temporal consistency; meanwhile, we map the perturbation to the observation space, making the method applicable to both white-box and black-box attack settings. Experimental results show that even under low-dimensional and strongly regularized perturbation constraints, our method can still significantly disrupt temporal consistency by increasing the attack success rate (ASR) to over 90% in the white-box setting and over 80% in the black-box setting, while keeping the variation of the FID and FVD within 6 and 130, respectively, thus revealing the potential security risk of I2V models at the level of state dynamics.
Chinese Translation
基于扩散的图像到视频(I2V)模型日益显示出类似世界模型的特性,能够隐式捕获时间动态。然而,现有研究主要集中于视觉质量和可控性,而模型学习的状态转变的鲁棒性仍然没有得到充分研究。为了填补这一空白,我们首次分析了I2V模型的脆弱性,发现时间控制机制构成了新的攻击面,并揭示了在不同攻击设置下对其进行统一建模的挑战。基于此,我们提出了一种称为CtrlAttack的轨迹控制攻击,用于干扰生成过程中的状态演变。具体而言,我们将扰动表示为低维速度场,并通过时间积分构建连续位移场,从而影响模型的状态转变,同时保持时间一致性;与此同时,我们将扰动映射到观察空间,使得该方法适用于白盒和黑盒攻击设置。实验结果表明,即便在低维和强约束的扰动限制下,我们的方法仍能显著破坏时间一致性,使攻击成功率(ASR)在白盒设置下超过90%,在黑盒设置下超过80%,同时将FID和FVD的变化控制在6和130以内,从而揭示了I2V模型在状态动态层面的潜在安全风险。
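The "velocity field integrated into a displacement field" step can be sketched directly; the grid sizes, the cumulative-sum integrator, and the bilinear upsampling to observation space below are all assumptions about one plausible realization, not the paper's implementation.

```python
# A hedged sketch of temporal integration of a low-dimensional velocity field.
import torch
import torch.nn.functional as F

T, H, W = 16, 8, 8                           # low-dimensional perturbation grid
velocity = 0.01 * torch.randn(T, 2, H, W)    # (dx, dy) per frame

# Displacement at frame t is the running sum of velocities up to t,
# which keeps adjacent frames consistent by construction.
displacement = torch.cumsum(velocity, dim=0)

# Map to observation space by upsampling to the video resolution.
disp_full = F.interpolate(displacement, size=(256, 256),
                          mode="bilinear", align_corners=False)
print(disp_full.shape)  # (16, 2, 256, 256)
```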
cs.CV / 60 / 2603.13437

Vision-Language Based Expert Reporting for Painting Authentication and Defect Detection

基于视觉-语言的专家报告用于绘画鉴定和缺陷检测
Ouda, Eman, Salah, Mohammed, Chulkov, Arsenii O., Gargiulo, Gianfranco, Tartaglia, Gian Luca, Sfarra, Stefano, Abdulrahman, Yusra
Abstract
Authenticity and condition assessment are central to conservation decision-making, yet interpretation and reporting of thermographic output remain largely bespoke and expert-dependent, complicating comparison across collections and limiting systematic integration into conservation documentation. Pulsed Active Infrared Thermography (AIRT) is sensitive to subsurface features such as material heterogeneity, voids, and past interventions; however, its broader adoption is constrained by artifact misinterpretation, inter-laboratory variability, and the absence of standardized, explainable reporting frameworks. Although multi-modal thermographic processing techniques are established, their integration with structured natural-language interpretation has not been explored in cultural heritage. A fully automated thermography-vision-language model (VLM) framework is presented. It combines multi-modal AIRT analysis with modality-aware textual reporting, without human intervention during inference. Thermal sequences are processed using Principal Component Thermography (PCT), Thermographic Signal Reconstruction (TSR), and Pulsed Phase Thermography (PPT), and the resulting anomaly masks are fused into a consensus segmentation that emphasizes regions supported by multiple thermal indicators while mitigating boundary artifacts. The fused evidence is provided to a VLM, which generates structured reports describing the location of the anomaly, thermal behavior, and plausible physical interpretations while explicitly acknowledging the uncertainty and diagnostic limitations. Evaluation on two marquetries demonstrates consistent anomaly detection and stable structured interpretations, indicating reproducibility and generalizability across samples.
Chinese Translation
真实性和状态评估是保护决策的核心,然而热成像输出的解释和报告在很大程度上仍然是定制的且依赖专家,这使得跨藏品的比较变得复杂,并限制了系统性地整合到保护文档中。脉冲主动红外热成像(AIRT)对材料异质性、空洞和过去干预等表面下特征敏感;然而,其更广泛的应用受到伪影误读、实验室间变异性以及缺乏标准化、可解释报告框架的限制。尽管多模态热成像处理技术已建立,但其与结构化自然语言解释的结合在文化遗产领域尚未得到探索。本文提出了一种完全自动化的热成像-视觉-语言模型(VLM)框架。该框架结合了多模态AIRT分析与感知模态的文本报告,在推理过程中无需人工干预。热序列通过主成分热成像(PCT)、热成像信号重建(TSR)和脉冲相位热成像(PPT)进行处理,生成的异常掩膜被融合成一个共识分割,强调由多个热指标支持的区域,同时减轻边界伪影。融合的证据被提供给VLM,生成结构化报告,描述异常的位置、热行为和合理的物理解释,同时明确承认不确定性和诊断限制。在两个镶嵌作品上的评估表明一致的异常检测和稳定的结构化解释,表明在样本间的可重复性和普遍性。
cs.CV / 61 / 2603.13438

Draft-and-Target Sampling for Video Generation Policy

用于视频生成策略的草拟与目标采样
Zhang, Qikang, Lei, Yingjie, Liu, Wei, Liu, Daochang
Abstract
Video generation models have been used as robot policies to predict the future states of executing a task conditioned on a task description and observation. Previous works ignore their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel, training-free diffusion inference paradigm for video generation policies that improves inference efficiency. We introduce a self-play denoising approach that utilizes two complementary denoising trajectories in a single model: draft sampling takes large steps to quickly generate a global trajectory, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method achieves up to 2.1x speedup and improves the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.
Chinese Translation
视频生成模型被用作机器人策略,以预测在任务描述和观察条件下执行任务的未来状态。先前的研究忽视了其高计算成本和较长的推理时间。为了解决这一挑战,我们提出了草拟与目标采样(Draft-and-Target Sampling),这是一种新型的扩散推理范式,针对视频生成策略,具有无训练的特点,并能够提升推理效率。我们通过在单一模型中利用两个互补的去噪轨迹,提出了一种自我对弈去噪方法:草拟采样(draft sampling)以较大步长快速生成全局轨迹,而目标采样(target sampling)则以较小步长对其进行验证。为了进一步加速生成,我们引入了令牌分块和渐进接受策略,以减少冗余计算。在三个基准测试上的实验表明,我们的方法在提高当前最先进方法的效率方面,可以实现高达2.1倍的加速,且对成功率的影响最小。我们的代码已公开。
cs.CV / 62 / 2603.13450

LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models

LADR:基于局部性意识的动态救援方法,用于高效的文本到图像生成与扩散大语言模型
Wang, Chenglin, Zhou, Yucheng, Chen, Shawn, Wang, Tao, Zhang, Kai
Abstract
Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that our LADR achieves an approximately 4x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.
Chinese Translation
离散扩散语言模型已成为统一多模态生成的一个引人注目的范式,但由于迭代解码导致的高推理延迟,其部署受到限制。现有的加速策略通常需要昂贵的重新训练,或未能利用视觉数据中固有的二维空间冗余。为了解决这一问题,我们提出了基于局部性意识的动态救援(Locality-Aware Dynamic Rescue, LADR),这是一种无训练的方法,通过利用图像的空间马尔可夫特性来加速推理。LADR优先恢复位于“生成前沿”的标记,即与观察到的像素空间相邻的区域,从而最大化信息增益。具体而言,我们的方法整合了形态邻居识别以定位候选标记,采用风险界限过滤机制以防止错误传播,并利用流形一致的逆调度将扩散轨迹与加速的掩模密度对齐。在四个文本到图像生成基准上的广泛实验表明,我们的LADR相较于标准基线实现了约4倍的加速。值得注意的是,它保持或甚至增强了生成的保真度,特别是在空间推理任务中,提供了效率与质量之间的最佳权衡。
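The "generation frontier" is morphologically well-defined, so a small sketch is possible. Treating decoded tokens as a binary grid (an assumption about the layout), the frontier is one dilation step minus the decoded set:

```python
# A minimal sketch (assumed, not the authors' code) of morphological
# neighbor identification for the generation frontier.
import numpy as np
from scipy.ndimage import binary_dilation

decoded = np.zeros((16, 16), dtype=bool)   # token grid; True = already decoded
decoded[6:10, 6:10] = True

frontier = binary_dilation(decoded) & ~decoded   # masked tokens adjacent to decoded ones
candidates = np.argwhere(frontier)               # tokens to prioritize for recovery
print(len(candidates))
```

A risk-bounded filter would then drop frontier candidates whose predicted confidence falls below a bound before committing them, per the abstract's description.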
cs.CV / 63 / 2603.13497

Synthetic Melanoma Image Generation and Evaluation Using Generative Adversarial Networks

基于生成对抗网络的合成黑色素瘤图像生成与评估
Lin, Pei-Yu, Shen, Yidan, Mathew, Neville, Hu, Renjie, Huang, Siyu, Queen, Courtney M., West, Cameron E., Ciurea, Ana, Zouridakis, George
Abstract
Melanoma is the most lethal form of skin cancer, and early detection is critical for improving patient outcomes. Although dermoscopy combined with deep learning has advanced automated skin-lesion analysis, progress is hindered by limited access to large, well-annotated datasets and by severe class imbalance, where melanoma images are substantially underrepresented. To address these challenges, we present the first systematic benchmarking study comparing four GAN architectures (DCGAN, StyleGAN2, and two StyleGAN3 variants, T and R) for high-resolution melanoma-specific synthesis. We train and optimize all models on two expert-annotated benchmarks (ISIC 2018 and ISIC 2020) under unified preprocessing and hyperparameter exploration, with particular attention to R1 regularization tuning. Image quality is assessed through a multi-faceted protocol combining distribution-level metrics (FID), sample-level representativeness (FMD), qualitative dermoscopic inspection, downstream classification with a frozen EfficientNet-based melanoma detector, and independent evaluation by two board-certified dermatologists. StyleGAN2 achieves the best balance of quantitative performance and perceptual quality, attaining FID scores of 24.8 (ISIC 2018) and 7.96 (ISIC 2020) at gamma=0.8. The frozen classifier recognizes 83% of StyleGAN2-generated images as melanoma, while dermatologists distinguish synthetic from real images at only 66.5% accuracy (chance = 50%), with low inter-rater agreement (kappa = 0.17). In a controlled augmentation experiment, adding synthetic melanoma images to address class imbalance improved melanoma detection AUC from 0.925 to 0.945 on a held-out real-image test set. These findings demonstrate that StyleGAN2-generated melanoma images preserve diagnostically relevant features and can provide a measurable benefit for mitigating class imbalance in melanoma-focused machine learning pipelines.
Chinese Translation
黑色素瘤是最致命的皮肤癌形式,早期检测对改善患者预后至关重要。尽管皮肤镜检查结合深度学习已推动自动化皮肤病变分析的发展,但由于缺乏大型、良好标注的数据集以及严重的类别不平衡(黑色素瘤图像显著不足),进展受到限制。为了解决这些挑战,我们首次进行系统的基准测试研究,比较四种生成对抗网络(GAN)架构——DCGAN、StyleGAN2,以及两个StyleGAN3变体(T/R)——用于高分辨率黑色素瘤特定合成。我们在两个专家标注的基准数据集(ISIC 2018和ISIC 2020)上对所有模型进行训练和优化,采用统一的预处理和超参数探索,特别关注R1正则化的调优。图像质量通过多方面的协议进行评估,结合了分布级别指标(FID)、样本级别代表性(FMD)、定性皮肤镜检查、使用冻结的基于EfficientNet的黑色素瘤检测器进行的下游分类,以及两位获得认证的皮肤科医生的独立评估。StyleGAN2在定量性能和感知质量之间取得了最佳平衡,在gamma=0.8时,FID得分为24.8(ISIC 2018)和7.96(ISIC 2020)。冻结的分类器将83%的StyleGAN2生成图像识别为黑色素瘤,而皮肤科医生仅以66.5%的准确率(机会 = 50%)区分合成图像与真实图像,且评估者之间的一致性较低(kappa = 0.17)。在一个受控的增强实验中,添加合成黑色素瘤图像以解决类别不平衡问题,使黑色素瘤检测的AUC从0.925提高到0.945,基于保留的真实图像测试集。这些发现表明,StyleGAN2生成的黑色素瘤图像保留了诊断相关特征,并可以为缓解黑色素瘤相关机器学习管道中的类别不平衡提供可测量的益处。
cs.CV / 64 / 2603.13500

ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning

ActionPlan:通过帧级动作规划实现未来感知的流式运动合成
Nazarenus, Eric, Li, Chuqiao, He, Yannan, Xie, Xianghui, Lenssen, Jan Eric, Pons-Moll, Gerard
Abstract
We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25x faster while also achieving 18% motion quality improvement over the best previous method in terms of FID.
Chinese Translation
我们提出了ActionPlan,一个统一的运动扩散框架,旨在将实时流媒体与高质量离线生成结合在一个模型中。其核心思想是引入每帧的动作计划:模型预测帧级文本潜变量,这些潜变量在去噪过程中充当密集的语义锚点,并利用它们结合语义和运动线索对完整的运动序列进行去噪。为了支持这种结构化的工作流程,我们设计了潜变量特定的扩散步骤,使得每个运动潜变量可以独立去噪,并在推理时以灵活的顺序进行采样。因此,ActionPlan能够以历史条件和未来感知的模式进行实时流媒体,同时也支持高质量的离线生成。相同的机制进一步实现了零样本运动编辑和插帧,而无需额外的模型。实验表明,我们的实时流媒体速度提高了5.25倍,同时在FID方面相比于最佳的先前方法实现了18%的运动质量提升。
cs.CV / 65 / 2603.13506

LibraGen: Playing a Balance Game in Subject-Driven Video Generation

LibraGen:在主题驱动的视频生成中玩平衡游戏
Zhu, Jiahao, Lao, Shanshan, Liu, Lijie, Li, Gen, Qi, Tianhao, Han, Wei, Li, Bingchuan, Liu, Fangfang, Chen, Zhuowei, Ma, Tianxiang, HE, Qian, Zhou, Yi, Xie, Xiaohua
Abstract
With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
Chinese Translation
随着视频生成基础模型(VGFMs)的进步,定制化生成,特别是主题到视频(S2V),引起了越来越多的关注。然而,一个关键挑战在于平衡VGFM的内在先验,如运动一致性、视觉美学和提示对齐,以及其新衍生的S2V能力。现有方法往往通过增强一个方面而忽视这种平衡,导致其他方面的牺牲。为了解决这个问题,我们提出了LibraGen,一个新颖的框架,将扩展基础模型用于S2V生成视为内在VGFM优势与S2V能力之间的平衡游戏。具体而言,在“提升支点,调节平衡”的核心理念指导下,我们将数据质量视为支点,并倡导优质优先于数量的方法。我们构建了一个混合管道,结合自动和手动数据过滤,以提高整体数据质量。为了进一步协调VGFM的原生能力与其S2V扩展,我们引入了一种调节平衡的后训练范式。在监督微调过程中,交叉对和同对数据被纳入,并采用模型合并以实现有效的权衡。随后,我们设计并合并了两个量身定制的直接偏好优化(DPO)管道,即Consis-DPO和Real-Fake DPO,以巩固这种平衡。在推理过程中,我们引入了一种时间依赖的动态无分类器引导方案,以实现灵活和细致的控制。实验结果表明,LibraGen在仅使用千规模训练数据的情况下,优于现有的开源和商业S2V模型。
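Time-dependent dynamic classifier-free guidance admits a compact sketch: the guidance weight becomes a function of the timestep rather than a constant. The linear decay schedule and endpoint weights below are assumptions for illustration, not LibraGen's actual schedule:

```python
# A hedged sketch of time-dependent classifier-free guidance.
import torch

def dynamic_cfg(eps_uncond, eps_cond, t, T, w_start=7.5, w_end=2.0):
    """Guidance weight decays linearly over the denoising trajectory."""
    w = w_start + (w_end - w_start) * (t / T)
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = torch.randn(2, 4, 32, 32)   # unconditional noise prediction
eps_c = torch.randn(2, 4, 32, 32)   # subject-conditioned noise prediction
guided = dynamic_cfg(eps_u, eps_c, t=500, T=1000)
```

Scheduling the weight over time is one way to trade subject fidelity early in denoising against the foundation model's native priors late in denoising, in line with the balance framing above.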
cs.CV / 66 / 2603.13507

MIRAGE: Model-agnostic Industrial Realistic Anomaly Generation and Evaluation for Visual Anomaly Detection

MIRAGE:模型无关的工业现实异常生成与评估用于视觉异常检测
Hu, Jinwei, Borsatti, Francesco, Stropeni, Arianna, Pezze, Davide Dalle, Barusco, Manuel, Susto, Gian Antonio
Abstract
Industrial visual anomaly detection (VAD) methods are typically trained on normal samples only, yet performance improves substantially when even limited anomalous data is available. Existing anomaly generation approaches either require real anomalous examples, demand expensive hardware, or produce synthetic defects that lack realism. We present MIRAGE (Model-agnostic Industrial Realistic Anomaly Generation and Evaluation), a fully automated pipeline for realistic anomalous image generation and pixel-level mask creation that requires no training and no anomalous images. Our pipeline accesses any generative model as a black box via API calls, uses a VLM for automatic defect prompt generation, and includes a CLIP-based quality filter to retain only well-aligned generated images. For mask generation at scale, we introduce a lightweight, training-free dual-branch semantic change detection module combining text-conditioned Grounding DINO features with fine-grained YOLOv26-Seg structural features. We benchmark four generation methods using Gemini 2.5 Flash Image (Nano Banana) as the generative backbone, evaluating performance on MVTec AD and VisA across two distinct tasks: (i) downstream anomaly segmentation and (ii) visual quality of the generated images, assessed via standard metrics (IS, IC-LPIPS) and a human perceptual study involving 31 participants and 1,550 pairwise votes. The results demonstrate that MIRAGE offers a scalable, accessible foundation for anomaly-aware industrial inspection that requires no real defect data. As a final contribution, we publicly release a large-scale dataset comprising 500 image-mask pairs per category for every MVTec AD and VisA class, over 13,000 pairs in total, alongside all generation prompts and pipeline code.
Chinese Translation
工业视觉异常检测(VAD)方法通常仅在正常样本上进行训练,但当有限的异常数据可用时,性能会显著提高。现有的异常生成方法要么需要真实的异常示例,要么需要昂贵的硬件,或者生成缺乏真实感的合成缺陷。我们提出了MIRAGE(模型无关的工业现实异常生成与评估),这是一个完全自动化的管道,用于生成现实的异常图像和像素级掩膜创建,无需训练和异常图像。我们的管道通过API调用将任何生成模型作为黑箱访问,使用视觉语言模型(VLM)自动生成缺陷提示,并包括基于CLIP的质量过滤器,仅保留对齐良好的生成图像。为了大规模生成掩膜,我们引入了一个轻量级、无训练的双分支语义变化检测模块,将文本条件的Grounding DINO特征与细粒度的YOLOv26-Seg结构特征相结合。我们使用Gemini 2.5 Flash Image(Nano Banana)作为生成骨干,基准测试四种生成方法,在MVTec AD和VisA上评估性能,涉及两个不同的任务:(i)下游异常分割和(ii)生成图像的视觉质量,通过标准指标(IS,IC-LPIPS)和一项涉及31名参与者和1,550对投票的人类感知研究进行评估。结果表明,MIRAGE为异常感知的工业检测提供了一个可扩展、可访问的基础,无需真实缺陷数据。作为最后的贡献,我们公开发布了一个大规模数据集,每个MVTec AD和VisA类别包含500个图像-掩膜对,总计超过13,000对,并附上所有生成提示和管道代码。
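A CLIP-based quality filter of the kind described can be sketched with the Hugging Face transformers API; the checkpoint and the cosine-similarity threshold are assumed stand-ins, not MIRAGE's settings:

```python
# A hedged sketch of a CLIP alignment gate for generated images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_image(image: Image.Image, prompt: str, threshold: float = 0.25) -> bool:
    """Retain a generated image only if it aligns with its defect prompt."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum()) >= threshold   # cosine alignment gate
```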
cs.CV / 67 / 2603.13520

A Systematic Benchmark of GAN Architectures for MRI-to-CT Synthesis

MRI到CT合成的GAN架构系统基准测试
Pesci, Alessandro, Guarrasi, Valerio, Alì, Marco, Castiglioni, Isabella, Soda, Paolo
Abstract
The translation from Magnetic resonance imaging (MRI) to Computed tomography (CT) has been proposed as an effective solution to facilitate MRI-only clinical workflows while limiting exposure to ionizing radiation. Although numerous Generative Adversarial Network (GAN) architectures have been proposed for MRI-to-CT translation, systematic and fair comparisons across heterogeneous models remain limited. We present a comprehensive benchmark of ten GAN architectures evaluated on the SynthRAD2025 dataset across three anatomical districts (abdomen, thorax, head-and-neck). All models were trained under a unified validation protocol with identical preprocessing and optimization settings. Performance was assessed using complementary metrics capturing voxel-wise accuracy, structural fidelity, perceptual quality, and distribution-level realism, alongside an analysis of computational complexity. Supervised Paired models consistently outperformed Unpaired approaches, confirming the importance of voxel-wise supervision. Pix2Pix achieved the most balanced performance across districts while maintaining a favorable quality-to-complexity trade-off. Multi-district training improved structural robustness, whereas intra-district training maximized voxel-wise fidelity. This benchmark provides quantitative and computational guidance for model selection in MRI-only radiotherapy workflows and establishes a reproducible framework for future comparative studies. To ensure the reproducibility of our experiments we make our code public, together with the overall results, at the following link:https://github.com/arco-group/MRI_TO_CT.git
Chinese Translation
将磁共振成像(MRI)转换为计算机断层扫描(CT)被提出作为一种有效的解决方案,以促进仅使用MRI的临床工作流程,同时限制对电离辐射的暴露。尽管已有众多生成对抗网络(GAN)架构被提出用于MRI到CT的转换,但对异构模型进行系统和公平的比较仍然有限。我们展示了在SynthRAD2025数据集上评估的十种GAN架构的综合基准,涵盖三个解剖区域(腹部、胸部、头颈部)。所有模型均在统一的验证协议下训练,采用相同的预处理和优化设置。通过捕捉体素级准确性、结构保真度、感知质量和分布级现实性的互补指标来评估性能,同时分析计算复杂性。监督配对模型在性能上始终优于非配对方法,确认了体素级监督的重要性。Pix2Pix在各个区域之间实现了最平衡的性能,同时保持了良好的质量与复杂度的权衡。多区域训练提高了结构稳健性,而区域内训练则最大化了体素级保真度。该基准为MRI仅放疗工作流程中的模型选择提供了定量和计算指导,并为未来的比较研究建立了可重复的框架。为了确保我们实验的可重复性,我们将代码及总体结果公开,链接如下:https://github.com/arco-group/MRI_TO_CT.git
cs.CV / 68 / 2603.13521

Eleven Primitives and Three Gates: The Universal Structure of Computational Imaging

十一种基本元素与三种门控:计算成像的普遍结构
Yang, Chengshuai, Yuan, Xin
Abstract
Computational imaging systems -- from coded-aperture cameras to cryo-electron microscopes -- span five carrier families yet share a hidden structural simplicity. We prove that every imaging forward model decomposes into a directed acyclic graph over exactly 11 physically typed primitives (Finite Primitive Basis Theorem) -- a sufficient and minimal basis that provides a compositional language for designing any imaging modality. We further prove that every reconstruction failure has exactly three independent root causes: information deficiency, carrier noise, and operator mismatch (Triad Decomposition). The three gates map to the system lifecycle: Gates 1 and 2 guide design (sampling geometry, carrier selection); Gate 3 governs deployment-stage calibration and drift correction. Validation across 12 modalities and all five carrier families confirms both results, with +0.8 to +13.9 dB recovery on deployed instruments. Together, the 11 primitives and 3 gates establish the first universal grammar for designing, diagnosing, and correcting computational imaging systems.
Chinese Translation
计算成像系统——从编码孔径相机到冷冻电子显微镜——涵盖了五种载体家族,但却共享一种隐秘的结构简单性。我们证明了每个成像前向模型都可以分解为一个有向无环图,包含恰好11种物理类型的基本元素(有限基本元素基础定理)——这是一个充分且最小的基础,为设计任何成像方式提供了组成语言。我们进一步证明,每个重建失败都有恰好三个独立的根本原因:信息不足、载体噪声和操作不匹配(三元分解)。这三种门控映射到系统生命周期:门控1和门控2指导设计(采样几何、载体选择);门控3则管理部署阶段的校准和漂移修正。在12种成像方式和所有五种载体家族中的验证结果确认了这两个结论,在已部署仪器上实现了+0.8到+13.9 dB的恢复。总的来说,这11种基本元素和3种门控建立了设计、诊断和修正计算成像系统的首个通用语法。
cs.CV / 69 / 2603.13524

Hide and Seek: Investigating Redundancy in Earth Observation Imagery

藏与寻:探讨地球观测影像中的冗余性
Papazafeiropoulos, Tasos, Bountos, Nikolaos Ioannis, Papadopoulos, Nikolas, Papoutsis, Ioannis
Abstract
The growing availability of Earth Observation (EO) data and recent advances in Computer Vision have driven rapid progress in machine learning for EO, producing domain-specific models at ever-increasing scales. Yet this progress risks overlooking fundamental properties of EO data that distinguish it from other domains. We argue that EO data exhibit a multidimensional redundancy (spectral, temporal, spatial, and semantic) which has a more pronounced impact on the domain and its applications than what current literature reflects. To validate this hypothesis, we conduct a systematic domain-specific investigation examining the existence, consistency, and practical implications of this phenomenon across key dimensions of EO variability. Our findings confirm that redundancy in EO data is both substantial and pervasive: exploiting it yields comparable performance ($\approx98.5\%$ of baseline) at a fraction of the computational cost ($\approx4\times$ fewer GFLOPs), at both training and inference. Crucially, these gains are consistent across tasks, geospatial locations, sensors, ground sampling distances, and architectural designs; suggesting that multi-faceted redundancy is a structural property of EO data rather than an artifact of specific experimental choices. These results lay the groundwork for more efficient, scalable, and accessible large-scale EO models.
Chinese Translation
地球观测(EO)数据的日益可用性以及计算机视觉的最新进展推动了机器学习在EO领域的快速发展,产生了越来越大规模的领域特定模型。然而,这一进展有可能忽视EO数据的基本特性,这些特性使其与其他领域区分开来。我们认为,EO数据表现出多维冗余(光谱、时间、空间和语义),这种冗余对该领域及其应用的影响比当前文献所反映的更为显著。为了验证这一假设,我们进行了一项系统的领域特定调查,考察这一现象在EO变异性关键维度上的存在性、一致性和实际影响。我们的研究结果确认,EO数据中的冗余既显著又普遍:利用这一冗余可以在训练和推理阶段以较低的计算成本(约减少4倍的GFLOPs)实现与基线相当的性能(约98.5%)。重要的是,这些收益在任务、地理位置、传感器、地面采样距离和架构设计等方面是一致的;这表明多方面的冗余是EO数据的结构性特征,而不是特定实验选择的产物。这些结果为更高效、可扩展和可访问的大规模EO模型奠定了基础。
cs.CV / 70 / 2603.13533

SAIF: A Stability-Aware Inference Framework for Medical Image Segmentation with Segment Anything Model

SAIF:一种基于稳定性的医学图像分割推理框架,结合Segment Anything模型
Wu, Ke, Chen, Shiqi, Zhong, Yiheng, Liu, Hengxian, Su, Yingxue, Wang, Yifang, Jin, Junhao, Ren, Guangyu
Abstract
The Segment Anything Model (SAM) enables scalable medical image segmentation but suffers from inference-time instability when deployed as a frozen backbone. In practice, bounding-box prompts often contain localization errors, and fixed threshold binarization introduces additional decision uncertainty. These factors jointly cause high prediction variance, especially near object boundaries, degrading reliability. We propose the Stability-Aware Inference Framework (SAIF), a training-free and plug-and-play inference framework that improves robustness by explicitly modeling prompt and threshold uncertainty. SAIF constructs a joint uncertainty space via structured box perturbations and threshold variations, evaluates each hypothesis using decision stability and boundary consistency, and introduces a stability-consistency score to filter unstable candidates and perform stability-weighted fusion in probability space. Experiments on Synapse, CVC-ClinicDB, Kvasir-SEG, and CVC-300 demonstrate that SAIF consistently improves segmentation accuracy and robustness, achieving state-of-the-art performance without retraining or architectural modification. Our anonymous code is released at https://anonymous.4open.science/r/SAIF.
Chinese Translation
Segment Anything模型(SAM)能够实现可扩展的医学图像分割,但在作为固定骨干网络部署时,推理过程存在不稳定性。在实际应用中,边界框提示常常包含定位误差,而固定阈值二值化则引入了额外的决策不确定性。这些因素共同导致高预测方差,尤其是在物体边界附近,降低了可靠性。我们提出了稳定性意识推理框架(SAIF),这是一种无训练且即插即用的推理框架,通过明确建模提示和阈值的不确定性来提高鲁棒性。SAIF通过结构化的框扰动和阈值变化构建联合不确定性空间,利用决策稳定性和边界一致性评估每个假设,并引入稳定性一致性评分来过滤不稳定候选,并在概率空间中进行稳定性加权融合。在Synapse、CVC-ClinicDB、Kvasir-SEG和CVC-300上的实验表明,SAIF始终提高了分割精度和鲁棒性,达到了最先进的性能,而无需重新训练或修改架构。我们的匿名代码已发布在 https://anonymous.4open.science/r/SAIF。
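The inference loop is concrete enough to outline: perturb the box prompt and the binarization threshold, score each hypothesis, and fuse in probability space. In this hedged sketch, `predict_prob` stands in for a frozen-SAM call, and the margin-based stability score is a simplification of the paper's stability-consistency score:

```python
# A hedged sketch of stability-weighted fusion over prompt/threshold hypotheses.
import numpy as np

def saif_fuse(predict_prob, box, deltas=(-4, 0, 4), thresholds=(0.4, 0.5, 0.6)):
    probs, scores = [], []
    for d in deltas:                             # structured box perturbations
        p = predict_prob(box + d)                # (H, W) probability map
        for t in thresholds:                     # threshold variations
            margin = float(np.mean(np.abs(p - t)))  # decisiveness under (d, t)
            probs.append(p)
            scores.append(margin)
    w = np.asarray(scores)
    w = w / w.sum()
    fused = sum(wi * pi for wi, pi in zip(w, probs))  # stability-weighted fusion
    return fused > 0.5

# Toy stand-in for a frozen segmentation backbone.
rng = np.random.default_rng(0)
mask = saif_fuse(lambda b: rng.uniform(size=(64, 64)), np.array([8, 8, 40, 40]))
```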
cs.CV / 71 / 2603.13547

NumColor: Precise Numeric Color Control in Text-to-Image Generation

NumColor:文本到图像生成中的精确数值颜色控制
Butt, Muhammad Atif, Hernandez, Diego, Gomez-Villa, Alexandra, Wang, Kai, Vazquez-Corral, Javier, Van De Weijer, Joost
Abstract
Text-to-image diffusion models excel at generating images from natural language descriptions, yet fail to interpret numerical colors such as hex codes (#FF5733) and RGB values (rgb(255,87,51)). This limitation stems from subword tokenization, which fragments color codes into semantically meaningless tokens that text encoders cannot map to coherent color representations. We present NumColor, which enables precise numerical color control across multiple diffusion architectures. NumColor comprises two components: a Color Token Aggregator that detects color specifications regardless of tokenization, and a ColorBook containing 6,707 learnable embeddings that map colors to the text encoder's embedding space via the perceptually uniform CIE Lab space. We introduce two auxiliary losses, directional alignment and interpolation consistency, to enforce geometric correspondence between Lab and embedding spaces, enabling smooth color interpolation. To train the ColorBook, we construct NumColor-Data, a synthetic dataset of 500K rendered images with unambiguous color-to-pixel correspondence, eliminating the annotation ambiguity inherent in photographic datasets. Although trained solely on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ without model-specific adaptation. NumColor improves numerical color accuracy by 4-9x across five models, while simultaneously improving color harmony scores by 10-30x on the GenColorBench benchmark.
Chinese Translation
文本到图像的扩散模型在根据自然语言描述生成图像方面表现出色,但在解释数值颜色(如十六进制代码 #FF5733 和 RGB 值 (rgb(255,87,51)))方面存在不足。这一局限性源于子词标记化,它将颜色代码分割成语义上无意义的标记,导致文本编码器无法将其映射到一致的颜色表示。我们提出了 NumColor,它能够在多种扩散架构中实现精确的数值颜色控制。NumColor 包含两个组件:一个颜色标记聚合器(Color Token Aggregator),能够检测颜色规范而不受标记化的影响,以及一个包含 6,707 个可学习嵌入的颜色书(ColorBook),将颜色映射到感知均匀的 CIE Lab 空间中的文本编码器嵌入空间。我们引入了两个辅助损失:方向对齐和插值一致性,以强制 Lab 空间与嵌入空间之间的几何对应关系,从而实现平滑的颜色插值。为了训练颜色书,我们构建了 NumColor-Data,这是一个包含 50 万张渲染图像的合成数据集,具有明确的颜色与像素对应关系,消除了摄影数据集中固有的标注歧义。尽管仅在 FLUX 上进行训练,NumColor 无需针对模型进行适配,即可零样本迁移到 SD3、SD3.5、PixArt-α 和 PixArt-Σ。NumColor 在五个模型上提高了数值颜色的准确性 4-9 倍,同时在 GenColorBench 基准上将颜色和谐度评分提高了 10-30 倍。
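The color-side preprocessing is sketchable: parse a hex code, convert it to perceptually uniform CIE Lab, and find the nearest of the 6,707 ColorBook entries. The random ColorBook below is a stand-in for the learned table, and scikit-image's rgb2lab is one conversion choice, not necessarily the paper's:

```python
# A hedged sketch of hex -> CIE Lab -> nearest-ColorBook-entry lookup.
import numpy as np
from skimage.color import rgb2lab

def hex_to_lab(code: str) -> np.ndarray:
    rgb = np.array([int(code[i:i + 2], 16) for i in (1, 3, 5)]) / 255.0
    return rgb2lab(rgb.reshape(1, 1, 3)).reshape(3)   # (L, a, b)

# Stand-in for the 6,707-entry learned ColorBook keyed by Lab coordinates.
colorbook_lab = np.random.uniform([0, -100, -100], [100, 100, 100], (6707, 3))

query = hex_to_lab("#FF5733")
nearest = np.argmin(np.linalg.norm(colorbook_lab - query, axis=1))
print(query.round(1), nearest)
```

Working in Lab rather than RGB matters here because Euclidean distance in Lab roughly tracks perceived color difference, which is what makes smooth interpolation between entries meaningful.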
cs.CV / 72 / 2603.13556

Semantic Aware Feature Extraction for Enhanced 3D Reconstruction

语义感知特征提取以增强三维重建
Nap, Ronald, Xiao, Andy
Abstract
Feature matching is a fundamental problem in computer vision with wide-ranging applications, including simultaneous localization and mapping (SLAM), image stitching, and 3D reconstruction. While recent advances in deep learning have improved keypoint detection and description, most approaches focus primarily on geometric attributes and often neglect higher-level semantic information. This work proposes a semantic-aware feature extraction framework that employs multi-task learning to jointly train keypoint detection, keypoint description, and semantic segmentation. The method is benchmarked against standard feature matching techniques and evaluated in the context of 3D reconstruction. To enhance feature correspondence, a deep matching module is integrated. The system is tested using input from a single monocular fisheye camera mounted on a vehicle and evaluated within a multi-floor parking structure. The proposed approach supports semantic 3D reconstruction with altitude estimation, capturing elevation changes and enabling multi-level mapping. Experimental results demonstrate that the method produces semantically annotated 3D point clouds with improved structural detail and elevation information, underscoring the effectiveness of joint training with semantic cues for more consistent feature matching and enhanced 3D reconstruction.
Chinese Translation
特征匹配是计算机视觉中的一个基本问题,具有广泛的应用,包括同时定位与地图构建(SLAM)、图像拼接和三维重建。尽管深度学习的最新进展改善了关键点的检测和描述,但大多数方法主要关注几何属性,往往忽视了更高层次的语义信息。本研究提出了一种语义感知特征提取框架,采用多任务学习共同训练关键点检测、关键点描述和语义分割。该方法与标准特征匹配技术进行了基准测试,并在三维重建的背景下进行了评估。为了增强特征对应性,集成了深度匹配模块。系统使用安装在车辆上的单目鱼眼相机的输入进行测试,并在多层停车结构中进行评估。所提出的方法支持带有高度估计的语义三维重建,捕捉高度变化并实现多层次映射。实验结果表明,该方法生成了具有改进的结构细节和高度信息的语义标注三维点云,强调了利用语义线索进行联合训练在更一致的特征匹配和增强三维重建中的有效性。
cs.CV / 73 / 2603.13557

Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics

深度学习模型在图像分析中的性能评估:视觉控制和统计指标的考虑
Bertram, Christof A., Ammeling, Jonas, Bartel, Alexander, Beamer, Gillian, Aubreville, Marc
Abstract
Deep learning-based automated image analysis (DL-AIA) has been shown to outperform trained pathologists in tasks related to feature quantification. Reflecting these capabilities, the use of DL-AIA tools is currently extending from proof-of-principle studies to routine applications such as patient samples (diagnostic pathology), regulatory safety assessment (toxicologic pathology), and recurrent research tasks. To ensure that DL-AIA applications are safe and reliable, it is critical to conduct a thorough and objective generalization performance assessment (i.e., the ability of the algorithm to accurately predict patterns of interest) and possibly evaluate model robustness (i.e., the algorithm's capacity to maintain predictive accuracy on images from different sources). In this article, we review the practices for performance assessment in veterinary pathology publications, in which two approaches were identified: 1) exclusive visual performance control (i.e., eyeballing of algorithmic predictions) plus validation of the model's application utilizing secondary performance indices, and 2) statistical performance control (alongside the other methods), which requires dataset creation and separation of a hold-out test set prior to model training. This article compares the strengths and weaknesses of statistical and visual performance control methods. Furthermore, we discuss relevant considerations for rigorous statistical performance evaluation, including metric selection, test dataset image composition, ground truth label quality, resampling methods such as bootstrapping, statistical comparison of multiple models, and evaluation of model stability. It is our conclusion that visual and statistical evaluation have complementary strengths, and a combination of both provides the greatest insight into the DL model's performance and sources of error.
Chinese Translation
基于深度学习的自动化图像分析(DL-AIA)已被证明在特征量化相关任务中优于受过训练的病理学家。与这些能力相关,DL-AIA工具的使用目前正从原理验证研究扩展到常规应用,如患者样本(诊断病理学)、监管安全评估(毒理病理学)和重复研究任务。为了确保DL-AIA应用的安全性和可靠性,进行全面和客观的泛化性能评估(即算法准确预测感兴趣模式的能力)至关重要,并可能评估模型的鲁棒性(即算法在来自不同来源的图像上保持预测准确性的能力)。在本文中,我们回顾了兽医病理学出版物中性能评估的实践,识别出两种方法:1)独占的视觉性能控制(即对算法预测的目测评估)加上利用次级性能指标验证模型应用,以及2)统计性能控制(与其他方法并行),这需要在模型训练之前创建数据集并分离出保留测试集。本文比较了统计和视觉性能控制方法的优缺点。此外,我们讨论了严格的统计性能评估的相关考虑,包括指标选择、测试数据集图像组成、真实标签质量、重采样方法如自助法、多个模型的统计比较以及模型稳定性的评估。我们得出的结论是,视觉和统计评估具有互补的优势,两者的结合提供了对深度学习模型性能和误差来源的最佳洞察。
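Of the resampling methods the article discusses, bootstrapping is the most mechanical to illustrate; a minimal sketch of a percentile confidence interval over per-case Dice scores (the scores here are synthetic stand-ins):

```python
# A minimal sketch of a bootstrap 95% CI for a mean Dice score.
import numpy as np

rng = np.random.default_rng(0)
dice = rng.uniform(0.6, 0.95, size=120)   # per-case Dice scores (stand-in data)

boots = np.array([rng.choice(dice, size=dice.size, replace=True).mean()
                  for _ in range(10_000)])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"mean Dice {dice.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling at the case level, as here, respects the fact that pixels within one image are not independent, which is one of the composition considerations the article raises.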
cs.CV / 74 / 2603.13571

DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models

DiveUp:从多样化视觉基础模型中学习特征上采样
Liu, Xiaoqiong, Fan, Heng
Abstract
Recently, feature upsampling has gained increasing attention owing to its effectiveness in enhancing vision foundation models (VFMs) for pixel-level understanding tasks. Existing methods typically rely on high-resolution features from the same foundation model to achieve upsampling via self-reconstruction. However, relying solely on intra-model features forces the upsampler to overfit to the source model's inherent location misalignment and high-norm artifacts. To address this fundamental limitation, we propose DiveUp, a novel framework that breaks away from single-model dependency by introducing multi-VFM relational guidance. Instead of naive feature fusion, DiveUp leverages diverse VFMs as a panel of experts, utilizing their structural consensus to regularize the upsampler's learning process, effectively preventing the propagation of inaccurate spatial structures from the source model. To reconcile the unaligned feature spaces across different VFMs, we propose a universal relational feature representation, formulated as a local center-of-mass (COM) field, that extracts intrinsic geometric structures, enabling seamless cross-model interaction. Furthermore, we introduce a spikiness-aware selection strategy that evaluates the spatial reliability of each VFM, effectively filtering out high-norm artifacts to aggregate guidance from only the most reliable expert at each local region. DiveUp is a unified, encoder-agnostic framework; a jointly-trained model can universally upsample features from diverse VFMs without requiring per-model retraining. Extensive experiments demonstrate that DiveUp achieves state-of-the-art performance across various downstream dense prediction tasks, validating the efficacy of multi-expert relational guidance. Our code and models are available at: https://github.com/Xiaoqiong-Liu/DiveUp
Chinese Translation
近年来,特征上采样因其在增强视觉基础模型(VFM)以进行像素级理解任务中的有效性而受到越来越多的关注。现有方法通常依赖于来自同一基础模型的高分辨率特征,通过自我重建实现上采样。然而,仅依赖模型内部特征使得上采样器容易过拟合于源模型固有的位置不对齐和高范数伪影。为了解决这一根本性限制,我们提出了DiveUp,一个新颖的框架,通过引入多VFM关系指导,打破了对单一模型的依赖。DiveUp并非简单的特征融合,而是利用多样化的VFM作为专家面板,利用它们的结构共识来规范上采样器的学习过程,有效防止源模型中不准确空间结构的传播。为了调和不同VFM之间不对齐的特征空间,我们提出了一种通用关系特征表示,形式化为局部质心(COM)场,提取内在几何结构,实现无缝的跨模型交互。此外,我们引入了一种关注尖锐度的选择策略,评估每个VFM的空间可靠性,有效过滤高范数伪影,仅在每个局部区域聚合来自最可靠专家的指导。DiveUp是一个统一的、编码器无关的框架;一个联合训练的模型可以普遍地从多样化的VFM中上采样特征,而无需对每个模型进行重新训练。大量实验表明,DiveUp在各种下游密集预测任务中实现了最先进的性能,验证了多专家关系指导的有效性。我们的代码和模型可在以下网址获取:https://github.com/Xiaoqiong-Liu/DiveUp
cs.CV / 75 / 2603.13573

Analytical Logit Scaling for High-Resolution Sea Ice Topology Retrieval from Weakly Labeled SAR Imagery

基于解析Logit缩放的高分辨率海冰拓扑从弱标记SAR图像中提取
Elwaradi, Reda, Gimenez, Julien, Hordoir, Stéphane, Hamma, Mehdi Ait, Chan-Hon-Tong, Adrien, Weissgerber, Flora
Abstract
High-resolution sea ice mapping using Synthetic Aperture Radar (SAR) is crucial for Arctic navigation and climate monitoring. However, operational ice charts provide only coarse, region-level polygons (weak labels), forcing automated segmentation models to struggle with pixel-level accuracy and often yielding under-confident, blurred concentration maps. In this paper, we propose a weakly supervised deep learning pipeline that fuses Sentinel-1 SAR and AMSR-2 radiometry data using a U-Net architecture trained with a region-based loss. To overcome the severe under-confidence caused by weak labels, we introduce an Analytical Logit Scaling method applied post-inference. By dynamically calculating the temperature and bias based on the latent space percentiles (2% and 98%) of each scene, we force a physical binarization of the predictions. This adaptive scaling acts as a topological extractor, successfully revealing fine-grained sea ice fractures (leads) at a 40-meter resolution without requiring any manual pixel-level annotations. Our approach not only resolves local topology but also perfectly preserves regional macroscopic concentrations, achieving 78% accuracy on highly fragmented summer scenes, thereby bridging the gap between weakly supervised learning and high-resolution physical segmentation.
Chinese Translation
利用合成孔径雷达(SAR)进行高分辨率海冰制图对于北极航行和气候监测至关重要。然而,现有的作业冰图仅提供粗略的区域级多边形(弱标签),这使得自动分割模型在像素级准确性上面临挑战,常常导致生成不自信且模糊的浓度图。在本文中,我们提出了一种弱监督深度学习管道,该管道使用基于区域损失训练的U-Net架构融合了Sentinel-1 SAR和AMSR-2辐射计数据。为了克服弱标签导致的严重不自信问题,我们引入了一种在推理后应用的解析Logit缩放方法。通过基于每个场景的潜在空间百分位数(2%和98%)动态计算温度和偏差,我们强制对预测结果进行物理二值化。这种自适应缩放作为拓扑提取器,成功揭示了40米分辨率下的细粒度海冰裂缝(冰间水道),而无需任何手动的像素级标注。我们的方法不仅解决了局部拓扑问题,还完美保留了区域宏观浓度,在高度破碎的夏季场景中实现了78%的准确率,从而弥合了弱监督学习与高分辨率物理分割之间的差距。
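The analytical scaling itself can be written in a few lines: derive a bias and temperature from the 2%/98% logit percentiles of each scene, then push the sigmoid toward binarization. The target margin `k` is an assumed free parameter, and the exact mapping may differ from the paper's:

```python
# A hedged sketch of per-scene analytical logit scaling.
import numpy as np

def analytical_logit_scaling(logits, k=4.0):
    p2, p98 = np.percentile(logits, [2, 98])
    bias = 0.5 * (p2 + p98)               # center between the two percentiles
    temperature = (p98 - p2) / (2 * k)    # map [p2, p98] onto roughly [-k, k]
    z = (logits - bias) / temperature
    return 1.0 / (1.0 + np.exp(-z))       # near-binary probabilities

scene = np.random.randn(512, 512) * 0.3   # under-confident raw logits (stand-in)
probs = analytical_logit_scaling(scene)
```

Because the percentiles are recomputed per scene, the sharpening adapts to each image's logit distribution rather than relying on a single global threshold.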
cs.CV / 76 / 2603.13578

LingoMotion: An Interpretable and Unambiguous Symbolic Representation for Human Motion

LingoMotion:一种可解释且明确的人类运动符号表示
Zhang, Yao, Liu, Zhuchenyang, Xiao, Yu
Abstract
Existing representations for human motion, such as MotionGPT, often operate as black-box latent vectors with limited interpretability and build on joint positions which can cause ambiguity. Inspired by the hierarchical structure of natural languages - from letters to words, phrases, and sentences - we propose LingoMotion, a motion language that facilitates interpretable and unambiguous symbolic representation for both simple and complex human motion. In this paper, we introduce the concept design of LingoMotion, including the definitions of motion alphabet based on joint angles, the morphology for forming words and phrases to describe simple actions like walking and their attributes like speed and scale, as well as the syntax for describing more complex human activities with sequences of words and phrases. The preliminary results, including the implementation and evaluation of motion alphabet using a large-scale motion dataset Motion-X, demonstrate the high fidelity of motion representation.
Chinese Translation
现有的人类运动表示,如 MotionGPT,通常作为黑箱潜在向量运作,缺乏可解释性,并且基于关节位置,这可能导致歧义。受到自然语言的层次结构启发——从字母到单词、短语和句子——我们提出了 LingoMotion,这是一种运动语言,便于对简单和复杂的人类运动进行可解释且明确的符号表示。本文介绍了 LingoMotion 的概念设计,包括基于关节角度的运动字母表定义、用于形成单词和短语以描述简单动作(如行走)及其属性(如速度和规模)的形态,以及用于描述更复杂人类活动的语法,这些活动由单词和短语的序列构成。初步结果,包括使用大规模运动数据集 Motion-X 实现和评估运动字母表,展示了运动表示的高保真度。
cs.CV / 77 / 2603.13590

Opportunistic Cardiac Health Assessment: Estimating Phenotypes from Localizer MRI through Multi-Modal Representations

机会主义心脏健康评估:通过多模态表征从定位磁共振成像中估计表型
Zeybek, Busra Nur, Turgut, Özgün, Zhang, Yundi, Pan, Jiazhen, Graf, Robert, Starck, Sophie, Rueckert, Daniel, Kafali, Sevgi Gokce
Abstract
Cardiovascular diseases are the leading cause of death. Cardiac phenotypes (CPs), e.g., ejection fraction, are the gold standard for assessing cardiac health, but they are derived from cine cardiac magnetic resonance imaging (CMR), which is costly and requires high spatio-temporal resolution. Every magnetic resonance (MR) examination begins with rapid and coarse localizers for scan planning, which are discarded thereafter. Despite non-diagnostic image quality and lack of temporal information, localizers can provide valuable structural information rapidly. In addition to imaging, patient-level information, including demographics and lifestyle, influence the cardiac health assessment. Electrocardiograms (ECGs) are inexpensive, routinely ordered in clinical practice, and capture the temporal activity of the heart. Here, we introduce C-TRIP (Cardiac Tri-modal Representations for Imaging Phenotypes), a multi-modal framework that aligns localizer MRI, ECG signals, and tabular metadata to learn a robust latent space and predict CPs using localizer images as an opportunistic alternative to CMR. By combining these three modalities, we leverage cheap spatial and temporal information from localizers, and ECG, respectively while benefiting from patient-specific context provided by tabular data. Our pipeline consists of three stages. First, encoders are trained independently to learn uni-modal representations. The second stage fuses the pre-trained encoders to unify the latent space. The final stage uses the enriched representation space for CP prediction, with inference performed exclusively on localizer MRI. Proposed C-TRIP yields accurate functional CPs, and high correlations for structural CPs. Since localizers are inherently rapid and low-cost, our C-TRIP framework could enable better accessibility for CP estimation.
Chinese Translation
心血管疾病是导致死亡的主要原因。心脏表型(CPs),例如射血分数,是评估心脏健康的金标准,但这些表型是通过心脏磁共振成像(CMR)获得的,该过程成本高且需要高时空分辨率。每次磁共振(MR)检查都以快速且粗略的定位扫描开始,以进行扫描规划,而这些定位图像在之后会被丢弃。尽管定位图像的诊断质量较差且缺乏时间信息,但它们可以快速提供有价值的结构信息。除了影像学,患者级别的信息,包括人口统计学和生活方式,也会影响心脏健康评估。心电图(ECGs)成本低廉,临床实践中常规开具,并能捕捉心脏的时间活动。在此,我们介绍C-TRIP(心脏三模态表征用于影像表型),这是一个多模态框架,旨在对齐定位磁共振成像、心电信号和表格元数据,以学习一个稳健的潜在空间,并使用定位图像作为CMR的机会主义替代来预测CPs。通过结合这三种模态,我们利用来自定位图像的低成本空间和时间信息,以及来自心电图的时间信息,同时受益于表格数据提供的患者特定上下文。我们的流程包括三个阶段。首先,独立训练编码器以学习单模态表征。第二阶段融合预训练的编码器以统一潜在空间。最后阶段利用丰富的表征空间进行CP预测,推理仅在定位磁共振成像上进行。所提出的C-TRIP能够产生准确的功能性CPs,并对结构性CPs具有高相关性。由于定位图像本质上快速且成本低廉,我们的C-TRIP框架可能会提高CP估计的可及性。
cs.CV / 78 / 2603.13609

A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas

基于网格的电动滑板车需求表示与深度学习的时间输入设计框架:来自德克萨斯州奥斯丁的证据
Demissie, Mohammad Sahnoon Merkebe Getachew, Souza, Roberto
Abstract
Despite progress in deep learning for shared micromobility demand prediction, the systematic design and statistical validation of temporal input structures remain underexplored. Temporal features are often selected heuristically, even though historical demand strongly affects model performance and generalizability. This paper introduces a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image demand prediction. Using large-scale e-scooter data from Austin, Texas, we build a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images. The pipeline includes trip filtering, mapping Census Tracts to spatial locations, grid construction, demand aggregation, and creation of a global activity mask that limits evaluation to historically active areas. This representation supports consistent spatial learning while preserving demand patterns. We then introduce a combined correlation- and error-based procedure to identify informative historical inputs. Optimal temporal depth is selected through an ablation study using a baseline UNET model with paired non-parametric tests and Holm correction. The resulting temporal structures capture short-term persistence as well as daily and weekly cycles. Compared with adjacent-hour and fixed-period baselines, the proposed design reduces mean squared error by up to 37 percent for next-hour prediction and 35 percent for next-24-hour prediction. These results highlight the value of principled dataset construction and statistically validated temporal input design for spatiotemporal micromobility demand prediction.
Chinese Translation
尽管在共享微移动需求预测的深度学习方面取得了一定进展,但时间输入结构的系统设计和统计验证仍然未得到充分探索。时间特征通常是通过启发式方法选择的,尽管历史需求对模型性能和泛化能力有着重要影响。本文介绍了一种可重复的数据处理流程和一种基于统计的时间输入结构设计方法,用于图像到图像的需求预测。利用来自德克萨斯州奥斯丁的大规模电动滑板车数据,我们通过将行程记录转换为每小时的接送需求图像,构建了一个基于网格的时空数据集。该流程包括行程过滤、将普查区映射到空间位置、网格构建、需求聚合,以及创建一个全球活动掩码,以限制评估在历史活跃区域内进行。这种表示方法支持一致的空间学习,同时保留需求模式。然后,我们引入了一种结合相关性和误差的程序来识别信息丰富的历史输入。通过使用基线 UNET 模型进行消融研究,采用配对非参数检验和霍尔姆校正选择最佳时间深度。最终的时间结构捕捉了短期持续性以及日常和每周周期。与相邻小时和固定周期基线相比,所提出的设计在下一小时预测中将均方误差降低了多达 37%,在下一 24 小时预测中降低了 35%。这些结果突显了原则性数据集构建和经过统计验证的时间输入设计在时空微移动需求预测中的价值。
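The core rasterization step, converting trip records into hourly pickup-demand images on a fixed grid, can be sketched as below; the column names and grid size are assumptions, not the paper's schema:

```python
# A hedged sketch: trip records -> hourly pickup-count images.
import numpy as np
import pandas as pd

def hourly_demand_images(trips: pd.DataFrame, shape=(32, 32)):
    """trips needs integer grid coords `row`, `col` and a datetime `start`."""
    hours = trips["start"].dt.floor("h")
    images = {}
    for hour, grp in trips.groupby(hours):
        img, _, _ = np.histogram2d(grp["row"], grp["col"], bins=shape,
                                   range=[[0, shape[0]], [0, shape[1]]])
        images[hour] = img   # (32, 32) pickup counts for this hour
    return images

trips = pd.DataFrame({
    "row": np.random.randint(0, 32, 500),
    "col": np.random.randint(0, 32, 500),
    "start": pd.to_datetime("2019-06-01")
             + pd.to_timedelta(np.random.randint(0, 48, 500), "h"),
})
imgs = hourly_demand_images(trips)
```

Stacking selected hours from this dictionary (previous hour, same hour yesterday, same hour last week) is then what the paper's temporal-input design chooses among statistically.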
cs.CV / 79 / 2603.13615

Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis

用于真实感手物交互合成的自我中心世界模型
Li, Dayou, Liu, Lulin, Liu, Bangya, Zhou, Shijie, Feng, Jiu, Lu, Ziqi, Zheng, Minghui, You, Chenyu, Fan, Zhiwen
Abstract
To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
Chinese Translation
为了作为具身人工智能的可扩展数据源,世界模型应作为真正的模拟器,仅根据用户的动作推断交互动态,而不是依赖于特权的未来物体状态的简单条件视频生成。在这种背景下,自我中心的人物-物体交互(HOI)世界模型对于预测物理基础的第一人称展开至关重要。然而,构建这样的模型面临着巨大的挑战,因为快速的头部运动、严重的遮挡以及高自由度的手部关节运动会突然改变接触拓扑。因此,现有的方法通常通过依赖已知的未来物体轨迹的条件视频生成来规避这些物理挑战。我们提出了EgoHOI,一个自我中心的HOI世界模型,摆脱了这种捷径,仅凭动作信号模拟真实感且接触一致的交互。为了确保在没有未来状态输入的情况下保持物理准确性,EgoHOI从3D估计中提炼几何和运动学先验,形成物理信息嵌入。这些嵌入使自我中心的展开朝着物理有效的动态进行正则化。在HOT3D数据集上的实验表明,相较于强基线模型,EgoHOI 持续提升了性能,而消融实验验证了我们物理信息设计的有效性。
cs.CV / 80 / 2603.13628

Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models

基于可定位性引导的自适应推理在图像地理定位中的应用:视觉-语言模型的探索
Yu, Bo, Yang, Fengze, Liu, Yiming, Wang, Chao, Luo, Xuewen, Li, Taozhe, Ke, Ruimin, Zhou, Xiaofan, Liu, Chenxi
Abstract
The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image's suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.
Chinese Translation
视觉-语言模型(VLMs)的出现为通过检索增强生成(RAG)和推理驱动推断进行全球图像地理定位引入了新的范式。然而,RAG 方法受到检索数据库质量的限制,而推理驱动的方法未能内化图像的可定位性,依赖于低效的固定深度推理路径,这增加了幻觉现象并降低了准确性。为克服这些局限性,我们引入了一种优化的可定位性评分,该评分量化了图像在地理定位中进行深度推理的适宜性。利用这一指标,我们策划了 Geo-ADAPT-51K,这是一个可定位性分层的推理数据集,丰富了复杂视觉场景的增强推理轨迹。在此基础上,我们提出了一种两阶段的群体相对策略优化(GRPO)课程,配备定制的奖励函数,以调节自适应推理深度、视觉基础和层次地理准确性。我们的框架 Geo-ADAPT 学习了一种自适应推理策略,在多个地理定位基准测试中实现了最先进的性能,并通过自适应和高效的推理显著减少了幻觉现象。
cs.CV / 81 / 2603.13652

Causal Attribution via Activation Patching

通过激活补丁进行因果归因
Izadi, Amirmohammad, Banayeeanzade, Mohammadali, Mirrokni, Alireza, Hasani, Hosein, Bagherian, Mobin, Mehri, Faridoun, Baghshah, Mahdieh Soleymani
Abstract
Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing gradient-based and perturbation-based techniques often fail to isolate the causal contribution of internal representations associated with individual image patches. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers, and input-level perturbations can be poor proxies for patch importance, since they may fail to reconstruct the internal evidence actually used by the model. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal effect of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing class-relevant evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful and localized attributions.
Chinese Translation
视觉变换器(ViTs)的归因方法旨在识别影响模型预测的图像区域,但生成真实且定位良好的归因仍然具有挑战性。现有的基于梯度和基于扰动的技术往往无法孤立与单个图像补丁相关的内部表征的因果贡献。主要挑战在于,与类别相关的证据是通过跨层补丁令牌之间的交互形成的,而输入级别的扰动可能无法有效反映补丁的重要性,因为它们可能无法重建模型实际使用的内部证据。我们提出了通过激活补丁进行因果归因(Causal Attribution via Activation Patching, CAAP),该方法通过直接干预内部激活而非使用学习的掩模或合成扰动模式,来估计单个图像补丁对ViT预测的贡献。对于每个补丁,CAAP将相应的源图像激活插入中性目标上下文中,覆盖中间层的范围,并使用生成的目标类别分数作为归因信号。生成的归因图反映了与补丁相关的内部表征对模型预测的因果影响。因果干预作为补丁影响的原则性度量,通过捕捉初始表征形成后的类别相关证据,避免了可能降低空间特异性的后层全局混合。在多个ViT骨干网络和标准指标下,CAAP显著优于现有方法,并生成更真实且定位更好的归因。
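Activation patching itself is a standard intervention and is easy to sketch: cache the source image's token activations per block, then replay one patch token's activations into a neutral run over an intermediate range of layers and read the target-class score. The toy ViT below is a stand-in for a real backbone, and real implementations typically use forward hooks rather than an explicit loop:

```python
# A hedged sketch of activation patching on a toy ViT-like stack.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.mix = nn.Linear(d, d)
    def forward(self, x):            # x: (B, tokens, d)
        return x + torch.relu(self.mix(x))

blocks = nn.ModuleList([TinyBlock() for _ in range(6)])
head = nn.Linear(32, 10)

def run(x, patch_idx=None, cached=None, layers=range(2, 5)):
    acts = {}
    for i, blk in enumerate(blocks):
        x = blk(x)
        if cached is None:
            acts[i] = x.detach().clone()          # cache source activations
        elif i in layers:                         # intervene on middle layers only
            x = x.clone()
            x[:, patch_idx] = cached[i][:, patch_idx]  # insert one token's activation
    return head(x.mean(dim=1)), acts

source, neutral = torch.randn(1, 197, 32), torch.zeros(1, 197, 32)
_, cache = run(source)                                   # pass 1: cache source
logits, _ = run(neutral, patch_idx=42, cached=cache)     # pass 2: patch token 42
attribution = logits[0, 3].item()                        # target-class score
```

Repeating the second pass for every patch index yields an attribution map whose values reflect causal effects of patch-associated representations, in the spirit of the method described above.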
cs.CV / 82 / 2603.13659

FMS$^2$: Unified Flow Matching for Segmentation and Synthesis of Thin Structures

FMS$^2$: 统一流匹配用于薄结构的分割与合成
Asadi, Babak, Wu, Peiyang, Golparvar-Fard, Mani, Shah, Viraj, Hajj, Ramez
Abstract
Segmenting thin structures like infrastructure cracks and anatomical vessels is a task hampered by topology-sensitive geometry, high annotation costs, and poor generalization across domains. Existing methods address these challenges in isolation. We propose FMS$^2$, a flow-matching framework with two modules. (1) SegFlow is a 2.96M-parameter segmentation model built on a standard encoder-decoder backbone that recasts prediction as continuous image $\rightarrow$ mask transport. It learns a time-indexed velocity field with a flow-matching regression loss and outputs the mask via ODE integration, rather than supervising only end-state logits. This trajectory-level supervision improves thin-structure continuity and sharpness, compared with tuned topology-aware loss baselines, without auxiliary topology heads, post-processing, or multi-term loss engineering. (2) SynFlow is a mask-conditioned mask $\rightarrow$ image generator that produces pixel-aligned synthetic image-mask pairs. It injects mask geometry at multiple scales and emphasizes boundary bands via edge-aware gating, while a controllable mask generator expands sparsity, width, and branching regimes. On five crack and vessel benchmarks, SegFlow alone outperforms strong CNN, Transformer, Mamba, and generative baselines, improving the volumetric metric (mean IoU) from 0.511 to 0.599 (+17.2%) and reducing the topological metric (Betti matching error) from 82.145 to 51.524 (-37.3%). When training with limited labels, augmenting SegFlow with SynFlow-generated pairs recovers near-full performance using 25% of real annotations and improves cross-domain IoU by 0.11 on average. Unlike classical data augmentation that promotes invariance via label-preserving transforms, SynFlow provides pixel-aligned paired supervision with controllable structural shifts (e.g., sparsity, width, branching), which is particularly effective under domain shift.
Chinese Translation
薄结构的分割任务,例如基础设施裂缝和解剖血管,受到拓扑敏感几何、高标注成本和跨领域泛化能力差的影响。现有方法通常孤立地解决这些挑战。我们提出了FMS$^2$,一个具有两个模块的流匹配框架。(1) SegFlow是一个具有2.96M参数的分割模型,基于标准的编码器-解码器骨干网络,将预测重新表述为连续图像$\rightarrow$掩膜传输。它通过流匹配回归损失学习一个时间索引的速度场,并通过常微分方程(ODE)积分输出掩膜,而不仅仅是监督最终状态的logits。与调优的拓扑感知损失基线相比,这种轨迹级的监督提高了薄结构的连续性和清晰度,而无需辅助拓扑头、后处理或多项损失工程。(2) SynFlow是一个掩膜条件的掩膜$\rightarrow$图像生成器,生成像素对齐的合成图像-掩膜对。它在多个尺度上注入掩膜几何,并通过边缘感知门控强调边界带,同时一个可控的掩膜生成器扩展稀疏性、宽度和分支模式。在五个裂缝和血管基准测试中,仅SegFlow就超越了强大的CNN、Transformer、Mamba和生成基线,将体积度量(平均IoU)从0.511提高到0.599(+17.2%),并将拓扑度量(Betti匹配误差)从82.145降低到51.524(-37.3%)。在有限标签的训练中,使用SynFlow生成的对增强SegFlow能够在使用25%的真实标注时恢复近乎完整的性能,并平均提高跨领域IoU 0.11。与通过保持标签不变的变换促进不变性的经典数据增强不同,SynFlow提供了像素对齐的配对监督,并具有可控的结构变化(例如,稀疏性、宽度、分支),在领域转移下特别有效。
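For readers unfamiliar with flow matching, the trajectory-level supervision reduces to a few lines: regress a velocity field along straight image-to-mask paths, then integrate it at inference. The sketch below is generic; v_net, its conditioning, and the Euler step count are placeholders, not SegFlow's actual 2.96M-parameter architecture.

import torch
import torch.nn.functional as F

def flow_matching_loss(v_net, image, mask):
    """Regress a time-indexed velocity field along linear image->mask paths."""
    b = image.shape[0]
    t = torch.rand(b, 1, 1, 1, device=image.device)
    x_t = (1 - t) * image + t * mask            # point on the straight path
    target_v = mask - image                     # constant velocity of that path
    return F.mse_loss(v_net(x_t, t.flatten(), image), target_v)

@torch.no_grad()
def predict_mask(v_net, image, steps=10):
    """Output the mask by Euler-integrating the learned ODE from the image."""
    x, dt = image.clone(), 1.0 / steps
    for k in range(steps):
        t = torch.full((image.shape[0],), k * dt, device=image.device)
        x = x + dt * v_net(x, t, image)
    return x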
cs.CV / 83 / 2603.13660

Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

从掩膜引导的自监督学习中学习可泛化的3D医学图像表征
Gao, Yunhe, Zhang, Yabin, Wang, Chong, Liu, Jiaming, Varma, Maya, Delbrouck, Jean-Benoit, Chaudhari, Akshay, Langlotz, Curtis
Abstract
Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40\% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available: https://github.com/Stanford-AIMI/MASS.
Chinese Translation
基础模型通过从大规模无标签数据中学习通用表征,已在视觉和语言领域带来了变革,但3D医学成像尚缺乏类似的方法。现有的自监督方法依赖于低级重建或对比目标,这些方法未能捕捉对医学图像分析至关重要的解剖语义,限制了其向下游任务的迁移。我们提出了MASS(掩膜引导自监督学习),将上下文分割视为学习通用医学成像表征的前提任务。MASS的关键见解在于,自动生成的类无关掩膜为学习语义丰富的表征提供了足够的结构监督。通过在涵盖解剖结构和病理发现的数千个多样化掩膜提案上进行训练,MASS学习了医学结构的语义定义:外观、形状、空间上下文和解剖关系的整体组合。我们在不同数据模式下展示了MASS的有效性:从对单个数据集的小规模预训练(20-200个扫描)到对5000个CT、MRI和PET体积的大规模多模态预训练,均无需注释。MASS表现出:(i) 在新结构上的少量样本分割,(ii) 在仅有20-40\%标记数据的情况下与完全监督相匹配,同时在低数据模式下的Dice分数超过自监督基线20以上,(iii) 在未见病理上的冻结编码器分类与数千个样本的完全监督训练相匹配。掩膜引导的自监督预训练捕获了广泛可泛化的知识,为没有专家注释的3D医学成像基础模型开辟了道路。代码可用: https://github.com/Stanford-AIMI/MASS。
cs.CV / 84 / 2603.13667

TSDCRF: Balancing Privacy and Multi-Object Tracking via Time-Series CRF and Normalized Control Penalty

TSDCRF:通过时间序列条件随机场和归一化控制惩罚平衡隐私与多目标跟踪
Ma, Bo, Wu, Jinsong, Yan, Weiqi
Abstract
Multi-object tracking in video often requires appearance or location cues that can reveal sensitive identity information, while adding privacy-preserving noise typically disrupts cross-frame association and causes ID switches or target loss. We propose TSDCRF, a plug-in refinement framework that balances privacy and tracking by combining three components: (i) $(\varepsilon,\delta)$-differential privacy via calibrated Gaussian noise on sensitive regions under a configurable privacy budget; (ii) a Normalized Control Penalty (NCP) that down-weights unstable or conflicting class predictions before noise injection to stabilize association; and (iii) a time-series dynamic conditional random field (DCRF) that enforces temporal consistency and corrects trajectory deviation after noise, mitigating ID switches and improving resilience to trajectory hijacking. The pipeline is agnostic to the choice of detector and tracker (e.g., YOLOv4 and DeepSORT). We evaluate on MOT16, MOT17, Cityscapes, and KITTI. Results show that TSDCRF achieves a better privacy-utility trade-off than white noise and prior methods (NTPD, PPDTSA): lower KL-divergence shift, lower tracking RMSE, and improved robustness under trajectory hijacking while preserving privacy. Source code: https://github.com/mabo1215/TSDCRF.git
Chinese Translation
视频中的多目标跟踪通常需要外观或位置线索,这些线索可能会泄露敏感的身份信息,而添加保护隐私的噪声通常会干扰跨帧关联,导致身份切换或目标丢失。我们提出了TSDCRF,这是一种插件式的精细化框架,通过结合三个组件来平衡隐私与跟踪:(i)在可配置的隐私预算下,通过对敏感区域施加校准的高斯噪声实现$(\varepsilon,\delta)$-差分隐私;(ii)归一化控制惩罚(NCP),在噪声注入之前对不稳定或冲突的类别预测进行降权,以稳定关联;(iii)时间序列动态条件随机场(DCRF),在噪声后强制执行时间一致性并纠正轨迹偏差,从而减轻身份切换并增强对轨迹劫持的抵抗力。该流程对检测器和跟踪器的选择(例如,YOLOv4和DeepSORT)不敏感。我们在MOT16、MOT17、Cityscapes和KITTI数据集上进行了评估。结果表明,TSDCRF在隐私与效用之间实现了比白噪声和先前方法(NTPD,PPDTSA)更好的权衡:较低的KL散度偏移、较低的跟踪均方根误差(RMSE)以及在轨迹劫持下的增强鲁棒性,同时保持隐私。源代码可在https://github.com/mabo1215/TSDCRF.git获取。
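Component (i) is the standard analytic Gaussian mechanism; a minimal NumPy sketch, assuming an L2 sensitivity of 1 for the masked pixel values and the classic sigma bound, which holds for epsilon < 1. The NCP and DCRF stages are not reproduced here.

import numpy as np

def gaussian_mechanism_sigma(epsilon, delta, sensitivity=1.0):
    """Noise scale for (epsilon, delta)-DP via the Gaussian mechanism."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def privatize_regions(frame, sensitive_mask, epsilon=0.5, delta=1e-5):
    """Add calibrated Gaussian noise only where the mask flags sensitive pixels."""
    sigma = gaussian_mechanism_sigma(epsilon, delta)
    noise = np.random.normal(0.0, sigma, size=frame.shape)
    return np.where(sensitive_mask, frame + noise, frame)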
cs.CV / 85 / 2603.13669

SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment

SHAMISA:用于自监督无参考图像质量评估的隐式结构关联的形状建模
Naseri, Mahdi, Wang, Zhou
Abstract
No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.
Chinese Translation
无参考图像质量评估(NR-IQA)旨在在没有访问高质量参考图像的情况下估计感知质量。学习NR-IQA模型面临一个根本性瓶颈:需要大量昂贵的人类感知标签。我们提出了SHAMISA,这是一种非对比自监督框架,通过利用显式结构化关系监督,从未标记的失真图像中学习。与之前施加严格的二元相似性约束的方法不同,SHAMISA引入了隐式结构关联,定义为软的、可控的关系,这些关系既考虑失真又对内容敏感,从合成元数据和内在特征结构中推断得出。一个关键创新是我们的组合失真引擎,它从连续参数空间生成无数种退化形式,分组使得每次只有一个失真因素变化。这使得在训练过程中对表征相似性进行细粒度控制成为可能:具有共享失真模式的图像在嵌入空间中被拉近,而严重性变化则产生结构化、可预测的偏移。我们通过双源关系图整合这些见解,这些图编码了已知的退化特征和新兴的结构亲和性,以指导整个训练过程中的学习。一个卷积编码器在这种监督下进行训练,然后在推理时被冻结,其特征通过线性回归器进行质量预测。在合成、真实和跨数据集的NR-IQA基准上进行的广泛实验表明,SHAMISA在整体性能上表现出色,并且在跨数据集的泛化和鲁棒性方面有所改善,所有这些都无需人类质量注释或对比损失。
cs.CV / 86 / 2603.13682

Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning

每个错误都有其严重性:用于多类多实例学习的非对称错误严重性训练
Hong, Sungrae, Jeong, Jiwon, Shin, Jisu, Han, Donghee, Lee, Sol, Kim, Kyungeun, Yi, Mun Yong
Abstract
Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass settings, leaving clinically critical errors unaddressed. We propose a mistake-severity-aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment, a semantic feature remix applied to the instance bag that robustly trains class priority and accommodates clinical cases involving multiple symptoms. An asymmetric Mikel's Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We present additional experimental results on natural domain data to demonstrate the generalizability of our proposed method beyond medical contexts.
Chinese Translation
多实例学习(MIL)作为一种有前景的全幻灯片图像(WSI)诊断范式,能够在有限的标注下实现有效学习。然而,现有的MIL框架忽视了诊断优先级,未能区分多类中的误分类严重性,导致临床关键错误未得到解决。我们提出了一种错误严重性感知的训练策略,将诊断类别组织成层次结构,每个层次使用严重性加权的交叉熵损失进行优化,对高严重性误分类施加更强的惩罚。此外,通过概率对齐强制执行层次一致性,这是一种应用于实例包的语义特征重混合方法,以稳健地训练类别优先级并适应涉及多种症状的临床案例。我们还引入了一种基于非对称Mikel's Wheel的度量来量化特定于医学领域的错误严重性。在具有挑战性的公共和真实世界内部数据集上的实验表明,与现有方法相比,我们的方法显著减轻了MIL诊断中的关键错误。我们还在自然领域数据上呈现了额外的实验结果,以展示我们提出的方法在医学背景之外的可推广性。
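A minimal reading of the severity-weighted cross-entropy: scale each sample's loss by an asymmetric cost looked up at (true class, predicted class). The 3-class severity matrix below is a toy stand-in, not the paper's Mikel's Wheel-based metric.

import torch
import torch.nn.functional as F

def severity_weighted_ce(logits, targets, severity):
    """severity[i, j] = cost of predicting class j when class i is true
    (asymmetric, so one direction of a confusion can cost more than the other)."""
    nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets, reduction="none")
    weights = severity[targets, logits.argmax(dim=-1)]
    return (weights * nll).mean()

# toy example: missing the most severe class (row 0 -> col 2) costs the most
severity = torch.tensor([[1.0, 2.0, 5.0],
                         [2.0, 1.0, 2.0],
                         [3.0, 2.0, 1.0]])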
cs.CV / 87 / 2603.13708

RSEdit: Text-Guided Image Editing for Remote Sensing

RSEdit:基于文本引导的遥感图像编辑
Zhenyuan, Chen, Zechuan, Zhang, Feng, Zhang
Abstract
General-domain text-guided image editors achieve strong photorealism but introduce artifacts, hallucinate objects, and break the orthographic constraints of remote sensing (RS) imagery. We trace this gap to two high-level causes: (i) limited RS world knowledge in pre-trained models, and (ii) conditioning schemes that misalign with the bi-temporal structure and spatial priors of Earth observation data. We present RSEdit, a unified framework that adapts pretrained text-to-image diffusion models - both U-Net and DiT - into instruction-following RS editors via channel concatenation and in-context token concatenation. Trained on over 60,000 semantically rich bi-temporal remote sensing image pairs, RSEdit learns precise, physically coherent edits while preserving geospatial content. Experiments show clear gains over general and commercial baselines, demonstrating strong generalizability across diverse scenarios including disaster impacts, urban growth, and seasonal shifts, positioning RSEdit as a robust data engine for downstream analysis. We will release code, pretrained models, evaluation protocols, training logs, and generated results for full reproducibility. Code: https://github.com/Bili-Sakura/RSEdit-Preview
Chinese Translation
通用领域的文本引导图像编辑器能够实现强大的真实感,但会引入伪影、幻觉对象,并破坏遥感(RS)图像的正投影约束。我们将这一差距追溯到两个高层次的原因:(i)预训练模型中对遥感世界知识的限制,以及(ii)与地球观测数据的双时态结构和空间先验不一致的条件方案。我们提出了RSEdit,一个统一框架,通过通道连接和上下文标记连接,将预训练的文本到图像扩散模型(包括U-Net和DiT)适配为遵循指令的遥感编辑器。RSEdit在超过60,000对语义丰富的双时态遥感图像对上进行训练,学习到精确、物理一致的编辑,同时保留地理空间内容。实验结果显示,相较于通用和商业基线,RSEdit在多种场景(包括灾害影响、城市增长和季节变化)中展现出显著的提升,证明了其强大的泛化能力,使RSEdit成为下游分析的强大数据引擎。我们将发布代码、预训练模型、评估协议、训练日志和生成结果,以确保完全可重复性。代码链接:https://github.com/Bili-Sakura/RSEdit-Preview
cs.CV / 88 / 2603.13719

Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking

多模态跟踪的稀疏-密集专家混合适配器
Zhu, Yabin, Li, Jianqi, Li, Chenglong, Wang, Jiaxiang, Gu, Chengjie, Tang, Jin
Abstract
Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information. Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
Chinese Translation
参数高效微调(PEFT)技术,如提示和适配器,在多模态跟踪中得到了广泛应用,因为它们缓解了全模型微调带来的问题,包括时间效率低、高资源消耗、参数存储负担和灾难性遗忘。然而,由于跨模态的异质性,大多数现有的基于PEFT的方法在一个具有共享参数的统一框架内表现多模态特征方面面临挑战。为了解决这个问题,我们提出了一种新颖的稀疏-密集专家混合适配器(SDMoEA)框架,以支持基于PEFT的多模态跟踪,采用统一的模型结构。具体而言,我们设计了一个SDMoE模块作为多模态适配器,以高效地建模模态特定和共享信息。SDMoE由一个稀疏MoE和一个密集共享MoE组成:前者捕获模态特定信息,而后者建模共享的跨模态信息。此外,为克服现有跟踪方法在多级多模态融合中建模高阶相关性的局限,我们引入了一种基于Gram的语义对齐超图融合(GSAHF)模块。该模块首先利用Gram矩阵进行跨模态语义对齐,确保构建的超图准确反映模态之间的语义相似性和高阶依赖关系。然后,将对齐后的特征整合到超图结构中,以利用其建模高阶关系的能力,从而实现多级多模态信息的深度融合。大量实验表明,所提方法在多个多模态跟踪基准(包括LasHeR、RGBT234、VTUAV、VisEvent、COESOT、DepthTrack和VOT-RGBD2022)上相比其他PEFT方法具有更优的性能。
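The sparse-plus-dense split can be sketched as a top-k routed mixture of experts combined with one always-active shared expert inside a residual adapter; the dimensions, expert count, and gating below are illustrative, not the paper's SDMoE configuration.

import torch
import torch.nn as nn

class SparseDenseMoEAdapter(nn.Module):
    """Top-k sparse experts capture modality-specific signals; a dense,
    always-on shared expert models cross-modal information."""
    def __init__(self, dim, num_experts=4, k=1, hidden=64):
        super().__init__()
        make = lambda: nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.experts = nn.ModuleList([make() for _ in range(num_experts)])
        self.shared = make()
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                       # x: [batch, tokens, dim]
        topv, topi = self.gate(x).topk(self.k, dim=-1)
        w = topv.softmax(dim=-1)
        sparse = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                hit = topi[..., slot] == e      # tokens routed to expert e
                if hit.any():
                    sparse[hit] += w[..., slot][hit].unsqueeze(-1) * expert(x[hit])
        return x + sparse + self.shared(x)      # residual adapter output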
cs.CV / 89 / 2603.13728

Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search

Bodhi VLM:通过自下而上和自上而下特征搜索实现视觉主干和视觉语言模型中的层次视觉表示的隐私对齐建模
Ma, Bo, Wu, Jinsong, Yan, Wei Qi
Abstract
Learning systems that preserve privacy often inject noise into hierarchical visual representations; a central challenge is to \emph{model} how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision--language models (VLMs). We propose \emph{Bodhi VLM}, a \emph{privacy-alignment modeling} framework for \emph{hierarchical neural representations}: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable \emph{budget-alignment signal} by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale $c/\epsilon$). The output is reference-relative and is \emph{not} a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the \emph{visual encoders} of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends; EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1). Results are reported as mean$\pm$std over multiple seeds with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than a post hoc audit only. Source code: \href{https://github.com/mabo1215/bodhi-vlm.git}{Bodhi-VLM GitHub repository}
Chinese Translation
保护隐私的学习系统通常会在层次视觉表示中注入噪声;一个核心挑战是如何建模这些扰动与声明的隐私预算的对齐方式,使其在视觉主干和视觉语言模型(VLMs)中具有可解释性和适用性。我们提出了Bodhi VLM,一个用于层次神经表示的隐私对齐建模框架:它(1)通过基于NCP和MDAV的聚类将敏感概念与层级分组联系起来;(2)利用多尺度表示(例如,特征金字塔或视觉编码器层)中的自下而上(BUA)和自上而下(TDA)策略定位敏感特征区域;(3)使用期望最大化隐私评估(EMPA)模块,通过将拟合的敏感特征分布与评估者指定的参考(例如,具有尺度$c/\epsilon$的拉普拉斯或高斯分布)进行比较,生成可解释的预算对齐信号。输出是相对于参考而言的,并不是正式的差分隐私估计器。我们对层次特征结构上的BUA/TDA进行了形式化,并在物体检测器(YOLO、PPDPTS、DETR)和VLM的视觉编码器(CLIP、LLaVA、BLIP)上验证了该框架。BUA和TDA产生了可比的偏差趋势;EMPA在报告的设置下提供了稳定的对齐信号。我们与通用差异基线(卡方、K-L、MMD)以及与任务相关的基线(MomentReg、NoiseMLE、Wass-1)进行了比较。结果以多个种子的均值$\pm$标准差报告,置信区间在补充材料中提供。此项工作为隐私对齐的层次表示提供了一个可学习、可解释的建模视角,而不仅仅是事后审计。源代码:Bodhi-VLM GitHub 仓库(https://github.com/mabo1215/bodhi-vlm.git)
cs.CV / 90 / 2603.13739

UniVid: Pyramid Diffusion Model for High Quality Video Generation

UniVid:用于高质量视频生成的金字塔扩散模型
Xiao, Xinyu, Yang, Binbin, Li, Tingtian, Yu, Yipeng, Lei, Sen
Abstract
Diffusion-based text-to-video (T2V) and image-to-video (I2V) generation have emerged as a prominent research focus. However, integrating the two generative paradigms into a unified model remains challenging. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects' appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image clues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model for generating temporally coherent frames by introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism whose attention scores can be freely re-weighted to interpolate between single-modality and dual-modality control during inference. Extensive experiments showcase that our UniVid achieves superior temporal coherence on T2V, I2V, and (T+I)2V tasks.
Chinese Translation
基于扩散的文本到视频生成(T2V)或图像到视频生成(I2V)已成为一个重要的研究焦点。然而,将这两种生成范式整合为一个统一模型仍然面临挑战。本文提出了一种统一的视频生成模型(UniVid),该模型结合了文本提示和参考图像的混合条件。在这两种可用控制下,我们的模型能够从文本提示中提取对象的外观及其运动描述,同时从图像线索中获取纹理细节和结构信息,以指导视频生成过程。具体而言,我们通过引入时间金字塔跨帧时空注意力模块和卷积,扩展了预训练的文本到图像扩散模型,以生成时间上连贯的帧。为了支持双模态控制,我们引入了一种双流跨注意力机制,其注意力得分可以在推理过程中自由重新加权,以实现单一和双模态控制之间的插值。大量实验表明,我们的UniVid在T2V、I2V和(T+I)2V任务上实现了优越的时间一致性。
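The inference-time modality interpolation can be sketched with two independent cross-attention streams blended by a scalar. A simplified single-head version; the blending coefficient alpha and tensor shapes are assumptions rather than UniVid's exact mechanism.

import torch
import torch.nn.functional as F

def dual_stream_cross_attention(q, k_txt, v_txt, k_img, v_img, alpha=0.5):
    """Attend to text and image conditions separately, then re-weight the two
    streams: alpha=1 is text-only (T2V-like), alpha=0 image-only (I2V-like)."""
    d = q.shape[-1] ** 0.5
    a_txt = F.softmax(q @ k_txt.transpose(-2, -1) / d, dim=-1) @ v_txt
    a_img = F.softmax(q @ k_img.transpose(-2, -1) / d, dim=-1) @ v_img
    return alpha * a_txt + (1.0 - alpha) * a_img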
cs.CV / 91 / 2603.13740

Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

Sky2Ground:一个用于不同高度下场地建模的基准
Wang, Zengyan, Mitra, Sirshapan, Modi, Rajat, Lim, Grace, Rawat, Yogesh
Abstract
We introduce Sky2Ground, a three-view dataset designed for varying altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state-of-the-art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that the use of satellite imagery often degrades performance, highlighting the challenges under large altitude variations. We also examine reconstruction methods, highlighting the challenges introduced by sparse geometric overlap, varying perspectives, and the use of real imagery, which often introduces noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model that enhances cross-view consistency when incorporating satellite imagery with a curriculum-based training strategy to progressively incorporate more satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in terms of absolute performance. Sky2Ground and SkyNet together establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research.
Chinese Translation
我们介绍了Sky2Ground,这是一个为不同高度相机定位、对应学习和重建而设计的三视图数据集。该数据集结合了结构化的合成图像和真实的野外图像,提供了受控的多视图几何和真实场景噪声。51个场地中的每一个都包含数千张卫星、空中和地面图像,覆盖广泛的高度范围和几乎正交的视角,从而能够在全球到地方的背景下进行严格评估。我们对最先进的姿态估计模型进行了基准测试,包括MASt3R、DUSt3R、Map Anything和VGGT,并观察到使用卫星图像往往会降低性能,突显了在大高度变化下的挑战。我们还考察了重建方法,强调了稀疏几何重叠、视角变化以及使用真实图像所带来的挑战,这些因素通常会引入噪声并降低渲染质量。为了解决这些挑战,我们提出了SkyNet,这是一种模型,通过基于课程的训练策略增强了在结合卫星图像时的跨视图一致性,以逐步引入更多的卫星视图。SkyNet显著增强了多视图对齐,并在绝对性能方面在RRA@5上超越了现有方法9.6%,在RTA@5上超越了18.1%。Sky2Ground和SkyNet共同建立了一个全面的测试平台和基准,以推动大规模、多高度的3D感知和可推广的相机定位。代码和模型将公开发布以供未来研究使用。
cs.CV / 92 / 2603.13741

Ego-1K -- A Large-Scale Multiview Video Dataset for Egocentric Vision

Ego-1K - 大规模多视角视频数据集用于自我中心视觉
Lee, Jae Yong, Scharstein, Daniel, Bapat, Akash, Hu, Hao, Fu, Andrew, Zhao, Haoru, Sammut, Paul, Li, Xiang, Jeapes, Stephen, Gupta, Anik, David, Lior, Madhuvarasu, Saketh, Joshi, Jay Girish, Wither, Jason
Abstract
We present Ego-1K, a large-scale collection of time-synchronized egocentric multiview videos designed to advance neural 3D video synthesis and dynamic scene understanding. The dataset contains nearly 1,000 short egocentric videos captured with a custom rig with 12 synchronized cameras surrounding a 4-camera VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods, an important research area as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to large disparities and image motion caused by close dynamic objects and rig egomotion. Our dataset supports future research in this challenging domain. It is available at https://huggingface.co/datasets/facebook/ego-1k.
Chinese Translation
我们提出了Ego-1K,这是一个大规模的时间同步自我中心多视角视频集合,旨在推动神经三维视频合成和动态场景理解。该数据集包含近1000段短自我中心视频,这些视频是使用一个定制设备捕获的,该设备配备了12个同步摄像头,围绕用户佩戴的4摄像头虚拟现实头盔进行拍摄。场景内容主要集中在不同环境中的手部动作和手物体交互上。我们描述了设备设计、数据处理和校准过程。我们的数据集为自我中心场景重建方法提供了新的基准测试方式,这是一个重要的研究领域,因为配备多个摄像头的智能眼镜正变得无处不在。我们的实验表明,由于近距离动态物体和设备自我运动造成的大幅差异和图像运动,我们的数据集对现有的三维和四维新视图合成方法提出了独特的挑战。我们的数据集支持在这个具有挑战性的领域进行未来研究。可在 https://huggingface.co/datasets/facebook/ego-1k 获得。
cs.CV / 93 / 2603.13745

Multi-Object Advertisement Creative Generation

多对象广告创意生成
Gao, Jialu, Gupta, Mithun Das, Li, Qun, Kshatriya, Raveena, Wilson, Andrew D., Chang, Keng-hao, Kumaravel, Balasaravanan Thoravi
Abstract
Lifestyle images are photographs that capture environments and objects in everyday settings. In furniture product marketing, advertisers often create lifestyle images containing products to resonate with potential buyers, allowing buyers to visualize how the products fit into their daily lives. While recent advances in Generative Artificial Intelligence (GenAI) have given rise to realistic image content creation, their application in e-commerce advertising is challenging because high-quality ads must authentically represent the products in realistic scenarios. Therefore, manual intervention is usually required for individual generations, making it difficult to scale to larger product catalogs. To understand the challenges faced by advertisers using GenAI to create lifestyle images at scale, we conducted evaluations on ad images generated using state-of-the-art image generation models and identified the major challenges. Based on our findings, we present CreativeAds, a multi-product ad creation system that supports scalable automated generation with customized parameter adjustment for individual generation. To ensure automated high-quality ad generation, CreativeAds introduces a pipeline that consists of three modules to address challenges in product pairing, layout generation, and background generation separately. Furthermore, CreativeAds contains an intuitive user interface to allow users to oversee generation at scale, and it also supports detailed controls on individual generation for user-customized adjustments. We performed a user study on CreativeAds and extensive evaluations of the generated images, demonstrating CreativeAds's ability to create a large number of high-quality images at scale for advertisers without requiring expertise in GenAI tools.
Chinese Translation
生活方式图像是捕捉日常环境和物体的照片。在家具产品营销中,广告商通常会创建包含产品的生活方式图像,以便与潜在买家产生共鸣,使买家能够想象这些产品如何融入他们的日常生活。尽管生成性人工智能(Generative Artificial Intelligence, GenAI)在现实图像内容创作方面取得了近期进展,但其在电子商务广告中的应用仍然面临挑战,因为高质量的广告必须真实地代表产品在现实场景中的表现。因此,通常需要人工干预以进行单个生成,这使得难以扩展到更大的产品目录。为了理解广告商在使用GenAI大规模创建生活方式图像时面临的挑战,我们对使用最先进图像生成模型生成的广告图像进行了评估,并识别出主要挑战。基于我们的发现,我们提出了CreativeAds,一个支持可扩展自动生成的多产品广告创作系统,并为个别生成提供定制参数调整。为了确保自动生成高质量广告,CreativeAds创新了一条管道,由三个模块组成,分别解决产品配对、布局生成和背景生成中的挑战。此外,CreativeAds还包含一个直观的用户界面,允许用户在大规模生成过程中进行监督,并支持对个别生成进行详细控制,以便用户进行定制调整。我们对CreativeAds进行了用户研究和生成图像的广泛评估,展示了CreativeAds在不需要GenAI工具专业知识的情况下,为广告商大规模创建大量高质量图像的能力。
cs.CV / 94 / 2603.13759

QTrack: Query-Driven Reasoning for Multi-modal MOT

QTrack:基于查询驱动的多模态多目标跟踪
Ashraf, Tajamul, Tariq, Tavaheed, Yadav, Sonia, Riyaz, Abrar Ul, Tak, Wasif, Abdar, Moloud, Bashir, Janibul
Abstract
Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
Chinese Translation
多目标跟踪(MOT)传统上专注于估计视频中所有物体的轨迹,而未能在语义指令下对用户指定的目标进行选择性推理。在本研究中,我们提出了一种基于查询驱动的跟踪范式,将跟踪形式化为一个基于自然语言查询的时空推理问题。给定参考帧、视频序列和文本查询,目标是仅定位和跟踪查询中指定的目标,同时保持时间一致性和身份一致性。为了支持这一设置,我们构建了RMOT26,这是一个大规模基准,具有基础查询和序列级划分,以防止身份泄漏并实现对泛化能力的稳健评估。我们进一步提出了QTrack,这是一种端到端的视觉-语言模型,将多模态推理与跟踪导向的定位相结合。此外,我们引入了一种具有结构化奖励的时间感知策略优化策略,以鼓励运动感知推理。大量实验表明,我们的方法在以推理为中心的语言引导跟踪任务中具有有效性。代码和数据可在 https://github.com/gaash-lab/QTrack 获取。
cs.CV / 95 / 2603.13770

PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment

PhysAlign:通过特征和三维表示对齐实现物理一致的图像到视频生成
Xiong, Zhexiao, Song, Yizhi, He, Liu, Xiong, Wei, Yuan, Yu, Qiao, Feng, Jacobs, Nathan
Abstract
Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly-curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational alignment that extracts kinematic priors from video foundation models. Extensive experiments demonstrate that PhysAlign significantly outperforms existing VDMs on tasks requiring complex physical reasoning and temporal stability, without compromising zero-shot visual quality. PhysAlign shows the potential to bridge the gap between raw visual synthesis and rigid-body kinematics, establishing a practical paradigm for genuinely physics-grounded video generation. The project page is available at https://physalign.github.io/PhysAlign.
Chinese Translation
视频扩散模型(VDMs)为模拟动态场景和环境提供了一种有前景的方法,广泛应用于机器人技术和媒体生成。然而,现有模型往往生成时间上不连贯的内容,违反基本的物理直觉,显著限制了其实际应用性。我们提出了PhysAlign,这是一个高效的物理一致的图像到视频(I2V)生成框架,明确解决了这一局限性。为了克服物理标注视频的严重匮乏,我们首先基于刚体模拟构建了一个完全可控的合成数据生成管道,生成了一个高度精细的数据集,包含准确、细致的物理和三维标注。利用这些数据,PhysAlign通过将显式的三维几何约束与基于Gram的时空关系对齐相结合,构建了一个统一的物理潜在空间,从视频基础模型中提取运动学先验。大量实验表明,PhysAlign在需要复杂物理推理和时间稳定性的任务上显著优于现有的VDMs,并且在零样本视觉质量上没有妥协。PhysAlign展现了弥合原始视觉合成与刚体运动学之间差距的潜力,为真正基于物理的视频生成建立了一个实用的范式。项目页面可访问 https://physalign.github.io/PhysAlign。
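One schematic reading of the Gram-based relational alignment: match normalized token-token Gram matrices between the generator's features and the video foundation model's features, so that pairwise spatio-temporal relations, rather than raw activations, agree. Shapes and normalization here are assumptions.

import torch
import torch.nn.functional as F

def gram_alignment_loss(student, teacher):
    """student, teacher: [batch, tokens, dim] spatio-temporal token features."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    gram_s = s @ s.transpose(-2, -1)            # [batch, tokens, tokens]
    gram_t = t @ t.transpose(-2, -1)
    return (gram_s - gram_t).pow(2).mean()      # align pairwise relations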
cs.CV / 96 / 2603.13771

Brain Tumor Classification from 3D MRI Using Persistent Homology and Betti Features: A Topological Data Analysis Approach on BraTS2020

基于持久同调和Betti特征的3D MRI脑肿瘤分类:一种在BraTS2020数据集上的拓扑数据分析方法
Ahmed, Faisal
Abstract
Accurate and interpretable brain tumor classification from medical imaging remains a challenging problem due to the high dimensionality and complex structural patterns present in magnetic resonance imaging (MRI). In this study, we propose a topology-driven framework for brain tumor classification based on Topological Data Analysis (TDA) applied directly to three-dimensional (3D) MRI volumes. Specifically, we analyze 3D Fluid Attenuated Inversion Recovery (FLAIR) images from the BraTS 2020 dataset and extract interpretable topological descriptors using persistent homology. Persistent homology captures intrinsic geometric and structural characteristics of the data through Betti numbers, which describe connected components (Betti-0), loops (Betti-1), and voids (Betti-2). From the 3D MRI volumes, we derive a compact set of 100 topological features that summarize the underlying topology of brain tumor structures. These descriptors represent complex 3D tumor morphology while significantly reducing data dimensionality. Unlike many deep learning approaches that require large-scale training data or complex architectures, the proposed framework relies on computationally efficient topological features extracted directly from the images. These features are used to train classical machine learning classifiers, including Random Forest and XGBoost, for binary classification of high-grade glioma (HGG) and low-grade glioma (LGG). Experimental results on the BraTS 2020 dataset show that the Random Forest classifier combined with selected Betti features achieves an accuracy of 89.19%. These findings highlight the potential of persistent homology as an effective and interpretable approach for analyzing complex 3D medical images and performing brain tumor classification.
Chinese Translation
从医学影像中准确且可解释的脑肿瘤分类仍然是一个具有挑战性的问题,因为磁共振成像(MRI)中存在高维度和复杂的结构模式。在本研究中,我们提出了一种基于拓扑数据分析(TDA)的拓扑驱动框架,用于脑肿瘤分类,直接应用于三维(3D)MRI体积。具体而言,我们分析了来自BraTS 2020数据集的3D流体衰减反转恢复(FLAIR)图像,并使用持久同调提取可解释的拓扑描述符。持久同调通过Betti数捕捉数据的内在几何和结构特征,描述连接组件(Betti-0)、环(Betti-1)和空洞(Betti-2)。从3D MRI体积中,我们推导出一组紧凑的100个拓扑特征,以总结脑肿瘤结构的基本拓扑。这些描述符代表复杂的3D肿瘤形态,同时显著降低数据维度。与许多需要大规模训练数据或复杂架构的深度学习方法不同,所提出的框架依赖于直接从图像中提取的计算效率高的拓扑特征。这些特征用于训练经典的机器学习分类器,包括随机森林(Random Forest)和XGBoost,用于高等级胶质瘤(HGG)和低等级胶质瘤(LGG)的二分类。在BraTS 2020数据集上的实验结果表明,结合选定Betti特征的随机森林分类器达到了89.19%的准确率。这些发现突显了持久同调作为分析复杂3D医学图像和进行脑肿瘤分类的有效且可解释的方法的潜力。
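The feature-extraction stage has a direct open-source analogue; a compact sketch using gudhi's cubical complexes on a 3D volume, where the per-dimension bar counts and lifetime statistics stand in for the paper's exact 100-feature descriptor.

import numpy as np
import gudhi

def betti_features(volume, dims=(0, 1, 2)):
    """Sublevel-set persistent homology of a 3D image, summarized into
    simple per-dimension statistics of the persistence barcodes."""
    cc = gudhi.CubicalComplex(top_dimensional_cells=volume)
    cc.persistence()                            # computes all intervals
    feats = []
    for d in dims:
        bars = cc.persistence_intervals_in_dimension(d)
        life = (bars[:, 1] - bars[:, 0]) if len(bars) else np.empty(0)
        life = life[np.isfinite(life)]          # drop essential (infinite) bars
        if life.size == 0:
            life = np.zeros(1)
        feats += [float(len(bars)), life.sum(), life.mean(), life.max()]
    return np.asarray(feats)

# each FLAIR scan -> one feature vector; stack them and fit, e.g.,
# sklearn.ensemble.RandomForestClassifier for the HGG/LGG label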
cs.CV / 97 / 2603.13779

AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

AD-Copilot:一种通过视觉上下文比较进行工业异常检测的视觉-语言助手
Jiang, Xi, Guo, Yue, Li, Jian, Liu, Yong, Gao, Bin-Bin, Deng, Hanqiu, Liu, Jun, Zhao, Heng, Wang, Chengjie, Zheng, Feng
Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.
Chinese Translation
多模态大型语言模型(MLLMs)在自然视觉理解方面取得了令人瞩目的成功,但在工业异常检测(IAD)中表现不佳。这是因为MLLMs主要在通用网络数据上训练,与工业图像存在显著差异。此外,它们独立编码每张图像,只能在语言空间中比较图像,导致对IAD中关键的微妙视觉差异不敏感。为了解决这些问题,我们提出了AD-Copilot,一种通过视觉上下文比较专门针对IAD的交互式MLLM。我们首先设计了一种新颖的数据整理管道,从稀疏标注的工业图像中挖掘检查知识,并生成用于图像说明、视觉问答(VQA)和缺陷定位的精确样本,形成一个丰富语义信号的大规模多模态数据集Chat-AD,适用于IAD。在此基础上,AD-Copilot结合了一种新颖的比较编码器,通过配对图像特征之间的交叉注意力来增强多图像的细粒度感知,并采用多阶段策略进行训练,结合领域知识并逐步提升IAD技能。此外,我们引入了MMAD-BBox,一个基于边界框评估的异常定位扩展基准。实验表明,AD-Copilot在MMAD基准上达到了82.3%的准确率,超越了所有其他模型且没有任何数据泄露。在MMAD-BBox测试中,其相较于基线实现了最高$3.35\times$的提升。AD-Copilot在其他专业和通用基准上也展现了其性能提升的良好泛化能力。值得注意的是,AD-Copilot在多个IAD任务中超越了人类专家的表现,展示了其作为现实工业检查可靠助手的潜力。所有数据集和模型将会发布,以便更广泛地造福社区。
cs.CV / 98 / 2603.13783

RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting

RetimeGS:4D高斯溅射的连续时间重建
Wang, Xuezhen, Ma, Li, Shen, Yulin, Wang, Zeyu, Sander, Pedro V.
Abstract
Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow-guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.
Chinese Translation
时间重定时,即在任意时间戳重建和渲染动态场景的能力,对于慢动作播放、时间编辑和后期制作等应用至关重要。然而,大多数现有的4D高斯溅射(4DGS)方法在离散帧索引上过拟合,但在表示连续时间帧时却面临困难,导致在时间戳之间插值时出现鬼影伪影。我们将这一限制视为一种时间混叠现象,并提出了RetimeGS,这是一种简单而有效的4DGS表示,明确定义了3D高斯的时间行为,并减轻了时间混叠。为了实现平滑且一致的插值,我们结合了光流引导的初始化和监督、三重渲染监督以及其他针对性的策略。这些组件共同使得即使在大运动下也能实现无鬼影、时间一致的渲染。在包含快速运动、非刚性变形和严重遮挡的数据集上的实验表明,RetimeGS在质量和一致性方面优于最先进的方法。
cs.CV / 99 / 2603.13787

Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective

从系统生物学视角推进癌症预后:基因组、蛋白质组和病理影像数据的层次融合
Zhou, Junjie, Xue, Bao, Wang, Meiling, Shao, Wei, Zhang, Daoqiang
Abstract
To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.
Chinese Translation
为了提高癌症预后的精确性,近期研究越来越关注通过整合基因组数据和组织学图像的多模态生存方法。然而,当前的方法忽视了蛋白质组作为连接基因组变化和组织病理特征的中介层,同时提供了对生存预测至关重要的补充生物信息。这一生物现实暴露了另一个架构限制:现有的整合分析研究以平面方式融合这些异质数据源,未能捕捉其固有的生物层次结构。为了解决这些局限性,我们提出了HFGPI(层次融合框架),从系统生物学的角度建模基因到蛋白质再到组织学图像的生物学进程。具体而言,我们引入了分子标记器(Molecular Tokenizer),这是一种分子编码策略,将身份嵌入与表达谱结合,以构建具有生物学信息的基因和蛋白质表示。接着,我们开发了基因调控蛋白融合(Gene-Regulated Protein Fusion, GRPF),该方法采用图感知交叉注意力与结构保持对齐,明确建模基因-蛋白质调控关系并生成基因调控蛋白表示。此外,我们提出了蛋白质引导超图学习(Protein-Guided Hypergraph Learning, PGHL),该方法建立蛋白质与图像块之间的关联,利用超图卷积捕捉更高阶的蛋白质-形态关系。最终特征在层次层中逐步融合,以实现精确的生存结果预测。在五个基准数据集上的广泛实验表明,HFGPI优于现有的最先进方法。
cs.CV / 100 / 2603.13800

Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space

超越医学诊断:医学多模态大语言模型如何在空间中思考
Trinh, Quoc-Huy, Ding, Xi, Liu, Yang, Qin, Zhenyue, Li, Xingjian, Durak, Gorkem, Aktas, Halil Ertugrul, Keles, Elif, Bagci, Ulas, Xu, Min
Abstract
Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.
Chinese Translation
视觉空间智能对于医学图像解读至关重要,但在三维成像的多模态大语言模型(MLLMs)中仍然未得到充分探索。这一空白的存在主要是由于系统性缺乏具备超出基本标签的结构化三维空间标注的数据集。在本研究中,我们介绍了一种代理管道,它通过协调体积和距离计算器等计算工具、多代理协作以及专家放射科医师的验证,自主合成空间视觉问答(VQA)数据。我们提出了SpatialMed,这是第一个用于评估医学MLLMs中三维空间智能的综合基准,包含近10,000个跨多个器官和肿瘤类型的问题-答案对。我们对14种最先进的MLLMs进行评估和广泛分析,结果显示当前模型在医学成像方面缺乏强大的空间推理能力。
cs.CV / 101 / 2603.13803

ALTIS: Automated Loss Triage and Impact Scoring from Sentinel-1 SAR for Property-Level Flood Damage Assessment

ALTIS:基于Sentinel-1 SAR的自动化损失分类与影响评分用于财产级别洪水损害评估
Vinaykumar, Amogh, Kamasani, Prem
Abstract
Floods are among the costliest natural catastrophes globally, yet the property and casualty insurance industry's post-event response remains heavily reliant on manual field inspection: slow, expensive, and geographically constrained. Satellite Synthetic Aperture Radar (SAR) offers cloud-penetrating, all-weather imaging uniquely suited to rapid post-flood assessment, but existing research evaluates SAR flood detection against academic benchmarks such as IoU and F1-score that do not capture insurance-workflow requirements. We present ALTIS: a five-stage pipeline transforming raw Sentinel-1 GRD and SLC imagery into property-level impact scores within 24-48 hours of flood peak. Unlike prior approaches producing pixel-level maps or binary outputs, ALTIS delivers a ranked, confidence-scored triage list consumable by claims platforms, integrating (i) multi-temporal SAR change detection using dual-polarization VV/VH intensity and InSAR coherence, (ii) physics-informed depth estimation fusing flood extent with high-resolution DEMs, (iii) property-level zonal statistics from parcel footprints, (iv) depth-damage calibration against NFIP claims, and (v) confidence-scored triage ranking. We formally define Insurance-Grade Flood Triage (IGFT) and introduce the Inspection Reduction Rate (IRR) and Triage Efficiency Score (TES). Using Hurricane Harvey (2017) across Harris County, Texas, we present preliminary analysis grounded in validated sub-components suggesting ALTIS is designed to achieve an IRR of approximately 0.52 at 90% recall of high-severity claims, potentially eliminating over half of unnecessary dispatches. By blending SAR flood intelligence with the realities of claims management, ALTIS establishes a methodological baseline for translating earth observation research into measurable insurance outcomes.
Chinese Translation
洪水是全球成本最高的自然灾害之一,但财产和伤亡保险行业在事件后的响应仍然严重依赖于人工现场检查,这一过程缓慢、昂贵且受地理限制。卫星合成孔径雷达(SAR)提供了穿透云层、适用于各种天气条件的成像,特别适合快速的洪水后评估,但现有研究在评估SAR洪水检测时使用的学术基准如IoU和F1-score并未考虑保险工作流程的需求。我们提出了ALTIS:一个五阶段的流程,将原始的Sentinel-1 GRD和SLC影像转化为洪水峰值后24-48小时内的财产级别影响评分。与之前生成像素级地图或二元输出的方法不同,ALTIS提供了一个经过排名和置信度评分的分类列表,能够被理赔平台使用,整合了(i)使用双极化VV/VH强度和InSAR相干性进行的多时相SAR变化检测,(ii)融合洪水范围与高分辨率数字高程模型(DEM)的物理信息深度估计,(iii)来自地块轮廓的财产级区域统计,(iv)针对国家洪水保险计划(NFIP)索赔的深度-损害校准,以及(v)经过置信度评分的分类排名。我们正式定义了保险级洪水分类(IGFT),并引入了检查减少率(IRR)和分类效率评分(TES)。以2017年哈维飓风为例,我们在德克萨斯州哈里斯县进行的初步分析基于经过验证的子组件,表明ALTIS旨在实现约0.52的IRR,且在高严重性索赔的召回率达到90%的情况下,可能消除超过一半的不必要派遣。通过将SAR洪水情报与理赔管理的现实相结合,ALTIS为将地球观测研究转化为可衡量的保险结果建立了方法学基础。
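Stage (ii)'s depth estimate can be approximated as water-surface elevation minus ground elevation inside the SAR-derived extent. The boundary-based water-surface estimate below is a crude placeholder for the pipeline's actual physics-informed fusion; dem and flood_mask are assumed to be a float elevation grid (meters) and a boolean extent grid.

import numpy as np
from scipy.ndimage import binary_dilation

def boundary_water_surface(dem, flood_mask):
    """Crude water-surface elevation: mean terrain height along the flood
    boundary (dry pixels adjacent to flooded ones)."""
    boundary = binary_dilation(flood_mask) & ~flood_mask
    return float(dem[boundary].mean())

def flood_depth(dem, flood_mask):
    """Depth = water surface minus ground, clipped at zero, inside the extent."""
    wse = boundary_water_surface(dem, flood_mask)
    return np.where(flood_mask, np.clip(wse - dem, 0.0, None), 0.0)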
cs.CV / 102 / 2603.13831

Efficient Semi-Automated Material Microstructure Analysis Using Deep Learning: A Case Study in Additive Manufacturing

基于深度学习的高效半自动化材料微观结构分析:增材制造案例研究
Navaratna, Sanjeev S., Thawari, Nikhil, Mari, Gunashekhar, P, Amritha V, Amirthalingam, Murugaiyan, Batra, Rohit
Abstract
Image segmentation is fundamental to microstructural analysis for defect identification and structure-property correlation, yet remains challenging due to pronounced heterogeneity in materials images arising from varied processing and testing conditions. Conventional image processing techniques often fail to capture such complex features, rendering them ineffective for large-scale analysis. Even deep learning approaches struggle to generalize across heterogeneous datasets due to the scarcity of high-quality labeled data. Consequently, segmentation workflows often rely on manual, expert-driven annotations, which are labor intensive and difficult to scale. Using an additive manufacturing (AM) dataset as a case study, we present a semi-automated active learning based segmentation pipeline that integrates a U-Net based convolutional neural network with an interactive user annotation and correction interface and a representative core-set image selection strategy. The active learning workflow iteratively updates the model by incorporating user-corrected segmentations into the training pool, while the core-set strategy identifies representative images for annotation. Three subset selection strategies were evaluated over six refinement rounds: manual selection, uncertainty-driven sampling, and the proposed maximin Latin hypercube sampling from embeddings (SMILE) method. The SMILE strategy consistently outperformed other approaches, improving the macro F1 score from 0.74 to 0.93 while reducing manual annotation time by about 65 percent. The segmented defect regions were further analyzed using a coupled classification model to categorize defects based on microstructural characteristics and map them to corresponding AM process parameters. The proposed framework reduces labeling effort while maintaining scalability and robustness and is broadly applicable to image-based analysis across diverse materials systems.
Chinese Translation
图像分割是微观结构分析中识别缺陷和结构-性能关联的基础,但由于材料图像在不同加工和测试条件下表现出的显著异质性,这一过程仍然具有挑战性。传统的图像处理技术往往无法捕捉到如此复杂的特征,使其在大规模分析中效果不佳。即使是深度学习方法,由于高质量标注数据的稀缺,也难以在异质数据集上进行泛化。因此,分割工作流程通常依赖于人工专家驱动的标注,这既费时又难以扩展。以增材制造(Additive Manufacturing, AM)数据集为案例,我们提出了一种基于半自动化主动学习的分割管道,该管道将基于 U-Net 的卷积神经网络与交互式用户标注和修正界面以及代表性核心集图像选择策略相结合。主动学习工作流程通过将用户修正的分割结果纳入训练池,迭代更新模型,而核心集策略则识别用于标注的代表性图像。我们评估了三种子集选择策略:手动选择、不确定性驱动采样和提出的最大最小拉丁超立方体采样(SMILE)方法,进行了六轮精炼。SMILE 策略始终优于其他方法,将宏观 F1 分数从 0.74 提高到 0.93,同时减少了约 65% 的手动标注时间。随后,使用耦合分类模型对分割出的缺陷区域进行进一步分析,以根据微观结构特征对缺陷进行分类,并将其映射到相应的 AM 过程参数。所提出的框架在保持可扩展性和鲁棒性的同时减少了标注工作量,广泛适用于各种材料系统的基于图像的分析。
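The "maximin" part of SMILE can be read as greedy farthest-point (k-center) selection over image embeddings, which maximizes the minimum distance within the selected core-set; the Latin-hypercube component is not reproduced in this sketch.

import numpy as np

def maximin_select(embeddings, k, seed=0):
    """Greedy k-center selection: each new pick is the point farthest from
    the already-selected set, spreading the core-set over the embedding space."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(embeddings)))]
    d = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())                   # farthest from current set
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen                               # indices of images to annotate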
cs.CV / 103 / 2603.13843

MOGeo: Beyond One-to-One Cross-View Object Geo-localization

MOGeo:超越一对一视角间物体地理定位
Lv, Bo, Zhang, Qingwang, Wu, Le, Li, Yuanyuan, Zhu, Yingying
Abstract
Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements in real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance the CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross-view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods. Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view object geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area.
Chinese Translation
视角间物体地理定位(CVOGL)旨在在查询图像中定位感兴趣的物体,并在相应的卫星图像中找到其位置。现有方法通常假设查询图像仅包含单个物体,这与现实应用中复杂的多物体地理定位需求不符,使其在实际场景中不适用。为了弥合现实环境与现有任务之间的差距,我们提出了一项新任务,称为视角间多物体地理定位(CVMOGL)。为推进CVMOGL任务,我们首先构建了基准数据集CMLocation,其中包括两个数据集:CMLocation-V1和CMLocation-V2。此外,我们提出了一种新颖的视角间多物体地理定位方法MOGeo,并将其与现有的最先进方法进行基准比较。我们在各种应用场景下进行了大量实验,以验证我们方法的有效性。结果表明,在更具现实性的设置中,视角间物体地理定位仍然是一个具有挑战性的问题,鼓励在该领域进行进一步研究。
cs.CV / 104 / 2603.13855

VFM-Loc: Zero-Shot Cross-View Geo-Localization via Aligning Discriminative Visual Hierarchies

VFM-Loc:通过对齐判别性视觉层次实现零样本跨视角地理定位
Lu, Jun, Sang, Zehao, Wei, Haoqi, Liu, Xiangyun, Zhu, Kun, Guo, Haitao, Gong, Zhihui, Ding, Lei
Abstract
Cross-View Geo-Localization (CVGL) in remote sensing aims to locate a drone-view query by matching it to geo-tagged satellite images. Although supervised methods have achieved strong results on closed-set benchmarks, they often fail to generalize to unconstrained, real-world scenarios due to severe viewpoint differences and dataset bias. To overcome these limitations, we present VFM-Loc, a training-free framework for zero-shot CVGL that leverages the generalizable visual representations of vision foundation models (VFMs). VFM-Loc identifies and matches discriminative visual clues across different viewpoints through a progressive alignment strategy. First, we design a hierarchical clue extraction mechanism using Generalized Mean pooling and Scale-Weighted RMAC to preserve distinctive visual clues across scales while maintaining hierarchical confidence. Second, we introduce a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, linearly aligning heterogeneous feature distributions in a shared metric space. Experiments demonstrate that VFM-Loc exhibits strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the challenging LO-UCV dataset with large oblique angles. This work highlights that principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust and training-free paradigm for real-world CVGL. The relevant code is made available at: https://github.com/DingLei14/VFM-Loc.
Chinese Translation
遥感中的跨视角地理定位(CVGL)旨在通过将无人机视角查询与地理标记的卫星图像进行匹配来定位查询对象。尽管监督方法在闭集基准测试中取得了良好的结果,但由于严重的视角差异和数据集偏差,它们往往无法在不受约束的真实场景中进行有效泛化。为克服这些局限性,我们提出了VFM-Loc,这是一个无训练的零样本CVGL框架,利用视觉基础模型(VFMs)中的可泛化视觉表征。VFM-Loc通过渐进对齐策略识别并匹配不同视角下的判别性视觉线索。首先,我们设计了一种层次线索提取机制,使用广义均值池化和尺度加权的RMAC,以在保持层次置信度的同时保留跨尺度的独特视觉线索。其次,我们引入了一种基于领域主成分分析(PCA)和正交Procrustes分析的统计流形对齐管道,在共享度量空间中线性对齐异构特征分布。实验表明,VFM-Loc在标准基准测试中表现出强大的零样本准确性,并在具有大斜角的挑战性LO-UCV数据集上,其Recall@1超过监督方法20%以上。该工作强调了预训练特征的原则性对齐可以有效弥合跨视角差距,为真实世界的CVGL建立了一个稳健且无训练的范式。相关代码可在以下链接获取:https://github.com/DingLei14/VFM-Loc。
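Two of the named building blocks have compact textbook forms: Generalized Mean pooling and Orthogonal Procrustes alignment. The sketch below assumes row-paired anchor descriptors (as NumPy arrays) for fitting the rotation and omits the domain-wise PCA stage, so it is only a schematic of the paper's pipeline.

import torch
from scipy.linalg import orthogonal_procrustes

def gem_pool(feat_map, p=3.0, eps=1e-6):
    """Generalized Mean pooling: (spatial mean of x^p)^(1/p).
    feat_map: [batch, channels, h, w] backbone features."""
    return feat_map.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)

def procrustes_align(src_desc, dst_desc):
    """Orthogonal map fitted on paired drone/satellite descriptors (one row
    per sample), then applied to new source-domain query descriptors."""
    R, _ = orthogonal_procrustes(src_desc, dst_desc)
    return src_desc @ R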
cs.CV / 105 / 2603.13858

Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

通过创造学习:一种无哈希的即时类别发现框架
Zhang, Bohan, Tang, Weidong, Chi, Zhixiang, Jin, Yi, Li, Zhenbo, Wang, Yang, Wu, Yanan
Abstract
On-the-Fly Category Discovery (OCD) aims to recognize known classes while simultaneously discovering emerging novel categories during inference, using supervision only from known classes during offline training. Existing approaches rely either on fixed label supervision or on diffusion-based augmentations to enhance the backbone, yet none of them explicitly train the model to perform the discovery task required at test time. It is fundamentally unreasonable to expect a model optimized on limited labeled data to carry out a qualitatively different discovery objective during inference. This mismatch creates a clear optimization misalignment between the offline learning stage and the online discovery stage. In addition, prior methods often depend on hash-based encodings or severe feature compression, which further limits representational capacity. To address these issues, we propose Learning through Creation (LTC), a fully feature-based and hash-free framework that injects novel-category awareness directly into offline learning. At its core is a lightweight, online pseudo-unknown generator driven by kernel-energy minimization and entropy maximization (MKEE). Unlike previous methods that generate synthetic samples once before training, our generator evolves jointly with the model dynamics and synthesizes pseudo-novel instances on the fly at negligible cost. These samples are incorporated through a dual max-margin objective with adaptive thresholding, strengthening the model's ability to delineate and detect unknown regions through explicit creation. Extensive experiments across seven benchmarks show that LTC consistently outperforms prior work, achieving improvements ranging from 1.5 percent to 13.1 percent in all-class accuracy. The code is available at https://github.com/brandinzhang/LTC
Chinese Translation
即时类别发现(On-the-Fly Category Discovery, OCD)旨在在推理过程中识别已知类别的同时发现新兴的未知类别,仅在离线训练中使用已知类别的监督。现有方法要么依赖于固定标签的监督,要么依赖于基于扩散的增强来增强主干网络,但没有任何方法明确训练模型以执行测试时所需的发现任务。期望在有限标记数据上优化的模型在推理过程中执行质的不同的发现目标是根本不合理的。这种不匹配在离线学习阶段和在线发现阶段之间造成了明显的优化不一致。此外,先前的方法通常依赖于基于哈希的编码或严重的特征压缩,这进一步限制了表示能力。为了解决这些问题,我们提出了通过创造学习(Learning through Creation, LTC),这是一种完全基于特征且无哈希的框架,直接将新类别意识注入到离线学习中。其核心是一个轻量级的在线伪未知生成器,驱动于核能量最小化和熵最大化(MKEE)。与之前的方法在训练前一次性生成合成样本不同,我们的生成器与模型动态共同演变,并在几乎没有成本的情况下即时合成伪新实例。这些样本通过具有自适应阈值的双最大边际目标被纳入,从而增强模型通过显式创造来划分和检测未知区域的能力。在七个基准测试中的广泛实验表明,LTC始终优于先前的工作,在全类别准确率上实现了1.5%到13.1%的提升。代码可在 https://github.com/brandinzhang/LTC 获取。
cs.CV / 106 / 2603.13859

Geo-ID: Test-Time Geometric Consensus for Cross-View Consistent Intrinsics

Geo-ID:测试时几何一致性用于跨视图一致的内参
Dirik, Alara, Zafeiriou, Stefanos
Abstract
Intrinsic image decomposition aims to estimate physically based rendering (PBR) parameters such as albedo, roughness, and metallicity from images. While recent methods achieve strong single-view predictions, applying them independently to multiple views of the same scene often yields inconsistent estimates, limiting their use in downstream applications such as editable neural scenes and 3D reconstruction. Video-based models can improve cross-frame consistency but require dense, ordered sequences and substantial compute, limiting their applicability to sparse, unordered image collections. We propose Geo-ID, a novel test-time framework that repurposes pretrained single-view intrinsic predictors to produce cross-view consistent decompositions by coupling independent per-view predictions through sparse geometric correspondences that form uncertainty-aware consensus targets. Geo-ID is model-agnostic, requires no retraining or inverse rendering, and applies directly to off-the-shelf intrinsic predictors. Experiments on synthetic benchmarks and real-world scenes demonstrate substantial improvements in cross-view intrinsic consistency as the number of views increases, while maintaining comparable single-view decomposition performance. We further show that the resulting consistent intrinsics enable coherent appearance editing and relighting in downstream neural scene representations.
Chinese Translation
内在图像分解旨在从图像中估计基于物理的渲染(PBR)参数,如反射率、粗糙度和金属度。尽管最近的方法在单视图预测中取得了良好的效果,但将它们独立应用于同一场景的多个视图往往会产生不一致的估计,从而限制了它们在可编辑神经场景和三维重建等下游应用中的使用。基于视频的模型可以提高跨帧一致性,但需要密集、有序的序列和大量计算,限制了它们在稀疏、无序图像集合中的适用性。我们提出了Geo-ID,一种新颖的测试时框架,通过稀疏几何对应关系将独立的每视图预测耦合,形成不确定性感知的一致性目标,从而重新利用预训练的单视图内参预测器,生成跨视图一致的分解。Geo-ID是模型无关的,无需重新训练或逆向渲染,并可直接应用于现成的内参预测器。在合成基准和真实场景上的实验表明,随着视图数量的增加,跨视图内参一致性有了显著改善,同时保持了可比的单视图分解性能。我们进一步展示了所得到的一致性内参能够在下游神经场景表示中实现连贯的外观编辑和重光照。
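The consensus step can be sketched as inverse-variance weighting of per-view predictions at geometrically corresponding pixels; extracting the sparse correspondences and per-view uncertainties is assumed to have happened upstream.

import numpy as np

def consensus_target(predictions, variances):
    """Uncertainty-aware consensus for one corresponding 3D point.
    predictions: [views, channels] per-view intrinsics (e.g., albedo);
    variances:   [views] per-view uncertainty estimates."""
    w = 1.0 / (np.asarray(variances, dtype=float) + 1e-8)
    w /= w.sum()
    return (w[:, None] * np.asarray(predictions, dtype=float)).sum(axis=0)

# each view's prediction is then nudged toward these consensus targets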
cs.CV / 107 / 2603.13874

Zero-Forgetting CISS via Dual-Phase Cognitive Cascades

通过双阶段认知级联实现零遗忘的类增量语义分割
Lu, Yuquan, Guo, Yifu, Xu, Zishan, Zhang, Siyu, Huo, Yu, Chen, Siyue, Wu, Siyan, Zhu, Chenghua, Wang, Ruixuan
Abstract
Continual semantic segmentation (CSS) is a cornerstone task in computer vision that enables a large number of downstream applications, but faces the catastrophic forgetting challenge. In conventional class-incremental semantic segmentation (CISS) frameworks using Softmax-based classification heads, performance degradation originates from catastrophic forgetting and the task affiliation probability. We formulate these problems and provide a theoretical analysis to more deeply understand the limitations of existing CISS methods, particularly Strict Parameter Isolation (SPI). To address these challenges, we follow a dual-phase intuition from human annotators and introduce Cognitive Cascade Segmentation (CogCaS), a novel dual-phase cascade formulation for CSS tasks in the CISS setting. By decoupling the task into class-existence detection and class-specific segmentation, CogCaS enables more effective continual learning, preserving previously learned knowledge while incorporating new classes. Using two benchmark datasets, PASCAL VOC 2012 and ADE20K, we show significant improvements in a variety of challenging scenarios, particularly those with long sequences of incremental tasks, when compared to existing state-of-the-art methods. Our code will be made publicly available upon paper acceptance.
Chinese Translation
持续语义分割(CSS)是计算机视觉中的一个基础任务,能够支持大量下游应用,但面临灾难性遗忘的挑战。在传统的基于Softmax分类头的类增量语义分割(CISS)框架中,性能退化源于灾难性遗忘和任务归属概率。我们对这些问题进行了公式化,并提供了理论分析,以更深入地理解现有CISS方法的局限性,特别是严格参数隔离(SPI)。为了解决这些挑战,我们借鉴了人类标注者的双阶段直觉,提出了认知级联分割(CogCaS),这是一种针对CISS设置中CSS任务的新颖双阶段级联公式。通过将任务解耦为类存在检测和类特定分割,CogCaS实现了更有效的持续学习,既保留了先前学习的知识,又纳入了新类别。使用两个基准数据集PASCAL VOC 2012和ADE20K,我们在多种具有挑战性的场景中,特别是那些具有长序列增量任务的场景中,与现有的最先进方法相比,显示了显著的改进。我们的代码将在论文接受后公开发布。
cs.CV / 108 / 2603.13878

Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering

Step-CoT:用于医学视觉问答的逐步视觉推理链
Fan, Lin, Ou, Yafei, Deng, Zhipeng, Dai, Pengyu, Hou, Chongxian, Yan, Jiale, Li, Yaqian, Long, Kaiwen, Gong, Xun, Ikebe, Masayuki, Zheng, Yefeng
Abstract
Chain-of-thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free-form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi-step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step-CoT, a large-scale medical reasoning dataset with expert-curated, structured multi-step CoT aligned to clinical diagnostic workflows, implicitly grounding the model's reasoning in radiographic evidence. Step-CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step-CoT, we further introduce a teacher-student framework with a dynamic graph-structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step-CoT can improve reasoning accuracy and interpretability. Benchmark: github.com/hahaha111111/Step-CoT. Dataset Card: huggingface.co/datasets/fl-15o/Step-CoT
Chinese Translation
思维链(CoT)推理已经推动了医学视觉问答(VQA)的进步,但现有的大部分CoT推理都是自由形式的,未能捕捉临床医生所遵循的结构化推理过程。本研究提出了一个问题:可追溯的多步骤推理监督是否能提高医学VQA的推理准确性和可解释性?为此,我们引入了Step-CoT,这是一个大规模的医学推理数据集,包含经过专家挑选、符合临床诊断工作流程的结构化多步骤CoT,隐式地将模型的推理基础建立在放射学证据之上。Step-CoT包含超过10,000个真实临床案例和70,000个围绕诊断工作流程组织的VQA对,提供了监督的中间步骤,引导模型遵循有效的推理轨迹。为了有效地从Step-CoT中学习,我们进一步引入了一个教师-学生框架,配备动态图结构聚焦机制,优先考虑具有诊断信息的步骤,同时过滤掉相关性较低的上下文。我们的实验表明,使用Step-CoT可以提高推理准确性和可解释性。基准:github.com/hahaha111111/Step-CoT。数据集卡片:huggingface.co/datasets/fl-15o/Step-CoT
cs.CV / 109 / 2603.13879

Dual-Strategy Improvement of YOLOv11n for Multi-Scale Object Detection in Remote Sensing Images

针对遥感图像中多尺度目标检测的YOLOv11n双策略改进
Zhu, Shuaiyu, Ablameyko, Sergey
Abstract
Satellite remote sensing images pose significant challenges for object detection due to their high resolution, complex scenes, and large variations in target scales. To address the insufficient detection accuracy of the YOLOv11n model in remote sensing imagery, this paper proposes two improvement strategies. Method 1: (a) a Large Separable Kernel Attention (LSKA) mechanism is introduced into the backbone network to enhance feature extraction for small objects; (b) a Gold-YOLO structure is incorporated into the neck network to achieve multi-scale feature fusion, thereby improving the detection performance of objects at different scales. Method 2: (a) the Gold-YOLO structure is also integrated into the neck network; (b) a MultiSEAMHead detection head is combined to further strengthen the representation and detection capability for small and multi-scale objects. To verify the effectiveness of the proposed improvements, experiments are conducted on the DOTAv1 dataset. The results show that, while maintaining the lightweight advantage of the model, the proposed methods improve detection accuracy (mAP@0.5) by 1.3% and 1.8%, respectively, compared with the baseline YOLOv11n, demonstrating the effectiveness and practical value of the proposed approaches for object detection in remote sensing images.
Chinese Translation
卫星遥感图像由于其高分辨率、复杂场景和目标尺度的大幅变化,给目标检测带来了显著挑战。为了解决YOLOv11n模型在遥感图像中检测精度不足的问题,本文提出了两种改进策略。方法一:(a) 在主干网络中引入大型可分离卷积注意力机制(Large Separable Kernel Attention, LSKA),以增强小目标的特征提取;(b) 在颈部网络中整合金色YOLO结构(Gold-YOLO),实现多尺度特征融合,从而提高不同尺度目标的检测性能。方法二:(a) 同样将金色YOLO结构集成到颈部网络中;(b) 结合多SEAM头(MultiSEAMHead)检测头,进一步增强对小目标和多尺度目标的表示和检测能力。为了验证所提改进的有效性,本文在DOTAv1数据集上进行了实验。结果表明,在保持模型轻量级优势的同时,所提方法与基线YOLOv11n相比,将检测精度(mAP@0.5)分别提高了1.3%和1.8%,展示了所提方法在遥感图像目标检测中的有效性和实际价值。
cs.CV / 110 / 2603.13884

SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis

SCoCCA:通过典型相关分析的多模态稀疏概念分解
Gordon, Ehud, Levi, Meir Yossef, Gilboa, Guy
Abstract
Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model's behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook the cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA, introducing Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend it and propose Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. Our approach generalizes concept-based explanations to multi-modal embeddings and achieves state-of-the-art performance in concept discovery, evidenced by reconstruction and manipulation tasks such as concept ablation.
Chinese Translation
解释视觉-语言模型的内部推理对于在安全关键领域部署人工智能至关重要。基于概念的可解释性通过以语义上有意义的组件表示模型的行为,提供了一种与人类对齐的视角。然而,现有的方法在很大程度上局限于图像,并忽视了跨模态的交互。文本-图像嵌入(如CLIP生成的嵌入)存在模态差距,视觉和文本特征遵循不同的分布,限制了可解释性。典型相关分析(CCA)提供了一种原则性的方法来对齐来自不同分布的特征,但尚未用于多模态概念级分析。我们表明,CCA和InfoNCE的目标密切相关,优化CCA隐式地优化InfoNCE,提供了一种简单的、无训练的机制来增强跨模态对齐,而不影响预训练的InfoNCE目标。基于这一观察,我们将基于概念的可解释性与CCA结合,提出了概念CCA(CoCCA),这是一个对齐跨模态嵌入的框架,同时实现可解释的概念分解。我们进一步扩展了该框架,提出了稀疏概念CCA(SCoCCA),该方法强制稀疏性以生成更解耦和更具区分性的概念,从而促进改进的激活、消融和语义操作。我们的方法将基于概念的解释推广到多模态嵌入,并在概念发现方面实现了最先进的性能,这通过重建和操作任务(如概念消融)得到了验证。
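For readers unfamiliar with CCA, the sketch below computes canonical directions for paired image/text embeddings via the standard whitened cross-covariance SVD; it illustrates only the alignment step, not the concept decomposition or the sparsity constraints of SCoCCA.

```python
import numpy as np

def cca_directions(X, Y, k=10, eps=1e-6):
    """Classical CCA: X (N, Dx) image embeddings, Y (N, Dy) paired text
    embeddings. Returns Wx (Dx, k), Wy (Dy, k) such that corresponding
    columns of X @ Wx and Y @ Wy are maximally correlated."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / len(X) + eps * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / len(Y) + eps * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / len(X)

    def inv_sqrt(C):                      # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt.T[:, :k]
```

Because the canonical correlations are computed after per-modality whitening, maximizing them pulls paired image/text embeddings together, consistent with the abstract's observation that optimizing CCA implicitly optimizes an InfoNCE-style alignment.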
cs.CV / 111 / 2603.13886

Multi-Modal Character Localization and Extraction for Chinese Text Recognition

中文文本识别的多模态字符定位与提取
Li, Qilong, Zhang, Chongsheng
Abstract
Scene text recognition (STR) methods have demonstrated their excellent capability in English text images. However, due to the complex inner structures of Chinese and the extensive character categories, it poses challenges for recognizing Chinese text in images. Recently, studies have shown that the methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply the model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character's position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER. Code is available at https://github.com/Pandarenlql/LER.
Chinese Translation
场景文本识别(STR)方法在英语文本图像中展现了其卓越的能力。然而,由于汉字复杂的内部结构和广泛的字符类别,识别图像中的中文文本面临挑战。最近的研究表明,针对英语文本识别设计的方法在识别中文文本图像时遇到了准确性瓶颈。这引发了一个问题:将为英语开发的模型应用于中文STR任务是否合适?为了解决这个问题,我们提出了一种新方法,命名为 LER,它明确地解耦每个字符,并在考虑汉字复杂内部结构的同时独立识别字符。LER 由三个模块组成:定位、提取和识别。首先,定位模块利用多模态信息精确确定字符的位置。然后,提取模块并行地分离所有字符。最后,识别模块考虑汉字独特的内部结构,以提供文本预测结果。在大规模中文基准上进行的广泛实验表明,我们的方法显著优于现有方法。此外,在六个英语基准和 Union14M 基准上进行的广泛实验显示,LER 在英语文本识别中也取得了令人印象深刻的结果。代码可在 https://github.com/Pandarenlql/LER 获取。
cs.CV / 112 / 2603.13901

CT-Conditioned Diffusion Prior with Physics-Constrained Sampling for PET Super-Resolution

基于物理约束采样的CT条件扩散先验用于PET超分辨率
Yang, Liutao, Wang, Zi, Jing, Peiyuan, Wang, Xiaowen, Montoya-Zegarra, Javier A., Shi, Kuangyu, Zhang, Daoqiang, Yang, Guang
Abstract
PET super-resolution is highly under-constrained because paired multi-resolution scans from the same subject are rarely available, and effective resolution is determined by scanner-specific physics (e.g., PSF, detector geometry, and acquisition settings). This limits supervised end-to-end training and makes purely image-domain generative restoration prone to hallucinated structures when anatomical and physical constraints are weak. We formulate PET super-resolution as posterior inference under heterogeneous system configurations and propose a CT-conditioned diffusion framework with physics-constrained sampling. During training, a conditional diffusion prior is learned from high-quality PET/CT pairs using cross-attention for anatomical guidance, without requiring paired LR--HR PET data. During inference, measurement consistency is enforced through a scanner-aware forward model with explicit PSF effects and gradient-based data-consistency refinement. Under both standard and OOD settings, the proposed method consistently improves experimental metrics and lesion-level clinical relevance indicators over strong baselines, while reducing hallucination artifacts and improving structural fidelity.
Chinese Translation
PET超分辨率是一个高度欠约束的问题,因为来自同一受试者的配对多分辨率扫描很少可用,且有效分辨率由扫描仪特定的物理特性(例如,点扩散函数(PSF)、探测器几何形状和采集设置)决定。这限制了监督式端到端训练,并使得在解剖和物理约束较弱时,纯图像域生成恢复容易产生幻觉结构。我们将PET超分辨率形式化为在异构系统配置下的后验推断,并提出了一种基于CT条件的扩散框架,结合物理约束采样。在训练过程中,从高质量的PET/CT配对中学习条件扩散先验,利用交叉注意力进行解剖指导,而无需配对的低分辨率-高分辨率(LR-HR)PET数据。在推断过程中,通过具有显式PSF效应的扫描仪感知前向模型强制测量一致性,并进行基于梯度的数据一致性精炼。在标准和超出分布(OOD)设置下,所提出的方法在实验指标和病灶级临床相关性指标上始终优于强基线,同时减少幻觉伪影并提高结构保真度。
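The gradient-based data-consistency refinement mentioned above can be sketched as a gradient step on a differentiable forward model; the blur-then-downsample operator below is a generic stand-in for the paper's scanner-aware model with explicit PSF effects, and the step size is arbitrary.

```python
import torch
import torch.nn.functional as F

def data_consistency_step(x_hat, y_lr, psf, scale, step=0.1):
    """One refinement step pulling the high-res estimate x_hat toward
    agreement with the low-res measurement y_lr.

    x_hat: (1, 1, H, W) current PET estimate; psf: (1, 1, k, k) kernel
    with odd k; scale: integer downsampling factor."""
    x = x_hat.detach().requires_grad_(True)
    blurred = F.conv2d(x, psf, padding=psf.shape[-1] // 2)  # PSF effect
    simulated = F.avg_pool2d(blurred, scale)                # downsample
    loss = F.mse_loss(simulated, y_lr)
    (grad,) = torch.autograd.grad(loss, x)
    return (x - step * grad).detach()
```

Interleaving such steps with the denoising iterations of the CT-conditioned prior is the usual way measurement consistency is enforced in diffusion-based inverse problems.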
cs.CV / 113 / 2603.13904

Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

单个标记中的像素级场景理解:视觉状态需要“什么-在哪里”的组合
Lee, Seokmin, Lee, Yunghee, Pak, Byeonghyun, Woo, Byeongju
Abstract
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.
Chinese Translation
对于在动态环境中操作的机器人代理,从流式视频观察中学习视觉状态表示对于顺序决策至关重要。最近的自监督学习方法在视觉任务中表现出强大的迁移能力,但它们并未明确解决良好视觉状态应编码的内容。我们认为,有效的视觉状态必须通过联合编码场景元素的语义身份及其空间位置来捕捉什么-在哪里,从而能够可靠地检测观察中的微妙动态。为此,我们提出了CroBo,一个基于全局到局部重建目标的视觉状态表示学习框架。在给定一个压缩为紧凑瓶颈标记的参考观察的情况下,CroBo学习利用全局瓶颈标记作为上下文,从稀疏可见线索中重建局部目标裁剪中被重度掩码的图像块。这一学习目标鼓励瓶颈标记编码场景范围内语义实体的细粒度表示,包括它们的身份、空间位置和配置。因此,学习到的视觉状态揭示了场景元素如何随时间移动和互动,支持顺序决策。我们在多样的基于视觉的机器人策略学习基准上评估了CroBo,结果显示其达到了最先进的性能。重建分析和感知直线性实验进一步表明,学习到的表示保留了像素级的场景组成,并编码了观察中的什么-移动-在哪里。
cs.CV / 114 / 2603.13910

Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation

绝对尺度下的场景生成:利用文本的语义和几何指导实现准确且可解释的3D室内场景生成
Ainetter, Stefan, Deixelberger, Thomas, Dominici, Edoardo A., Drescher, Philipp, Vardis, Konstantinos, Steinberger, Markus
Abstract
We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balances coverage and collision avoidance, achieving up to 10x faster sampling compared to exhaustive path exploration. The generated views are fused using 3D Gaussian Splatting, yielding a consistent and fully navigable 3D scene in absolute scale. GuidedSceneGen enables accurate transfer of object poses and semantic labels from layout to reconstruction, and supports progressive scene expansion without re-alignment. Quantitative results and a user study demonstrate greater 3D consistency and layout plausibility compared to recent panoramic text-to-3D baselines.
Chinese Translation
我们提出了GuidedSceneGen,一个文本到3D的生成框架,能够生成度量上准确、全局一致且语义可解释的室内场景。与以往常常遭受几何漂移或尺度模糊的文本驱动方法不同,我们的方法在整个生成过程中保持绝对的世界坐标框架。从文本场景描述开始,我们预测一个全局3D布局,编码了语义和几何结构,作为下游阶段的指导代理。然后,一个基于语义和深度条件的全景扩散模型合成与全局布局对齐的360°图像,显著提高了空间一致性。为了探索未观察到的区域,我们采用了一个视频扩散模型,该模型由优化的相机轨迹指导,平衡覆盖率和碰撞避免,与全面路径探索相比,采样速度提高了多达10倍。生成的视图通过3D高斯泼溅融合,产生一个一致且完全可导航的绝对尺度3D场景。GuidedSceneGen能够准确地将对象姿态和语义标签从布局转移到重建,并支持在不重新对齐的情况下进行渐进式场景扩展。定量结果和用户研究表明,与最近的全景文本到3D基线相比,具有更高的3D一致性和布局合理性。
cs.CV / 115 / 2603.13912

Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

面向无约束自我中心视频中稳定的自监督物体表征
Tan, Yuting, Cheng, Xilong, Qin, Yunxiao, Li, Zhengnan, Zhang, Jingjing
Abstract
Humans develop visual intelligence through perceiving and interacting with their environment - a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
Chinese Translation
人类通过感知和与环境互动来发展视觉智能——这一过程是基于自我中心经验的自监督学习。受到此启发,我们提出一个问题:人工系统如何能够在不依赖手动标注的情况下,从连续的、未经整理的第一人称视频中学习稳定的物体表征?这一设置面临着在杂乱、遮挡和自我运动中分离、识别和持续跟踪物体的挑战。我们提出了EgoViT,一个统一的视觉Transformer框架,旨在从未标记的自我中心视频中学习稳定的物体表征。EgoViT通过三种协同机制共同发现和稳定“原型物体”,从而启动这一学习过程:(1) 原型物体学习(Proto-object Learning),利用帧内蒸馏形成区分性表征;(2) 深度正则化(Depth Regularization),将这些表征与几何结构相结合;(3) 教师过滤的时间一致性(Teacher-Filtered Temporal Consistency),在时间上强制保持身份一致性。这创造了一个良性循环,使得初步的物体假设逐步被精炼为稳定、持久的表征。该框架在未标记的第一人称视频上进行端到端训练,并对不同来源和质量的几何先验表现出鲁棒性。在标准基准测试中,EgoViT在无监督物体发现中实现了+8.0%的CorLoc提升,在语义分割中实现了+4.8%的mIoU提升,展示了其为具身智能中的鲁棒视觉抽象奠定基础的潜力。
cs.CV / 116 / 2603.13917

Evaluation of Visual Place Recognition Methods for Image Pair Retrieval in 3D Vision and Robotics

三维视觉与机器人领域中图像对检索的视觉地点识别方法评估
Haitz, Dennis, Shetty, Athradi Shritish, Weinmann, Michael, Ulrich, Markus
Abstract
Visual Place Recognition (VPR) is a core component in computer vision, typically formulated as an image retrieval task for localization, mapping, and navigation. In this work, we instead study VPR as an image pair retrieval front-end for registration pipelines, where the goal is to find top-matching image pairs between two disjoint image sets for downstream tasks such as scene registration, SLAM, and Structure-from-Motion. We comparatively evaluate state-of-the-art VPR families - NetVLAD-style baselines, classification-based global descriptors (CosPlace, EigenPlaces), feature-mixing (MixVPR), and foundation-model-driven methods (AnyLoc, SALAD, MegaLoc) - on three challenging datasets: object-centric outdoor scenes (Tanks and Temples), indoor RGB-D scans (ScanNet-GS), and autonomous-driving sequences (KITTI). We show that modern global descriptor approaches are increasingly suitable as off-the-shelf image pair retrieval modules in challenging scenarios including perceptual aliasing and incomplete sequences, while exhibiting clear, domain-dependent strengths and weaknesses that are critical when choosing VPR components for robust mapping and registration.
Chinese Translation
视觉地点识别(VPR)是计算机视觉中的核心组成部分,通常被表述为定位、建图和导航的图像检索任务。在本研究中,我们将VPR视为配准管道的图像对检索前端,目标是在两个不相交的图像集中找到最佳匹配的图像对,以支持后续任务,如场景配准、同步定位与地图构建(SLAM)以及运动恢复结构(Structure-from-Motion)。我们对最先进的VPR方法进行了比较评估,包括NetVLAD风格的基线、基于分类的全局描述符(CosPlace、EigenPlaces)、特征混合(MixVPR)以及基于基础模型的方法(AnyLoc、SALAD、MegaLoc),在三个具有挑战性的数据集上进行测试:以物体为中心的户外场景(Tanks and Temples)、室内RGB-D扫描(ScanNet-GS)和自动驾驶序列(KITTI)。我们展示了现代全局描述符方法在包括感知混叠和不完整序列等挑战场景中,越来越适合作为现成的图像对检索模块,同时展现出明显的领域相关优势和劣势,这在选择VPR组件以实现稳健的建图与配准时至关重要。
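Image pair retrieval with global descriptors reduces, at its simplest, to ranking all cross-set similarities; a minimal sketch, assuming L2-normalized descriptors from any of the evaluated model families:

```python
import numpy as np

def top_matching_pairs(desc_a, desc_b, k=5):
    """Return the k best (index_a, index_b, score) pairs between two
    disjoint image sets. desc_a: (Na, D), desc_b: (Nb, D), L2-normalized."""
    sim = desc_a @ desc_b.T                       # cosine similarities
    flat = np.argsort(sim, axis=None)[::-1][:k]   # k highest overall
    ia, ib = np.unravel_index(flat, sim.shape)
    return list(zip(ia.tolist(), ib.tolist(), sim[ia, ib].tolist()))
```

The retrieved pairs can then be handed to a local-feature matcher and pose solver in a registration or Structure-from-Motion pipeline.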
cs.CV / 117 / 2603.13919

OpenCOOD-Air: Prompting Heterogeneous Ground-Air Collaborative Perception with Spatial Conversion and Offset Prediction

OpenCOOD-Air:通过空间转换和偏移预测促进异构地面-空中协同感知
Wu, Xianke, Bai, Songlin, Li, Chengxiang, Luo, Zhiyao, Tian, Yulin, Zhu, Fenghua, Lv, Yisheng, Tian, Yonglin
Abstract
While Vehicle-to-Vehicle (V2V) collaboration extends sensing ranges through multi-agent data sharing, its reliability remains severely constrained by ground-level occlusions and the limited perspective of chassis-mounted sensors, which often result in critical perception blind spots. We propose OpenCOOD-Air, a novel framework that integrates UAVs as extensible platforms into V2V collaborative perception to overcome these constraints. To mitigate gradient interference from ground-air domain gaps and data sparsity, we adopt a transfer learning strategy to fine-tune UAV weights from pre-trained V2V models. To prevent the spatial information loss inherent in this transition, we formulate ground-air collaborative perception as a heterogeneous integration task with explicit altitude supervision and introduce a Cross-Domain Spatial Converter (CDSC) and a Spatial Offset Prediction Transformer (SOPT). Furthermore, we present the OPV2V-Air benchmark to validate the transition from V2V to Vehicle-to-Vehicle-to-UAV. Compared to state-of-the-art methods, our approach improves 2D and 3D AP@0.5 by 4% and 7%, respectively.
Chinese Translation
尽管车辆间协作(Vehicle-to-Vehicle, V2V)通过多智能体数据共享扩展了感知范围,但其可靠性仍受到地面遮挡和底盘安装传感器视角有限的严重制约,这常常导致关键的感知盲区。我们提出了OpenCOOD-Air,一个将无人机(UAV)作为可扩展平台集成到V2V协同感知中的新框架,以克服这些限制。为了减轻来自地面-空中领域差距和数据稀疏性带来的梯度干扰,我们采用迁移学习策略,从预训练的V2V模型中微调UAV权重。为了防止这一过渡中固有的空间信息丢失,我们将地面-空中协同感知形式化为一个具有明确高度监督的异构集成任务,并引入了跨域空间转换器(Cross-Domain Spatial Converter, CDSC)和空间偏移预测变换器(Spatial Offset Prediction Transformer, SOPT)。此外,我们提出了OPV2V-Air基准,以验证从V2V到车辆-车辆-无人机(Vehicle-to-Vehicle-to-UAV)的过渡。与最先进的方法相比,我们的方法在2D和3D AP@0.5上分别提高了4%和7%。
cs.CV / 118 / 2603.13928

Discriminative Flow Matching Via Local Generative Predictors

通过局部生成预测器进行判别流匹配
Jha, Om Govind, Bamniya, Manoj, Borthakur, Ayon
Abstract
Traditional discriminative computer vision relies predominantly on static projections, mapping input features to outputs in a single computational step. Although efficient, this paradigm lacks the iterative refinement and robustness inherent in biological vision and modern generative modelling. In this paper, we propose Discriminative Flow Matching, a framework that reformulates classification and object detection as a conditional transport process. By learning a vector field that continuously transports samples from a simple noise distribution toward a task-aligned target manifold, such as class embeddings or bounding box coordinates, our method operates at the interface between generative and discriminative learning. Our approach attaches multiple independent flow predictors to a shared backbone. These predictors are trained using local flow matching objectives, where gradients are computed independently for each block. We formulate this approach for standard image classification and extend it to the complex task of object detection, where targets are high-dimensional and spatially distributed. This architecture provides the flexibility to update blocks either sequentially to minimise activation memory or in parallel to suit different hardware constraints. By aggregating the predictions from these independent flow predictors, our framework enables robust, generative-inspired inference across diverse architectures, including CNNs and vision transformers.
Chinese Translation
传统的判别计算机视觉主要依赖静态投影,在单一计算步骤中将输入特征映射到输出。尽管这种范式高效,但缺乏生物视觉和现代生成建模所固有的迭代精炼和鲁棒性。本文提出了判别流匹配(Discriminative Flow Matching),一个将分类和目标检测重新表述为条件传输过程的框架。通过学习一个向量场,将样本从简单的噪声分布持续传输到与任务对齐的目标流形——例如类别嵌入或边界框坐标——我们的方法处于生成学习和判别学习的交界处。我们的方法将多个独立的流预测器附加到共享的主干网络。这些预测器使用局部流匹配目标进行训练,其中每个块的梯度独立计算。我们将这种方法应用于标准图像分类,并扩展到复杂的目标检测任务,其中目标是高维且空间分布的。该架构提供了灵活性,可以顺序更新块以最小化激活内存,或并行更新以适应不同的硬件约束。通过聚合这些独立流预测器的预测,我们的框架能够在包括卷积神经网络(CNN)和视觉变换器(vision transformers)在内的多种架构中实现鲁棒的、受生成启发的推理。
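A minimal sketch of one local flow-matching objective, assuming a rectified-flow (linear interpolation) path from noise to class embeddings; the predictor interface, conditioning by concatenation, and detaching the backbone features (to keep gradients blockwise-local) are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def local_flow_matching_loss(predictor, feats, class_emb):
    """feats: (B, Fdim) backbone features for one block (detached so the
    loss trains only this block's predictor); class_emb: (B, D) targets
    on the task-aligned manifold."""
    feats = feats.detach()
    x0 = torch.randn_like(class_emb)              # simple noise source
    t = torch.rand(class_emb.size(0), 1, device=class_emb.device)
    xt = (1 - t) * x0 + t * class_emb             # point on the path
    v_target = class_emb - x0                     # constant velocity
    v_pred = predictor(torch.cat([xt, feats, t], dim=1))
    return F.mse_loss(v_pred, v_target)
```

At inference, integrating the learned vector field from noise and reading off the nearest class embedding yields the iterative, generative-style prediction the abstract describes.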
cs.CV / 119 / 2603.13941

Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting

高分辨率RGB与低分辨率HSI的双向交叉注意力融合用于多模态自动化垃圾分类
Funk, Jonas V., Roming, Lukas, Michel, Andreas, Bäcker, Paul, Maier, Georg, Längle, Thomas, Klute, Markus
Abstract
Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).
Chinese Translation
日益增长的垃圾流和向循环经济的转型需要高效的自动化垃圾分类。在工业环境中,材料在快速传送带上移动,可靠的识别和排除要求像素级的精确分割。RGB成像提供了高分辨率的空间细节,这对于准确分割至关重要,但在可见光谱中会混淆看似相似的材料。高光谱成像(HSI)提供了分离这些材料的光谱特征,但其较低的空间分辨率限制了细节。因此,有效的垃圾分类需要融合两种模态的方法,以利用它们的互补优势。我们提出了双向交叉注意力融合(BCAF),该方法通过局部的双向交叉注意力将高分辨率RGB与低分辨率HSI在其原生网格上对齐,避免了预上采样或早期光谱崩溃。BCAF使用两个独立的骨干网络:一个标准的Swin Transformer用于RGB,另一个适应HSI的Swin骨干网络通过3D标记化和光谱自注意力保留光谱结构。我们还分析了RGB输入分辨率与HSI光谱切片数量之间的权衡。尽管我们的评估目标是RGB-HSI融合,但BCAF是模态无关的,适用于与低分辨率、高通道辅助传感器共同注册的RGB。在基准数据集SpectralWaste上,BCAF以31帧/秒的速度达到了76.4%的mIoU,55帧/秒时达到了75.4%的mIoU。我们进一步评估了一个新型工业数据集:K3I-Cycling(首个RGB子集已在Fordatis上发布)。在该数据集上,BCAF在材料分割(纸张、金属、塑料等)中达到了62.3%的mIoU,在塑料类型分割(PET、PP、HDPE、LDPE、PS等)中达到了66.2%的mIoU。
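The core fusion step can be sketched with two standard attention modules, one per direction, applied at each modality's native token grid; the dimensions and residual scheme below are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """RGB tokens attend to HSI tokens and vice versa, with no
    pre-upsampling of the low-resolution hyperspectral stream."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.rgb_from_hsi = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.hsi_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tok, hsi_tok):
        # rgb_tok: (B, N_rgb, dim) many spatially fine tokens;
        # hsi_tok: (B, N_hsi, dim) fewer, spectrally rich tokens.
        rgb_out, _ = self.rgb_from_hsi(rgb_tok, hsi_tok, hsi_tok)
        hsi_out, _ = self.hsi_from_rgb(hsi_tok, rgb_tok, rgb_tok)
        return rgb_tok + rgb_out, hsi_tok + hsi_out   # residual fusion
```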
cs.CV / 120 / 2603.13943

Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing

Sat-JEPA-Diff:桥接自监督学习与生成扩散在遥感中的应用
Komurcu, Kursat, Petkevicius, Linas
Abstract
Predicting satellite imagery requires a balance between structural accuracy and textural detail. Standard deterministic methods like PredRNN or SimVP minimize pixel-based errors but suffer from the "regression to the mean" problem, producing blurry outputs that obscure subtle geographic-spatial features. Generative models provide realistic textures but often introduce misleading structural anomalies. To bridge this gap, we introduce Sat-JEPA-Diff, which combines Self-Supervised Learning (SSL) with Latent Diffusion Models (LDM). An IJEPA module predicts stable semantic representations, which then guide a frozen Stable Diffusion backbone via a lightweight cross-attention adapter. This ensures that the synthesized high-accuracy textures are grounded in structurally accurate predictions. Evaluated on a global Sentinel-2 dataset, Sat-JEPA-Diff excels at resolving sharp boundaries. It achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, despite standard autoregressive stability limits. The code and dataset are publicly available on https://github.com/VU-AIML/SAT-JEPA-DIFF.
Chinese Translation
预测卫星图像需要在结构准确性和纹理细节之间取得平衡。标准的确定性方法如PredRNN或SimVP最小化基于像素的误差,但存在“回归均值”问题,导致生成模糊的输出,掩盖微妙的地理空间特征。生成模型提供逼真的纹理,但往往引入误导性的结构异常。为了解决这一问题,我们提出了Sat-JEPA-Diff,它将自监督学习(Self-Supervised Learning, SSL)与潜在扩散模型(Latent Diffusion Models, LDM)相结合。IJEPA模块预测稳定的语义表示,然后通过轻量级的交叉注意力适配器引导冻结的Stable Diffusion骨干网络。这确保合成的高精度纹理基于结构上准确的预测。在全球Sentinel-2数据集上的评估中,Sat-JEPA-Diff在解析锐利边界方面表现出色。它在感知评分上取得领先(GSSIM: 0.8984, FID: 0.1475),并显著优于确定性基线,尽管存在标准自回归的稳定性限制。代码和数据集已公开发布在https://github.com/VU-AIML/SAT-JEPA-DIFF。
cs.CV / 121 / 2603.13951

DCP-CLIP: A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction

DCP-CLIP:一种基于双重交互的开放词汇语义分割的粗到细框架
Wang, Jing, Shi, Huimin, Zhou, Quan, Liu, Qibo, Zhang, Suofei, Lu, Huimin
Abstract
Recent years have witnessed remarkable development in open-vocabulary semantic segmentation (OVSS) using vision-language foundation models, yet existing approaches still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between the textual and visual spaces, and (2) significant computational costs from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly relied on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross-modally integrating semantic information from the textual guidance into the visual representations, and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.
Chinese Translation
近年来,基于视觉-语言基础模型的开放词汇语义分割(OVSS)取得了显著发展,但仍面临以下基本挑战:(1)文本和视觉空间之间的跨模态通信不足,以及(2)与大量类别互动带来的显著计算成本。为了解决这些问题,本文提出了一种新颖的粗到细框架,称为 DCP-CLIP,用于 OVSS。与主要依赖预先建立的类别内容及 CLIP 的固有空间-类别互动能力的以往工作不同,我们动态构建与类别相关的文本特征,并明确建模空间图像特征与文本类别语义之间的双重互动。具体而言,我们首先利用 CLIP 的开放词汇识别能力识别与图像上下文相关的语义类别,并动态生成相应的文本特征,以作为初始文本指导。随后,我们通过跨模态整合文本指导中的语义信息到视觉表示中进行粗分割,并通过整合来自编码器的空间增强特征来实现更精细的分割,以恢复细粒度细节并提升空间分辨率。最后,我们利用分割方的空间信息来细化每个掩膜的类别预测,从而促进更精确的语义标注。在多个 OVSS 基准测试上的实验结果表明,DCP-CLIP 通过提供更高的准确性和更高的效率,优于现有方法。
cs.CV / 122 / 2603.13960

IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation

IMS3:打破基于扩散的数据集蒸馏中的分布聚合
Wang, Chenru, Chen, Yunyi, Yang, Zijun, Zhou, Joey Tianyi, Zhang, Chi
Abstract
Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling (S^3) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.
Chinese Translation
数据集蒸馏旨在合成紧凑的数据集,以近似大规模真实数据集的训练效果,为现代深度学习日益增长的计算需求提供高效解决方案。最近,基于扩散的数据集蒸馏方法通过利用扩散模型强大的生成能力,展示了良好的前景,能够生成多样且结构一致的样本。然而,仍然存在一个根本性的目标不一致:扩散模型是针对生成似然进行优化,而非判别效用,导致在高密度区域过度集中,并且对分类至关重要的边界样本覆盖不足。为了解决这个问题,我们提出了两种互补策略。反演匹配(Inversion-Matching, IM)引入了一种反演引导的微调过程,使去噪轨迹与其反演对应物对齐,从而扩大分布覆盖并增强多样性。选择性子组采样(Selective Subgroup Sampling, S^3)是一种无训练的采样机制,通过选择既具代表性又具独特性的合成子集来提高类间可分离性。大量实验表明,我们的方法显著提升了蒸馏数据集的判别质量和泛化能力,在基于扩散的方法中实现了最先进的性能。
cs.CV / 123 / 2603.13961

USIS-PGM: Photometric Gaussian Mixtures for Underwater Salient Instance Segmentation

USIS-PGM:用于水下显著实例分割的光度高斯混合模型
Hong, Lin, Yao, Xiangtong, Bozkurt, Mürüvvet, Wang, Xin, Zhang, Fumin
Abstract
Underwater salient instance segmentation (USIS) is crucial for marine robotic systems, as it enables both underwater salient object detection and instance-level mask prediction for visual scene understanding. Compared with its terrestrial counterpart, USIS is more challenging due to the underwater image degradation. To address this issue, this paper proposes USIS-PGM, a single-stage framework for USIS. Specifically, the encoder enhances boundary cues through a frequency-aware module and performs content-adaptive feature reweighting via a dynamic weighting module. The decoder incorporates a Transformer-based instance activation module to better distinguish salient instances. In addition, USIS-PGM employs multi-scale Gaussian heatmaps generated from ground-truth masks through Photometric Gaussian Mixture (PGM) to supervise intermediate decoder features, thereby improving salient instance localization and producing more structurally coherent mask predictions. Experimental results demonstrate the superiority and practical applicability of the proposed USIS-PGM model.
Chinese Translation
水下显著实例分割(USIS)对于海洋机器人系统至关重要,因为它能够实现水下显著物体检测和实例级掩膜预测,从而促进视觉场景理解。与陆地对应任务相比,由于水下图像的退化,USIS 更具挑战性。为了解决这一问题,本文提出了 USIS-PGM,一种用于 USIS 的单阶段框架。具体而言,编码器通过频率感知模块增强边界线索,并通过动态加权模块执行内容自适应特征重加权。解码器结合了基于 Transformer 的实例激活模块,以更好地区分显著实例。此外,USIS-PGM 采用通过光度高斯混合(PGM)从真实掩膜生成的多尺度高斯热图来监督中间解码器特征,从而改善显著实例定位并生成更具结构一致性的掩膜预测。实验结果证明了所提出的 USIS-PGM 模型的优越性和实际应用性。
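One plausible reading of mask-derived Gaussian supervision is to render, per instance, a Gaussian centred at the mask centroid with spread tied to the mask extent, then sum instances into a mixture and rescale it to each decoder resolution; the sketch below implements that reading and is not the paper's exact PGM construction.

```python
import numpy as np

def gaussian_heatmap_from_mask(mask, sigma_scale=0.5):
    """Render a Gaussian heatmap from one binary instance mask (H, W)."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                 # mask centroid
    sy = max(ys.std() * sigma_scale, 1.0)         # spread from extent
    sx = max(xs.std() * sigma_scale, 1.0)
    yy, xx = np.mgrid[:mask.shape[0], :mask.shape[1]]
    return np.exp(-(((yy - cy) / sy) ** 2 + ((xx - cx) / sx) ** 2) / 2.0)

# Summing the per-instance heatmaps yields a Gaussian mixture that can be
# downsampled to supervise intermediate decoder features at each scale.
```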
cs.CV / 124 / 2603.13964

VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction

VID-AD:一种用于视觉诱导干扰下图像级逻辑异常检测的数据集
Nakata, Hiroto, Zou, Yawen, Sakai, Shunsuke, Maeda, Shun, Gu, Chunzhi, Wei, Yijin, Gao, Shangce, Zhang, Chao
Abstract
Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. However, existing benchmarks rarely provide controlled settings where logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: https://github.com/nkthiroto/VID-AD.
Chinese Translation
工业检测中的逻辑异常检测仍然面临挑战,因为视觉外观的变化(例如,背景杂乱、光照变化和模糊)常常使以视觉为中心的检测器无法识别规则级别的违规行为。然而,现有的基准很少提供在逻辑状态固定而干扰因素变化的受控环境下进行评估。为了解决这一问题,我们引入了VID-AD,一个用于视觉诱导干扰下逻辑异常检测的数据集。该数据集包含10个制造场景和5种捕获条件,总计50个单类任务和10,395张图像。每个场景由从数量、长度、类型、位置和关系中选择的两个逻辑约束定义,异常包括单约束和组合违规。我们进一步提出了一种基于语言的异常检测框架,该框架仅依赖于从正常图像生成的文本描述。通过使用对比学习,结合正向文本和基于矛盾的负向文本,我们的方法学习到的嵌入能够捕捉逻辑属性,而非低级特征。大量实验表明,在评估的设置中,我们的方法在基线之上表现出一致的改进。数据集可在以下网址获取:https://github.com/nkthiroto/VID-AD。
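At inference, a language-based detector of this kind can score an image by comparing its embedding against normal-rule texts and contradiction-based texts; the margin rule below is an illustrative decision over precomputed, L2-normalized CLIP-style embeddings, not the paper's exact objective.

```python
import numpy as np

def logical_anomaly_score(img_emb, pos_text_emb, neg_text_emb):
    """img_emb: (D,); pos_text_emb: (P, D) embeddings of normal-rule
    descriptions; neg_text_emb: (N, D) embeddings of synthesized
    contradictions. All inputs L2-normalized."""
    sim_pos = (pos_text_emb @ img_emb).max()   # best-matching normal rule
    sim_neg = (neg_text_emb @ img_emb).max()   # best-matching violation
    return float(sim_neg - sim_pos)            # > 0 hints at a violation
```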
cs.CV / 125 / 2603.13969

Leveraging a Statistical Shape Model for Efficient Generation of Annotated Training Data: A Case Study on Liver Landmarks Segmentation

利用统计形状模型高效生成标注训练数据:以肝脏解剖标志分割为案例研究
Krnjaca, Denis, Krames, Lorena, Nahm, Werner
Abstract
Anatomical landmark segmentation serves as a critical initial step for robust multimodal registration during computer-assisted interventions. Current approaches predominantly rely on deep learning, which often necessitates the extensive manual generation of annotated datasets. In this paper, we present a novel strategy for creating large annotated datasets using a statistical shape model (SSM) based on a mean shape that is manually labeled only once. We demonstrate the method's efficacy through its application to deep-learning-based anatomical landmark segmentation, specifically targeting the detection of the anterior ridge and the falciform ligament in 3D liver shapes. A specialized deep learning network was trained with 8,800 annotated liver shapes generated by the SSM. The network's performance was evaluated on 500 unseen synthetic SSM shapes, yielding a mean Intersection over Union of 91.4% (87.4% for the anterior ridge and 87.6% for the falciform ligament). Subsequently, the network was applied to clinical patient liver shapes, with qualitative evaluation indicating promising results and highlighting the generalizability of the proposed approach. Our findings suggest that the SSM-based data generation approach alleviates the labor-intensive process of manual labeling while enabling the creation of large annotated training datasets for machine learning. Although our study focuses on liver anatomy, the proposed methodology holds potential for a broad range of applications where annotated training datasets play a pivotal role in developing accurate deep-learning models.
Chinese Translation
解剖标志分割是计算机辅助干预中进行稳健多模态配准的关键初步步骤。目前的方法主要依赖深度学习,这通常需要大量手动生成标注数据集。在本文中,我们提出了一种基于统计形状模型(SSM)创建大规模标注数据集的新策略,该模型基于仅手动标注一次的均值形状。我们通过将该方法应用于基于深度学习的解剖标志分割,特别是针对3D肝脏形状中前缘和镰状韧带的检测,展示了该方法的有效性。我们使用SSM生成的8,800个标注肝脏形状训练了一个专门的深度学习网络。该网络在500个未见的合成SSM形状上进行性能评估,得到的平均交并比为91.4%(前缘为87.4%,镰状韧带为87.6%)。随后,该网络被应用于临床患者的肝脏形状,定性评估显示出良好的结果,并突显了所提方法的普适性。我们的研究结果表明,基于SSM的数据生成方法减轻了手动标注的劳动强度,同时能够为机器学习创建大规模的标注训练数据集。尽管我们的研究集中于肝脏解剖,但所提方法在需要标注训练数据集的广泛应用中具有潜在价值,这些数据集在开发准确的深度学习模型中发挥着关键作用。
cs.CV / 126 / 2603.13978

When Visual Privacy Protection Meets Multimodal Large Language Models

视觉隐私保护与多模态大型语言模型的交汇
Hui, Xiaofei, Wu, Qian, Qu, Haoxuan, Mirmehdi, Majid, Rahmani, Hossein, Liu, Jun
Abstract
The emergence of Multimodal Large Language Models (MLLMs) and the widespread usage of MLLM cloud services such as GPT-4V raised great concerns about privacy leakage in visual data. As these models are typically deployed in cloud services, users are required to submit their images and videos, posing serious privacy risks. However, how to tackle such privacy concerns is an under-explored problem. Thus, in this paper, we aim to conduct a new investigation to protect visual privacy when enjoying the convenience brought by MLLM services. We address the practical case where the MLLM is a "black box", i.e., we only have access to its input and output without knowing its internal model information. To tackle such a challenging yet demanding problem, we propose a novel framework, in which we carefully design the learning objective with Pareto optimality to seek a better trade-off between visual privacy and MLLM's performance, and propose critical-history enhanced optimization to effectively optimize the framework with the black-box MLLM. Our experiments show that our method is effective on different benchmarks.
Chinese Translation
多模态大型语言模型(MLLMs)的出现以及GPT-4V等MLLM云服务的广泛使用引发了对视觉数据隐私泄露的重大关注。由于这些模型通常部署在云服务中,用户需要提交他们的图像和视频,这带来了严重的隐私风险。然而,如何解决这些隐私问题仍然是一个未被充分探索的问题。因此,在本文中,我们旨在进行一项新的研究,以保护视觉隐私,同时享受MLLM服务带来的便利。我们关注的实际情况是,MLLM是一个“黑箱”,即我们只能访问其输入和输出,而无法了解其内部模型信息。为了解决这一具有挑战性但又迫切需要解决的问题,我们提出了一个新颖的框架,在该框架中,我们仔细设计了学习目标,以帕累托最优性为基础,寻求视觉隐私与MLLM性能之间的更好平衡,并提出了关键历史增强优化,以有效优化与黑箱MLLM的框架。我们的实验表明,我们的方法在不同基准上是有效的。
cs.CV / 127 / 2603.13993

VAD4Space: Visual Anomaly Detection for Planetary Surface Imagery

VAD4Space:行星表面影像的视觉异常检测
Genilotti, Fabrizio, Stropeni, Arianna, Borsatti, Francesco, Barusco, Manuel, Pezze, Davide Dalle, Susto, Gian Antonio
Abstract
Space missions generate massive volumes of high-resolution orbital and surface imagery that far exceed the capacity for manual inspection. Detecting rare phenomena is scientifically critical, yet traditional supervised learning struggles due to scarce labeled examples and closed-world assumptions that prevent discovery of genuinely novel observations. In this work, we investigate Visual Anomaly Detection (VAD) as a framework for automated discovery in planetary exploration. We present the first empirical evaluation of state-of-the-art feature-based VAD methods on real planetary imagery, encompassing both orbital lunar data and Mars rover surface imagery. To support this evaluation, we introduce two benchmarks: (i) a lunar dataset derived from Lunar Reconnaissance Orbiter Camera Narrow Angle imagery, comprising fresh and degraded craters as anomalies alongside normal terrain; and (ii) a Mars surface dataset designed to reflect the characteristics of rover-acquired imagery. We evaluate multiple VAD approaches with a focus on computationally efficient, edge-oriented solutions suitable for onboard deployment, applicable to both orbital platforms surveying the lunar surface and surface rovers operating on Mars. Our results demonstrate that feature-based VAD methods can effectively identify rare planetary surface phenomena while remaining feasible for resource-constrained environments. By grounding anomaly detection in planetary science, this work establishes practical benchmarks and highlights the potential of open-world perception systems to support a range of mission-critical applications, including tactical planning, landing site selection, hazard detection, bandwidth-aware data prioritization, and the discovery of unanticipated geological processes.
Chinese Translation
太空任务生成的大量高分辨率轨道和表面影像远远超出了人工检查的能力。检测稀有现象在科学上至关重要,但传统的监督学习由于标记样本稀缺和封闭世界假设而难以发现真正的新颖观察。在本研究中,我们探讨了视觉异常检测(Visual Anomaly Detection, VAD)作为行星探索中自动发现的框架。我们首次对最先进的基于特征的 VAD 方法在真实行星影像上进行了实证评估,涵盖轨道月球数据和火星探测器表面影像。为支持这一评估,我们引入了两个基准: (i) 一个来自月球勘测轨道器相机窄角影像的月球数据集,包含新鲜和退化的陨石坑作为异常,以及正常地形; (ii) 一个设计用于反映探测器获取影像特征的火星表面数据集。我们评估了多种 VAD 方法,重点关注适合于机载部署的计算效率高、以边缘为导向的解决方案,适用于对月球表面进行调查的轨道平台和在火星上操作的表面探测器。我们的结果表明,基于特征的 VAD 方法能够有效识别稀有的行星表面现象,同时在资源受限的环境中仍然可行。通过将异常检测与行星科学结合,本研究建立了实用基准,并强调开放世界感知系统在支持一系列任务关键应用中的潜力,包括战术规划、着陆点选择、危险检测、带宽感知的数据优先级排序,以及意想不到的地质过程的发现。
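Feature-based VAD of the kind evaluated here typically scores test features by their distance to a memory bank of normal features; the generic kNN sketch below (in the spirit of PatchCore-style detectors) stands in for the specific methods benchmarked.

```python
import numpy as np

def knn_anomaly_scores(train_feats, test_feats, k=3):
    """train_feats: (N, D) features of normal terrain only;
    test_feats: (M, D). Returns one score per test feature, where a
    large distance to the nearest normals suggests an anomaly."""
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    knn = np.sort(d2, axis=1)[:, :k]        # k smallest squared distances
    return np.sqrt(knn).mean(axis=1)
```

The absence of any anomalous training data is what makes this family attractive for open-world planetary settings, where the phenomena of interest are by definition unanticipated.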
cs.CV / 128 / 2603.13994

Human-like Object Grouping in Self-supervised Vision Transformers

自监督视觉变换器中的类人对象分组
Adeli, Hossein, Ahn, Seoyoung, Luo, Andrew, Zhang, Mengmi, Kriegeskorte, Nikolaus, Zelinsky, Gregory
Abstract
Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
Chinese Translation
通过自监督目标训练的视觉基础模型在多种任务中表现出色,并展现出新兴的对象分割特性。然而,它们与人类对象感知的对齐程度仍然不甚清楚。在此,我们引入了一种行为基准,参与者需要对自然场景中的点对进行同/异对象判断,将经典的心理物理学范式扩展到超过1000次试验。我们使用从视觉模型的表示中提取的简单输出,测试了一组多样的视觉模型,以预测受试者的反应时间。我们观察到模型代际间的持续改进,架构和训练目标均对对齐产生影响,而采用DINO自监督目标训练的基于变换器的模型表现最强。为了探究这种改进的来源,我们提出了一种新颖的度量标准,通过测量对象内和对象间的补丁相似性来量化表示的对象中心成分。在各模型中,较强的对象中心结构更准确地预测人类的分割行为。我们进一步表明,通过蒸馏将监督变换器模型的Gram矩阵(捕捉图像补丁间的相似性结构)与自监督模型的Gram矩阵匹配,可以改善它们与人类行为的对齐,符合之前发现的Gram锚定改善DINOv3特征质量的结论。综上所述,这些结果表明,自监督视觉模型以类人方式捕捉对象结构,而Gram矩阵结构在驱动感知对齐中发挥了作用。
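The proposed object-centric metric can be instantiated as mean within-object minus mean between-object patch similarity; the sketch below is one such instantiation over patch embeddings labeled by object masks, with the cosine similarity and diagonal exclusion chosen for illustration.

```python
import numpy as np

def object_centric_score(patch_emb, patch_labels):
    """patch_emb: (P, D) L2-normalized patch embeddings;
    patch_labels: (P,) object id per patch (needs >= 2 objects)."""
    sim = patch_emb @ patch_emb.T
    same = patch_labels[:, None] == patch_labels[None, :]
    off_diag = ~np.eye(len(patch_labels), dtype=bool)
    within = sim[same & off_diag].mean()
    between = sim[~same].mean()
    return within - between   # larger = more object-centric structure
```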
cs.CV / 129 / 2603.14001

PhyGaP: Physically-Grounded Gaussians with Polarization Cues

PhyGaP:具有偏振线索的物理基础高斯模型
Wu, Jiale, Bai, Xiaoyang, He, Zongqi, Xu, Weiwei, Peng, Yifan
Abstract
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability. Our code will be released soon.
Chinese Translation
最近在三维高斯泼溅(3D Gaussian Splatting, 3DGS)方面的进展在建模反射性三维物体及其与环境的交互方面取得了显著成功,尤其是通过延迟渲染(Deferred Rendering, DR)技术。然而,现有方法通常难以正确重建物理属性,如反照率和反射率,因此不支持高保真度的重光照。我们观察到这一限制源于RGB图像中缺乏形状和材料信息,因此我们提出了PhyGaP,一种基于物理的3DGS方法,利用偏振线索促进精确的反射分解和重建物体的视觉一致性重光照。具体而言,我们设计了一种通过反射建模偏振的偏振延迟渲染(Polarimetric Deferred Rendering, PolarDR)过程,以及一种自遮挡感知的环境图构建技术(GridMap),以解决非凸物体的间接光照问题。我们在多个合成和真实场景上进行了验证,包括仅具有部分偏振线索的场景,结果表明PhyGaP不仅在重建反射性三维物体的外观和表面法线方面表现优异(在PSNR上平均比现有基于RGB的方法提高约2 dB,在余弦距离上提高45.7%),而且还实现了最先进的逆渲染和重光照能力。我们的代码将很快发布。
cs.CV / 130 / 2603.14004

U-Face: An Efficient and Generalizable Framework for Unsupervised Facial Attribute Editing via Subspace Learning

U-Face:一种高效且可泛化的无监督面部属性编辑框架,通过子空间学习实现
Liu, Bo, Cui, Xuan, Zeng, Run, Duan, Wei, Liu, Chongwen, Qian, Jinrui, Tang, Lianggui, Gan, Hongping
Abstract
Latent space-based facial attribute editing methods have gained popularity in applications such as digital entertainment, virtual avatar creation, and human-computer interaction systems due to their potential for efficient and flexible attribute manipulation, particularly for continuous edits. Among these, unsupervised latent space-based methods, which discover effective semantic vectors without relying on labeled data, have attracted considerable attention in the research community. However, existing methods still encounter difficulties in disentanglement, as manipulating a specific facial attribute may unintentionally affect other attributes, complicating fine-grained controllability. To address these challenges, we propose a novel framework designed to offer an effective and adaptable solution for unsupervised facial attribute editing, called Unsupervised Facial Attribute Controllable Editing (U-Face). The proposed method frames semantic vector learning as a subspace learning problem, where latent vectors are approximated within a lower-dimensional semantic subspace spanned by a semantic vector matrix. This formulation can also be equivalently interpreted from a projection-reconstruction perspective and further generalized into an autoencoder framework, providing a foundation that can support disentangled representation learning in a flexible manner. To improve disentanglement and controllability, we impose orthogonal non-negative constraints on the semantic vectors and incorporate attribute boundary vectors to reduce entanglement in the learned directions. Although these constraints make the optimization problem challenging, we design an alternating iterative algorithm, called Alternating Iterative Disentanglement and Controllability (AIDC), with closed-form updates and provable convergence under specific conditions.
Chinese Translation
基于潜在空间的面部属性编辑方法在数字娱乐、虚拟头像创建和人机交互系统等应用中越来越受到欢迎,因为它们在高效和灵活的属性操控方面具有潜力,特别是在连续编辑方面。在这些方法中,无监督的基于潜在空间的方法因其能够在不依赖标记数据的情况下发现有效的语义向量而受到研究界的广泛关注。然而,现有方法在解耦方面仍然面临困难,因为操控特定的面部属性可能会无意中影响其他属性,从而使得细粒度的可控性变得复杂。为了解决这些挑战,我们提出了一种新颖的框架,旨在为无监督面部属性编辑提供有效且可适应的解决方案,称为无监督面部属性可控编辑(U-Face)。该方法将语义向量学习框架化为一个子空间学习问题,其中潜在向量在由语义向量矩阵生成的低维语义子空间内进行近似。这种表述也可以从投影-重构的角度进行等效解释,并进一步推广到自编码器框架,为以灵活方式支持解耦表示学习提供基础。为了改善解耦性和可控性,我们对语义向量施加正交非负约束,并结合属性边界向量以减少学习方向中的纠缠。尽管这些约束使得优化问题变得复杂,但我们设计了一种交替迭代算法,称为交替迭代解耦与可控性(AIDC),该算法具有闭式更新和在特定条件下可证明的收敛性。
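The projection-reconstruction view of subspace learning admits a compact sketch: approximate a latent code in the span of the semantic vector matrix, then edit by moving along one column. Orthonormal columns are assumed below, which is what lets a single-attribute edit leave the other semantic coefficients untouched; the actual method additionally imposes non-negativity and boundary vectors.

```python
import numpy as np

def project_and_edit(z, W, k, alpha):
    """z: (D,) latent code; W: (D, K) semantic vector matrix with
    orthonormal columns; k: attribute index; alpha: edit strength."""
    coeffs = W.T @ z                   # coordinates in the subspace
    z_recon = W @ coeffs               # projection (reconstruction)
    return z_recon + alpha * W[:, k]   # continuous edit of attribute k
```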
cs.CV / 131 / 2603.14005

Towards Generalizable Deepfake Detection via Real Distribution Bias Correction

通过真实分布偏差校正实现可泛化的深伪检测
Liu, Ming-Hui, Cheng, Harry, Luo, Xin, Xu, Xin-Shun, Kankanhalli, Mohan S.
Abstract
To generalize deepfake detectors to future unseen forgeries, most existing methods attempt to simulate the dynamically evolving forgery types using available source domain data. However, predicting an unbounded set of future manipulations from limited prior examples is infeasible. To overcome this limitation, we propose to exploit the invariance of real data from two complementary perspectives: the fixed population distribution of the entire real class and the inherent Gaussianity of individual real images. Building on these properties, we introduce the Real Distribution Bias Correction (RDBC) framework, which consists of two key components: the Real Population Distribution Estimation module and the Distribution-Sampled Feature Whitening module. The former utilizes the independent and identically distributed (i.i.d.) property of real samples to derive the normal distribution form of their statistics, from which the distribution parameters can be estimated using limited source domain data. Based on the learned population distribution, the latter utilizes the inherent Gaussianity of real data as a discriminative prior and performs a sampling-based whitening operation to amplify the Gaussianity gap between real and fake samples. Through synergistic coupling of the two modules, our model captures the real-world properties of real samples, thereby enhancing its generalizability to unseen target domains. Extensive experiments demonstrate that RDBC achieves state-of-the-art performance in both in-domain and cross-domain deepfake detection.
Chinese Translation
为了使深伪检测器能够泛化到未来未见的伪造作品,大多数现有方法试图利用可用的源领域数据来模拟动态演变的伪造类型。然而,从有限的先前示例中预测一个无限的未来操控集合是不可行的。为了解决这一限制,我们提出从两个互补的角度利用真实数据的不变性:整个真实类别的固定总体分布和单个真实图像的固有高斯性。在这些特性基础上,我们引入了真实分布偏差校正(Real Distribution Bias Correction, RDBC)框架,该框架由两个关键组件组成:真实总体分布估计模块和分布采样特征白化模块。前者利用真实样本的独立同分布(independent and identically distributed, i.i.d.)特性推导其统计量的正态分布形式,从中可以使用有限的源领域数据估计分布参数。基于学习到的总体分布,后者利用真实数据的固有高斯性作为判别先验,并执行基于采样的白化操作,以放大真实样本与伪造样本之间的高斯性差距。通过这两个模块的协同耦合,我们的模型捕捉到真实样本的真实世界特性,从而增强了其对未见目标领域的泛化能力。大量实验表明,RDBC在领域内和跨领域的深伪检测中均实现了最先进的性能。
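A simplified stand-in for the whitening component: estimate mean and covariance from real-class features only, then apply ZCA whitening, after which real features should look near-Gaussian while fakes deviate. The plain empirical estimation and eps regularization below are generic choices, not the paper's distribution-sampled procedure.

```python
import numpy as np

def whiten_with_real_stats(feats, real_feats, eps=1e-5):
    """real_feats: (N, D) features of real images (i.i.d. by assumption);
    feats: (M, D) features to whiten with the real-class statistics."""
    mu = real_feats.mean(0)
    cov = np.cov(real_feats, rowvar=False) + eps * np.eye(real_feats.shape[1])
    w, V = np.linalg.eigh(cov)
    W = V @ np.diag(w ** -0.5) @ V.T        # ZCA whitening matrix
    return (feats - mu) @ W
```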
cs.CV / 132 / 2603.14012

Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification

领域泛化的人体重识别的多粒度视觉-语言对齐
Li, Jiachen, Gong, Xiaojin, Zhang, Dongping
Abstract
Domain Generalized person Re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, performance still leaves room for improvement. Recently, Vision-Language Models (VLMs) present outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID yields limited generalization improvement, because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP-based multi-grained vision-language alignment framework in this work. Specifically, several multi-grained prompts are introduced in the language modality to describe different body parts and align with their counterparts in the vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
Chinese Translation
领域泛化的人体重识别(DG Re-ID)是一项具有挑战性的任务,其中模型在源领域上训练,但在未见过的目标领域上进行测试。尽管之前的纯视觉模型取得了显著进展,但性能仍有进一步提升的空间。最近,视觉-语言模型(VLMs)在各种视觉应用中展现出卓越的泛化能力。然而,直接将VLM应用于重识别任务的泛化改进有限。这是因为VLM仅生成对身份细微差别不敏感的全局特征。为了解决这个问题,我们在本研究中提出了一种基于CLIP的多粒度视觉-语言对齐框架。具体而言,在语言模态中引入了多个多粒度提示,以描述不同的身体部位并与视觉模态中的对应部分对齐。为了获取细粒度的视觉信息,采用了一种自适应掩码的多头自注意力模块,以精确提取特定部位特征。为了训练所提出的模块,采用了一种基于MLLM的视觉定位专家,自动生成身体部位的伪标签进行监督。在单源和多源泛化协议下进行的大量实验表明,我们的方法具有优越的性能。实现代码将发布在 https://github.com/RikoLi/MUVA。
cs.CV / 133 / 2603.14021

EI-Part: Explode for Completion and Implode for Refinement

EI-Part:爆炸以完成,收缩以精炼
Sun, Wanhu, Luo, Zhongjin, Zheng, Heliang, Chang, Jiahao, Ye, Chongjie, He, Huiang, Zhao, Shengchu, Jia, Rongfei, Han, Xiaoguang
Abstract
Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components, characterized by strong structural coherence, geometric plausibility, geometric fidelity, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy fully leverages spatial resolution, enabling flexible part completion and fine geometric detail generation. To maintain structural coherence between parts, a self-attention mechanism is incorporated in both exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments on multiple benchmarks demonstrate that EI-Part efficiently produces semantically meaningful and structurally coherent parts with fine-grained geometric details, achieving state-of-the-art performance in part-level 3D generation. Project page: https://cvhadessun.github.io/EI-Part/
Chinese Translation
部件级三维生成对游戏、电影制作和工业设计等各种下游应用至关重要。然而,将三维形状分解为几何上合理且有意义的组件仍然是一个重大挑战。以往的基于部件的生成方法往往难以生成结构良好的部件,表现出结构一致性差、几何不合理、精度低或效率低下等问题。为了解决这些挑战,我们提出了EI-Part,一个专门设计用于生成高质量三维形状的框架,其组件具有强结构一致性、几何合理性、几何保真性和生成效率。我们建议在不同阶段采用不同的表示:在部件完成阶段使用爆炸状态,在几何精炼阶段使用收缩状态。这一策略充分利用空间分辨率,实现灵活的部件完成和细致的几何细节生成。为了保持部件之间的结构一致性,我们在爆炸和收缩状态中都引入了自注意力机制,促进生成过程中组件之间有效的信息感知和特征融合。在多个基准上的广泛实验表明,EI-Part能够高效生成语义上有意义且结构一致的部件,具备细致的几何细节,在部件级三维生成中实现了最先进的性能。项目页面:https://cvhadessun.github.io/EI-Part/
cs.CV / 134 / 2603.14022

A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations

对象中心场景表示中的层次结构的双曲视角
Madan, Neelu, Pujol, Àlex, Møgelmose, Andreas, Escalera, Sergio, Nasrollahi, Kamal, Taylor, Graham W., Moeslund, Thomas B.
Abstract
Slot attention has emerged as a powerful framework for unsupervised object-centric learning, decomposing visual scenes into a small set of compact vector representations called "slots", each capturing a distinct region or object. However, these slots are learned in Euclidean space, which provides no geometric inductive bias for the hierarchical relationships that naturally structure visual scenes. In this work, we propose a simple post-hoc pipeline to project Euclidean slot embeddings onto the Lorentz hyperboloid of hyperbolic space, without modifying the underlying training pipeline. We construct five-level visual hierarchies directly from slot attention masks and analyse whether hyperbolic geometry reveals latent hierarchical structure that remains invisible in Euclidean space. Integrating our pipeline with SPOT (images), VideoSAUR (video), and SlotContrast (video), we find that hyperbolic projection exposes a consistent scene-level to object-level organisation, where coarse slots occupy greater manifold depth than fine slots, which is absent in Euclidean space. We further identify a "curvature-task tradeoff": low curvature (c = 0.2) matches or outperforms Euclidean on parent slot retrieval, while moderate curvature (c = 0.5) achieves better inter-level separation. Together, these findings suggest that slot representations already encode latent hierarchy that hyperbolic geometry reveals, motivating end-to-end hyperbolic training as a natural next step. Code and models are available at github.com/NeeluMadan/HHS.
Chinese Translation
插槽注意力(Slot attention)作为一种强大的无监督对象中心学习框架,已被提出,将视觉场景分解为一小组紧凑的向量表示,称为插槽(slots),每个插槽捕捉一个独特的区域或对象。然而,这些插槽是在欧几里得空间中学习的,这并未为自然构成视觉场景的层次关系提供几何归纳偏置。在本研究中,我们提出了一种简单的后处理管道,将欧几里得插槽嵌入投影到双曲空间的洛伦兹双曲面上,而无需修改基础训练管道。我们直接从插槽注意力掩码构建五级视觉层次,并分析双曲几何是否揭示了在欧几里得空间中不可见的潜在层次结构。将我们的管道与SPOT(图像)、VideoSAUR(视频)和SlotContrast(视频)相结合,我们发现双曲投影揭示了一种一致的场景级到对象级的组织,其中粗插槽占据比细插槽更大的流形深度,而这种现象在欧几里得空间中并不存在。我们进一步识别出一种“曲率-任务权衡”:低曲率($c{=}0.2$)在父插槽检索中与欧几里得表现相当或更优,而中等曲率($c{=}0.5$)则实现了更好的层间分离。综合来看,这些发现表明插槽表示已经编码了潜在的层次结构,而双曲几何则揭示了这一点,促使端到端的双曲训练成为自然的下一步。代码和模型可在 [github.com/NeeluMadan/HHS](https://github.com/NeeluMadan/HHS) 获取。
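The post-hoc projection in this abstract is standard hyperbolic geometry, so it can be sketched directly. Below is a minimal PyTorch sketch, assuming the Lorentz model of curvature -c and the exponential map at the origin; the depth proxy is one natural reading of "manifold depth", not necessarily the paper's exact definition.

```python
import torch

def lorentz_expmap0(x, c=0.2):
    """Map a Euclidean embedding x (..., d) onto the Lorentz hyperboloid of
    curvature -c via the exponential map at the origin, giving (..., d+1)."""
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    time = torch.cosh(sqrt_c * norm) / sqrt_c
    space = torch.sinh(sqrt_c * norm) * x / (sqrt_c * norm)
    return torch.cat([time, space], dim=-1)

def depth_from_origin(z, c=0.2):
    """Geodesic distance to the origin, a natural proxy for the 'manifold
    depth' compared across coarse and fine slots."""
    sqrt_c = c ** 0.5
    minkowski = -z[..., 0] / sqrt_c  # <o, z>_L with origin o = (1/sqrt_c, 0)
    return torch.acosh((-c * minkowski).clamp_min(1.0 + 1e-7)) / sqrt_c
```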
cs.CV / 135 / 2603.14023

High-speed Imaging through Turbulence with Event-based Light Fields

通过湍流进行高速成像的事件驱动光场
Huang, Yu-Hsiang, Burner, Levi, Shah, Sachin, Qu, Ziyuan, Pediredla, Adithya, Metzler, Christopher A.
Abstract
This work introduces and demonstrates the first system capable of imaging fast-moving extended non-rigid objects through strong atmospheric turbulence at high frame rate. Event cameras are a novel sensing architecture capable of estimating high-speed imagery at thousands of frames per second. However, on their own, event cameras are unable to disambiguate scene motion from turbulence. In this work, we overcome this limitation using event-based light field cameras: by simultaneously capturing multiple views of a scene, event-based light field cameras and machine learning-based reconstruction algorithms are able to disambiguate motion-induced dynamics, which produce events that are strongly correlated across views, from turbulence-induced dynamics, which produce events that are weakly correlated across views. Tabletop experiments demonstrate that event-based light fields can overcome strong turbulence while imaging high-speed objects traveling at up to 16,000 pixels per second.
Chinese Translation
本研究介绍并展示了首个能够在高帧率下通过强大气湍流成像快速移动的扩展非刚性物体的系统。事件相机是一种新颖的传感架构,能够以每秒数千帧的速度估计高速图像。然而,单靠事件相机无法将场景运动与湍流区分开来。在本研究中,我们通过事件驱动光场相机克服了这一限制:通过同时捕捉场景的多个视角,事件驱动光场相机和基于机器学习的重建算法能够区分由运动引起的动态(这些动态在视角间产生高度相关的事件)与由湍流引起的动态(这些动态在视角间产生低度相关的事件)。桌面实验表明,事件驱动光场能够在成像高速物体(速度可达每秒16,000像素)时克服强湍流的影响。
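The motion/turbulence separation rests on one statistic: cross-view correlation of event activity. A minimal sketch of that principle follows (our illustration only; the paper uses learned reconstruction, and we assume rectified views and float (H, W) event-count maps):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def cross_view_motion_mask(counts_a, counts_b, win=7, thresh=0.5):
    """Local Pearson correlation of event-count maps from two (rectified)
    light-field views: motion-induced events correlate across views,
    turbulence-induced events largely do not."""
    mu_a, mu_b = uniform_filter(counts_a, win), uniform_filter(counts_b, win)
    cov = uniform_filter(counts_a * counts_b, win) - mu_a * mu_b
    var_a = uniform_filter(counts_a ** 2, win) - mu_a ** 2
    var_b = uniform_filter(counts_b ** 2, win) - mu_b ** 2
    rho = cov / np.sqrt(np.clip(var_a * var_b, 1e-12, None))
    return rho > thresh  # True where dynamics are shared across views
```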
cs.CV / 136 / 2603.14031

Intrinsic Tolerance in C-Arm Imaging: How Extrinsic Re-optimization Preserves 3D Reconstruction Accuracy

C臂成像中的内在容忍度:外在重新优化如何保持三维重建精度
Li, Lin, Aubert, Benjamin, Kemper, Paul, Plumley, Aric
Abstract
Purpose: C-arm fluoroscopy's 3D reconstruction relies on accurate intrinsic calibration, which is often challenging in clinical practice. This study ensures high-precision reconstruction accuracy by re-optimizing the extrinsic parameters to compensate for intrinsic calibration errors. Methods: We conducted both simulation and real-world experiments using five commercial C-arm systems. Intrinsic parameters were perturbed in controlled increments. Focal length was increased by 100 to 700 pixels (≈20 mm to 140 mm) and principal point by 20 to 200 pixels. For each perturbation, we (1) reconstructed 3D points from known phantom geometries, (2) re-estimated extrinsic poses using standard optimization, and (3) measured reconstruction and reprojection errors relative to ground truth. Results: Even with focal length errors up to 500 pixels (≈100 mm, assuming a nominal focal length of ~1000 mm), mean 3D reconstruction error remained under 0.2 mm. Larger focal length deviations (700 pixels) elevated error to only ≈0.3 mm. Principal point shifts up to 200 pixels introduced negligible reconstruction error once extrinsic parameters were re-optimized, with reprojection error increases below 0.5 pixels. Conclusion: Moderate errors in intrinsic calibration can be effectively mitigated by extrinsic re-optimization, preserving submillimeter 3D reconstruction accuracy. This intrinsic tolerance suggests a practical pathway to relax calibration precision requirements, thereby simplifying C-arm system setup and reducing clinical workflow burden without compromising performance.
Chinese Translation
目的:C臂荧光成像的三维重建依赖于准确的内在标定,这在临床实践中往往具有挑战性。本研究通过重新优化外在参数来补偿内在标定误差,从而确保高精度的重建准确性。 方法:我们使用五个商业C臂系统进行了模拟和实际实验。内在参数在受控增量下被扰动。焦距增加了100到700个像素(约20毫米到140毫米),主点偏移了20到200个像素。对于每个扰动,我们(1)从已知的幻影几何体重建三维点,(2)使用标准优化重新估计外在姿态,以及(3)相对于真实值测量重建和重投影误差。 结果:即使焦距误差高达500个像素(约100毫米,假设名义焦距为约1000毫米),平均三维重建误差仍保持在0.2毫米以下。更大的焦距偏差(700个像素)使误差仅上升至约0.3毫米。主点偏移高达200个像素在重新优化外在参数后引入的重建误差可以忽略不计,重投影误差的增加低于0.5个像素。 结论:内在标定的适度误差可以通过外在重新优化有效减轻,从而保持亚毫米级的三维重建精度。这种内在容忍度表明了一种实用的途径,可以放宽标定精度要求,从而简化C臂系统的设置,减少临床工作流程负担,而不影响性能。
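The core experiment is easy to reproduce in miniature with OpenCV: perturb the focal length, re-fit only the extrinsics with PnP, and measure the residual reprojection error. A hypothetical sketch (the array shapes and the perturbation protocol are our simplification of the study's setup):

```python
import numpy as np
import cv2

def residual_after_extrinsic_reopt(pts3d, pts2d, K_true, focal_err_px=500.0):
    """Perturb the focal length, re-fit only the extrinsic pose with PnP,
    and return the remaining mean reprojection error in pixels.
    pts3d: (N, 3) phantom bead points; pts2d: (N, 2) detected projections."""
    K = K_true.copy()
    K[0, 0] += focal_err_px
    K[1, 1] += focal_err_px
    ok, rvec, tvec = cv2.solvePnP(pts3d.astype(np.float64),
                                  pts2d.astype(np.float64), K, None)
    proj, _ = cv2.projectPoints(pts3d.astype(np.float64), rvec, tvec, K, None)
    return float(np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1).mean())
```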
cs.CV / 137 / 2603.14039

EyeWorld: A Generative World Model of Ocular State and Dynamics

EyeWorld:眼部状态与动态的生成世界模型
Gao, Ziyu, Wu, Xinyuan, Chen, Xiaolan, Liu, Zhuoran, Chen, Ruoyu, Liu, Bowen, Yan, Bingjie, Wang, Zhenhan, Jin, Kai, Yang, Jiancheng, Tham, Yih Chung, He, Mingguang, Shi, Danli
Abstract
Ophthalmic decision-making depends on subtle lesion-scale cues interpreted across multimodal imaging and over time, yet most medical foundation models remain static and degrade under modality and acquisition shifts. Here we introduce EyeWorld, a generative world model that conceptualizes the eye as a partially observed dynamical system grounded in clinical imaging. EyeWorld learns an observation-stable latent ocular state shared across modalities, unifying fine-grained parsing, structure-preserving cross-modality translation and quality-robust enhancement within a single framework. Longitudinal supervision further enables time-conditioned state transitions, supporting forecasting of clinically meaningful progression while preserving stable anatomy. By moving from static representation learning to explicit dynamical modeling, EyeWorld provides a unified approach to robust multimodal interpretation and prognosis-oriented simulation in medicine.
Chinese Translation
眼科决策依赖于跨多模态影像和时间解释的细微病变尺度线索,然而大多数医学基础模型仍然是静态的,并且在模态和获取变化下表现不佳。在此,我们介绍EyeWorld,一个生成世界模型,将眼睛概念化为一个基于临床影像的部分观察动态系统。EyeWorld学习一个跨模态共享的观察稳定潜在眼部状态,统一了细粒度解析、结构保持的跨模态转换和质量鲁棒增强于一个单一框架内。纵向监督进一步实现了时间条件下的状态转变,支持临床上有意义的进展预测,同时保持稳定的解剖结构。通过从静态表示学习转向显式动态建模,EyeWorld为医学中的鲁棒多模态解释和以预后为导向的模拟提供了统一的方法。
cs.CV / 138 / 2603.14052

A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

高效长视频推理的多智能体感知-行动联盟
Xu, Yichang, Liu, Gaowen, Kompella, Ramana Rao, Huang, Tiansheng, Hu, Sihao, Ilhan, Fatih, Tekin, Selim Furkan, Yahn, Zachary, Liu, Ling
Abstract
This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching-block based perception-action exploration), or to conclude by producing the final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
Chinese Translation
本文提出了一种多智能体感知-行动探索联盟,称为 A4VL,用于高效的长视频推理。A4VL 在多轮感知-行动探索循环中运作,选择 VLM(视频语言模型)智能体。在每一轮中,智能体团队通过感知探索和随后的行动探索进行视频问答(VideoQA)。在感知探索过程中,每个智能体学习从少量采样帧中提取与查询相关的感知线索,并执行基于线索的对齐,以找到与查询特定事件最相关的视频块。在行动探索过程中,A4VL 通过三个步骤进行视频推理:(1)每个智能体生成其初步答案及其推理,(2)所有智能体通过交叉评审和相关性排名相互评分,以及(3)根据是否达成令人满意的共识,决定是通过修剪(例如,过滤掉表现最差的智能体)和重新阶段(例如,基于新线索和匹配块的感知-行动探索)开始新一轮的感知-行动讨论,还是通过生成最终答案来结束。通过多轮感知-行动探索整合多智能体联盟,结合事件驱动的分区和线索引导的块对齐,使 A4VL 能够有效扩展到现实世界的长视频,同时保持高质量的视频推理。在五个流行的 VideoQA 基准上的评估结果表明,A4VL 超越了 18 个现有的代表性 VLM 和 10 个最近针对长视频推理优化的方法,同时实现了显著较低的推理延迟。我们的代码已发布在 https://github.com/git-disl/A4VL。
cs.CV / 139 / 2603.14062

TMPDiff: Temporal Mixed-Precision for Diffusion Models

TMPDiff:用于扩散模型的时间混合精度
Lewandowski, Basile, Kurz, Simon, Shankar, Aditya, Birke, Robert, Chen, Jian-Jia, Chen, Lydia Y.
Abstract
Diffusion models are the go-to method for Text-to-Image generation, but their iterative denoising process incurs high inference latency. Quantization reduces compute time by using lower bitwidths, but applies a fixed precision across all denoising timesteps, leaving an entire optimization axis unexplored. We propose TMPDiff, a temporal mixed-precision framework for diffusion models that assigns different numeric precision to different denoising timesteps. We hypothesize that quantization errors accumulate additively across timesteps, which we then validate experimentally. Based on our observations, we develop an adaptive bisection-based algorithm, which assigns per-step precisions with linear evaluation complexity, reducing an otherwise exponential search problem. Across four state-of-the-art diffusion models and three datasets, TMPDiff consistently outperforms uniform-precision baselines at matched speedup, achieving 10 to 20% improvement in perceptual quality. On FLUX.1-dev, TMPDiff achieves 90% SSIM relative to the full-precision model at a speedup of 2.5x over 16-bit inference.
Chinese Translation
扩散模型是文本到图像生成的首选方法,但其迭代去噪过程具有较高的推理延迟。量化通过使用较低的位宽来减少计算时间,但在所有去噪时间步上应用固定精度,导致一个完整的优化维度未被探索。我们提出了TMPDiff,一个用于扩散模型的时间混合精度框架,它为不同的去噪时间步分配不同的数值精度。我们假设量化误差在时间步之间是累加的,并通过实验进行了验证。基于我们的观察,我们开发了一种自适应二分法算法,该算法以线性评估复杂度分配每一步的精度,从而减少了原本呈指数增长的搜索问题。在四个最先进的扩散模型和三个数据集上,TMPDiff在匹配加速的情况下始终优于均匀精度基线,感知质量提高了10%到20%。在FLUX.1-dev上,TMPDiff在16位推理的加速比为2.5x的情况下,相对于全精度模型达到了90%的结构相似性指数(SSIM)。
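Under the paper's additive-error hypothesis, per-step precision assignment becomes a budgeted allocation problem. The sketch below is a greedy stand-in for illustration only; the paper's actual algorithm is an adaptive bisection search, and step_err is an assumed pre-measured per-step error table:

```python
def plan_step_precisions(T, step_err, budget, bits=(4, 8, 16)):
    """Greedy sketch under an additive-error model: start every denoising
    step at the lowest bitwidth, then repeatedly upgrade the step whose
    upgrade removes the most error, until the summed error fits the budget.
    step_err[t][b]: assumed pre-measured error of step t at bitwidth b."""
    plan = [bits[0]] * T
    total = sum(step_err[t][bits[0]] for t in range(T))
    while total > budget:
        candidates = [(step_err[t][plan[t]]
                       - step_err[t][bits[bits.index(plan[t]) + 1]], t)
                      for t in range(T) if plan[t] != bits[-1]]
        if not candidates:
            break  # everything is already at full precision
        gain, t = max(candidates)
        total -= gain
        plan[t] = bits[bits.index(plan[t]) + 1]
    return plan
```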
cs.CV / 140 / 2603.14073

MotionCFG: Boosting Motion Dynamics via Stochastic Concept Perturbation

MotionCFG:通过随机概念扰动增强运动动态
Kim, Byungjun, Um, Soobin, Ye, Jong Chul
Abstract
Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g. "static", "blurry"), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity; a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality. Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.
Chinese Translation
尽管最近在文本到视频(Text-to-Video,T2V)合成方面取得了进展,但生成高保真和动态的运动仍然是一个重大挑战。现有方法主要依赖于无分类器引导(Classifier-Free Guidance,CFG),通常使用明确的负面提示(例如“静态”、“模糊”)来抑制不希望出现的伪影。然而,这种明确的否定常常引入意想不到的语义偏差并扭曲对象的完整性;我们将这种现象定义为内容-运动漂移(Content-Motion Drift)。为了解决这个问题,我们提出了MotionCFG,一个通过将目标概念与其噪声扰动的对应物进行对比来增强运动动态的框架。具体而言,通过向概念嵌入中注入高斯噪声,MotionCFG创建了局部负锚点,封装了一系列广泛的互补次优运动变体。与明确的否定不同,这种方法促进了隐式的硬负样本挖掘,而不改变全局语义身份,从而允许对时间细节进行有针对性的精细化。结合限制干预于早期去噪步骤的分段引导计划,MotionCFG在最先进的T2V框架中始终如一地改善运动动态,同时计算开销极小,视觉质量损失也最小。此外,我们还证明这种噪声诱导的对比机制不仅对锐化运动轨迹有效,还能引导复杂的非线性概念,如精确的对象数量,这通常难以通过标准的基于文本的引导进行调节。
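The mechanism reduces to one change in the standard CFG formula: replace the explicit negative prompt with a noise-perturbed copy of the concept embedding. A minimal sketch (eps_net, guidance, and sigma are our placeholder names; the paper additionally restricts this guidance to early denoising steps via its piecewise schedule):

```python
import torch

def motion_cfg_step(eps_net, x_t, t, concept_emb, guidance=5.0, sigma=0.1):
    """One guidance step contrasting the concept with a noise-perturbed copy
    of its own embedding (a localized negative anchor) instead of an explicit
    negative prompt such as "static" or "blurry"."""
    neg_emb = concept_emb + sigma * torch.randn_like(concept_emb)
    eps_pos = eps_net(x_t, t, concept_emb)
    eps_neg = eps_net(x_t, t, neg_emb)
    return eps_neg + guidance * (eps_pos - eps_neg)
```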
cs.CV / 141 / 2603.14074

Self-Supervised Uncertainty Estimation For Super-Resolution of Satellite Images

自监督不确定性估计在卫星图像超分辨率中的应用
Zheng, Zhe, Dewil, Valéry, Arias, Pablo
Abstract
Super-resolution (SR) of satellite imagery is challenging due to the lack of paired low-/high-resolution data. Recent self-supervised SR methods overcome this limitation by exploiting the temporal redundancy in burst observations, but they lack a mechanism to quantify uncertainty in the reconstruction. In this work, we introduce a novel self-supervised loss that allows us to estimate uncertainty in image super-resolution without ever accessing the ground-truth high-resolution data. We adopt a decision-theoretic perspective and show that minimizing the corresponding Bayesian risk yields the posterior mean and variance as optimal estimators. We validate our approach on a synthetic SkySat L1B dataset and demonstrate that it produces calibrated uncertainty estimates comparable to supervised methods. Our work bridges self-supervised restoration with uncertainty quantification, providing a practical framework for uncertainty-aware image reconstruction.
Chinese Translation
卫星图像的超分辨率(SR)由于缺乏配对的低分辨率和高分辨率数据而具有挑战性。近期的自监督SR方法通过利用突发观测中的时间冗余克服了这一限制,但缺乏量化重建不确定性的机制。在本研究中,我们引入了一种新颖的自监督损失,能够在不接触真实高分辨率数据的情况下估计图像超分辨率中的不确定性。我们采用决策理论的视角,表明最小化相应的贝叶斯风险可以得到后验均值和方差作为最优估计量。我们在合成的SkySat L1B数据集上验证了我们的方法,并展示其产生的校准不确定性估计与监督方法相当。我们的工作将自监督恢复与不确定性量化相结合,为不确定性感知的图像重建提供了一个实用框架。
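The decision-theoretic claim is easiest to see in the supervised analogue: minimizing a heteroscedastic Gaussian negative log-likelihood makes the network output the posterior mean and variance. A sketch of that standard loss (the paper's contribution is a self-supervised counterpart that never touches high-resolution ground truth):

```python
import torch

def gaussian_nll(mu, log_var, y):
    """Heteroscedastic Gaussian negative log-likelihood. In expectation over
    y ~ p(y|x), the minimizer satisfies mu = E[y|x] and exp(log_var) =
    Var[y|x]: exactly the posterior mean / variance estimators the abstract
    derives from Bayesian risk minimization."""
    return 0.5 * (log_var + (y - mu) ** 2 * torch.exp(-log_var)).mean()
```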
cs.CV / 142 / 2603.14076

SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement

SGR-OCC:通过软门控提升和语义自适应几何精细化演化单目先验以实现具身3D占用预测
Guo, Yiran, Mentasti, Simone, Jin, Xiaofeng, Frosi, Matteo, Matteucci, Matteo
Abstract
3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation that causes "feature bleeding" at object boundaries, and the "cold start" instability where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To perfectly inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69% respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
Chinese Translation
3D语义占用预测是具身人工智能的基石,使得代理能够从单目视频流中逐步感知密集场景几何和语义。然而,当前的在线框架面临两个关键瓶颈:单目估计固有的深度模糊性导致物体边界处的“特征渗漏”,以及“冷启动”不稳定性,在早期训练阶段未初始化的时间融合层扭曲高质量的空间先验。本文提出了SGR-OCC(软门控和光线精细化占用),这是一个以“继承与演化”理念驱动的统一框架。为了完美继承单目空间专业知识,我们引入了一个软门控特征提升器,该提升器通过高斯门显式建模深度不确定性,以概率方式抑制背景噪声。此外,动态光线约束锚点精细化模块将复杂的3D位移搜索简化为沿相机光线的高效1D深度修正,确保与物理表面的亚体素一致性。为了确保向时间一致性的稳定演化,我们采用了配备身份初始化融合的两阶段渐进训练策略,有效解决了冷启动问题,并保护空间先验免受早期噪声梯度的影响。在EmbodiedOcc-ScanNet和Occ-ScanNet基准上的大量实验表明,SGR-OCC达到了最先进的性能。在局部预测任务中,SGR-OCC实现了58.55% 的完成IoU和49.89% 的语义mIoU,分别超越了之前的最佳方法EmbodiedOcc++ 3.65% 和3.69%。在具有挑战性的具身预测任务中,我们的模型达到了55.72% 的SC-IoU和46.22% 的mIoU。定性结果进一步确认了我们模型在复杂室内环境中保持结构完整性和边界清晰度的卓越能力。
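The Soft-Gating Feature Lifter can be sketched as a Gaussian weighting over depth bins. The following is our hypothetical reading of the mechanism, not the released implementation (all shapes are assumptions):

```python
import torch

def soft_gate_lift(feat2d, depth_mu, depth_sigma, depth_bins):
    """Weight each depth bin by a Gaussian centred on the monocular depth
    estimate, so uncertain depths spread (and suppress) the lifted features
    instead of committing to a single hard bin.
    feat2d: (B, C, H, W); depth_mu/depth_sigma: (B, 1, H, W);
    depth_bins: (D,) bin centres in metres."""
    d = depth_bins.view(1, -1, 1, 1)                                # (1, D, 1, 1)
    gate = torch.exp(-0.5 * ((d - depth_mu)
                             / depth_sigma.clamp_min(1e-3)) ** 2)   # (B, D, H, W)
    gate = gate / gate.sum(dim=1, keepdim=True)
    return feat2d.unsqueeze(2) * gate.unsqueeze(1)                  # (B, C, D, H, W)
```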
cs.CV / 143 / 2603.14077

Enhancing Eye Feature Estimation from Event Data Streams through Adaptive Inference State Space Modeling

通过自适应推理状态空间建模增强事件数据流中的眼部特征估计
Nguyen, Viet Dung, Ghorbaninejad, Mobina, Ma, Chengyi, Bailey, Reynold, Diaz, Gabriel J., Fix, Alexander, Suess, Ryan J., Ororbia, Alexander
Abstract
Eye feature extraction from event-based data streams can be performed efficiently and with low energy consumption, offering great utility to real-world eye tracking pipelines. However, few eye feature extractors are designed to handle sudden changes in event density caused by the changes between gaze behaviors that vary in their kinematics, leading to degraded prediction performance. In this work, we address this problem by introducing the adaptive inference state space model (AISSM), a novel architecture for feature extraction that is capable of dynamically adjusting the relative weight placed on current versus recent information. This relative weighting is determined via estimates of the signal-to-noise ratio and event density produced by a complementary dynamic confidence network. Lastly, we craft and evaluate a novel learning technique that improves training efficiency. Experimental results demonstrate that the AISSM system outperforms state-of-the-art models for event-based eye feature extraction.
Chinese Translation
从基于事件的数据流中提取眼部特征可以高效且低能耗地进行,为实际眼动追踪管道提供了巨大的实用价值。然而,目前很少有眼部特征提取器能够处理由于不同运动学的注视行为变化引起的事件密度突变,从而导致预测性能下降。针对这一问题,本文引入了自适应推理状态空间模型(AISSM),一种新颖的特征提取架构,能够动态调整当前信息与近期信息的相对权重。该相对权重的确定基于由一个补充的动态置信网络生成的信噪比和事件密度的估计。最后,我们设计并评估了一种新颖的学习技术,以提高训练效率。实验结果表明,AISSM系统在基于事件的眼部特征提取方面优于最先进的模型。
cs.CV / 144 / 2603.14086

Effective Feature Learning for 3D Medical Registration via Domain-Specialized DINO Pretraining

通过领域专用的 DINO 预训练实现 3D 医学配准的有效特征学习
Kats, Eytan, Heinrich, Mattias P.
Abstract
Medical image registration is a critical component of clinical imaging workflows, enabling accurate longitudinal assessment, multi-modal data fusion, and image-guided interventions. Intensity-based approaches often struggle with interscanner variability and complex anatomical deformations, whereas feature-based methods offer improved robustness by leveraging semantically informed representations. In this work, we investigate DINO-style self-supervised pretraining directly on 3D medical imaging data, aiming to learn dense volumetric features well suited for deformable registration. We assess the resulting representations on a challenging interpatient abdominal registration task across both MRI and CT modalities. Our domain-specialized pretraining outperforms the DINOv2 model trained on a large-scale collection of natural images, while requiring substantially lower computational resources at inference time. Moreover, it surpasses established registration models under out-of-domain evaluation, demonstrating the value of task-agnostic yet medical imaging-focused pretraining for robust and efficient 3D image registration.
Chinese Translation
医学图像配准是临床成像工作流程中的关键组成部分,能够实现准确的纵向评估、多模态数据融合和图像引导干预。基于强度的方法常常在扫描仪间变异性和复杂的解剖变形方面遇到困难,而基于特征的方法则通过利用语义信息丰富的表示,提供了更好的鲁棒性。在本研究中,我们直接在 3D 医学成像数据上研究 DINO 风格的自监督预训练,旨在学习适合可变形配准的密集体积特征。我们在 MRI 和 CT 模态下评估了在具有挑战性的患者间腹部配准任务中获得的表示。我们的领域专用预训练在性能上优于在大规模自然图像集合上训练的 DINOv2 模型,同时在推理时所需的计算资源显著更低。此外,它在域外评估中超越了已建立的配准模型,展示了针对医学成像的任务无关预训练在实现鲁棒和高效的 3D 图像配准中的价值。
cs.CV / 145 / 2603.14112

Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution

重新审视空间-语义引导超分辨率中的感知-失真权衡
Wang, Dan, Sun, Haiyan, Du, Shan, Wang, Z. Jane, An, Zhaochong, Belongie, Serge, Cui, Xinrui
Abstract
Image super-resolution (SR) aims to reconstruct high resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle with realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance with a multi-encoder design and semantic degradation constraints unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing both realistic and faithful restorations.
Chinese Translation
图像超分辨率(SR)旨在重建具有高感知质量和低失真的高分辨率图像,但在本质上受到感知-失真权衡的限制。基于生成对抗网络(GAN)的超分辨率方法减少了失真,但在真实细腻纹理的重建上仍然面临挑战,而基于扩散的方法则合成了丰富的细节,但往往偏离输入,产生虚幻结构并降低保真度。这种紧张关系提出了一个关键挑战:如何在不牺牲保真度的情况下利用扩散模型强大的生成先验。为了解决这个问题,我们提出了SpaSemSR,一种具有两种互补引导的空间-语义引导扩散框架。首先,基于空间的文本引导将对象级空间线索与语义提示相结合,协调文本和视觉结构以减少失真。其次,具有多编码器设计和语义降级约束的语义增强视觉引导统一了多模态语义先验,在严重降级的情况下提高了感知现实性。这些互补引导通过空间-语义注意机制自适应地融合到扩散过程中,抑制失真和虚幻,同时保留扩散模型的优势。在多个基准上的广泛实验表明,SpaSemSR实现了优越的感知-失真平衡,生成了既真实又忠实的重建结果。
cs.CV / 146 / 2603.14117

Improving Visual Reasoning with Iterative Evidence Refinement

通过迭代证据精炼提升视觉推理能力
Shi, Zeru, Mei, Kai, Quan, Yihao, Metaxas, Dimitris N., Tang, Ruixiang
Abstract
Vision language models (VLMs) are increasingly capable of reasoning over images, but robust visual reasoning often requires re-grounding intermediate steps in the underlying visual evidence. Recent approaches typically rely on external image operations such as zooming or cropping to re-access fine-grained details during inference, which requires additional image re-encoding and can disrupt the reasoning trajectory. We argue that VLMs already provide strong internal signals for identifying and reusing visual evidence, and that these signals can be directly leveraged to support image-grounded reasoning. Motivated by this insight, we propose an end-to-end self-revisit framework, SIEVE, that trains models to re-engage image evidence through internal representations. SIEVE automatically extracts embeddings of salient image regions and injects them into the reasoning chain when additional grounding is needed, enabling later steps to condition on relevant visual cues without external tool calls or re-encoding. We use reinforcement learning to teach the model when to trigger visual revisiting and which region embeddings to retrieve and insert during the reasoning process. Experiments on multiple visual reasoning benchmarks, together with perception, reasoning, and hallucination evaluations, show that SIEVE yields consistent gains, improving performance by 8 percent on average across several benchmarks.
Chinese Translation
视觉语言模型(VLMs)在图像推理方面的能力日益增强,但稳健的视觉推理通常需要在基础视觉证据中重新定位中间步骤。最近的方法通常依赖于外部图像操作,如缩放或裁剪,以在推理过程中重新访问细粒度细节,这需要额外的图像重新编码,并可能打断推理轨迹。我们认为,VLMs 已经提供了强大的内部信号来识别和重用视觉证据,这些信号可以直接用于支持基于图像的推理。基于这一见解,我们提出了一种端到端的自我重访框架 SIEVE,该框架训练模型通过内部表示重新利用图像证据。SIEVE 自动提取显著图像区域的嵌入,并在需要额外定位时将其注入推理链中,使后续步骤能够在不调用外部工具或重新编码的情况下,基于相关视觉线索进行条件推理。我们使用强化学习来教导模型何时触发视觉重访,以及在推理过程中检索和插入哪些区域嵌入。在多个视觉推理基准上的实验,以及感知、推理和幻觉评估表明,SIEVE 一致地带来了性能提升,在多个基准上平均提高了 8%。
cs.CV / 147 / 2603.14120

Low-Field Magnetic Resonance Image Quality Enhancement using Undersampled k-Space and Out-of-Distribution Generalisation

基于欠采样k空间和分布外泛化的低场磁共振成像质量增强
Anyimadu, Daniel Tweneboah, Abdelsamea, Mohammed M., Eldaly, Ahmed Karam
Abstract
Low-field magnetic resonance imaging (MRI) offers affordable access to diagnostic imaging but faces challenges such as prolonged acquisition times and reduced image quality. Although accelerated imaging via k-space undersampling helps reduce scan time, image quality enhancement methods often rely on spatial-domain postprocessing. Deep learning has achieved state-of-the-art results in both domains. However, most models are trained and evaluated using in-distribution (InD) data, creating a significant gap in understanding model performance when tested using out-of-distribution (OOD) data. To address these issues, we propose a novel framework that reconstructs high-field-like MR images directly from undersampled low-field MRI k-space, quantifies the impact of reduced sampling, and evaluates the generalisability of the model using OOD data. Our approach utilises a k-space dual channel U-Net to jointly process the real and imaginary components of undersampled k-space, restoring missing frequency content, and incorporates an ensemble strategy to generate uncertainty maps. Experiments on low-field brain MRI demonstrate that our k-space-driven image quality enhancement outperforms its spatial-domain counterpart and other state-of-the-art baselines, achieving image quality comparable to full high-field k-space acquisitions using OOD data. To the best of our knowledge, this work is among the first to combine low-field MR image reconstruction, quality enhancement using undersampled k-space, and uncertainty quantification within a unified framework.
Chinese Translation
低场磁共振成像(MRI)提供了经济实惠的诊断成像手段,但面临着如扫描时间延长和图像质量降低等挑战。尽管通过k空间欠采样加速成像有助于减少扫描时间,但图像质量增强方法通常依赖于空间域后处理。深度学习在这两个领域都取得了最先进的成果。然而,大多数模型是在分布内(InD)数据上进行训练和评估的,这在使用分布外(OOD)数据进行测试时造成了对模型性能理解的显著差距。为了解决这些问题,我们提出了一种新颖的框架,直接从欠采样的低场MRI k空间重建高场类似的MR图像,量化采样减少的影响,并使用OOD评估模型的泛化能力。我们的方法利用k空间双通道U-Net共同处理欠采样k空间的实部和虚部,恢复缺失的频率内容,并结合集成策略生成不确定性图。对低场脑MRI的实验表明,我们的k空间驱动的图像质量增强优于对应的空间域方法和其他最先进的基线,使用OOD数据实现了与完整高场k空间采集相当的图像质量。据我们所知,这项工作是首次在统一框架内结合低场MR图像重建、基于欠采样k空间的质量增强和不确定性量化。
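The "dual channel" design common to this paper and the next one is simply a real/imaginary stacking of complex k-space so a standard U-Net can process it. A minimal sketch, assuming centered 2D k-space tensors of shape (B, H, W):

```python
import torch

def to_dual_channel(kspace):
    """Stack real and imaginary parts of complex undersampled k-space as two
    input channels for a network operating in the frequency domain."""
    return torch.stack([kspace.real, kspace.imag], dim=1)  # (B, 2, H, W)

def from_dual_channel(x):
    """Recombine the two channels and go back to image space."""
    k = torch.complex(x[:, 0], x[:, 1])
    img = torch.fft.ifft2(torch.fft.ifftshift(k, dim=(-2, -1)))
    return img.abs()  # magnitude image after the inverse FFT
```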
cs.CV / 148 / 2603.14125

Low-Field Magnetic Resonance Image Enhancement using Undersampled k-Space

基于欠采样k空间的低场磁共振成像增强
Anyimadu, Daniel Tweneboah, Abdalla, Mohammed, Abdelsamea, Mohammed M., Eldaly, Ahmed Karam
Abstract
Low-field magnetic resonance imaging (MRI) offers a cost-effective alternative for medical imaging in resource-limited settings. However, its widespread adoption is hindered by two key challenges: prolonged scan times and reduced image quality. Accelerated acquisition can be achieved using k-space undersampling, while image enhancement traditionally relies on spatial-domain postprocessing. In this work, we propose a novel deep learning framework based on a U-Net variant that operates directly in k-space to super-resolve low-field MR images using undersampled data while quantifying the impact of reduced k-space sampling. Unlike conventional approaches that treat image super-resolution as a postprocessing step following image reconstruction from undersampled k-space, our unified model integrates both processes, leveraging k-space information to achieve superior image fidelity. Extensive experiments on synthetic and real low-field brain MRI datasets demonstrate that k-space-driven image super-resolution outperforms conventional spatial-domain counterparts. Furthermore, our results show that undersampled k-space reconstructions achieve comparable quality to full k-space acquisitions, enabling substantial scan-time acceleration without compromising diagnostic utility.
Chinese Translation
低场磁共振成像(MRI)为资源有限的环境提供了一种经济有效的医学成像替代方案。然而,其广泛应用受到两个主要挑战的制约:扫描时间延长和图像质量降低。通过使用k空间欠采样可以实现加速采集,而图像增强传统上依赖于空间域后处理。在本研究中,我们提出了一种基于U-Net变体的新颖深度学习框架,该框架直接在k空间中操作,利用欠采样数据对低场MR图像进行超分辨率重建,同时量化降低k空间采样的影响。与传统方法将图像超分辨率视为从欠采样k空间重建图像后的后处理步骤不同,我们的统一模型将这两个过程整合在一起,利用k空间信息实现更高的图像保真度。在合成和真实低场脑MRI数据集上的大量实验表明,基于k空间驱动的图像超分辨率优于传统的空间域方法。此外,我们的结果表明,欠采样k空间重建的质量可与完整k空间采集相媲美,从而在不影响诊断效用的情况下实现显著的扫描时间加速。
cs.CV / 149 / 2603.14127

Implementation and discussion of the Pith Estimation on Rough Log End Images using Local Fourier Spectrum Analysis method

基于局部傅里叶谱分析方法的粗糙木材端面图像中的心材估计的实现与讨论
Marichal, Henry, Passarella, Diego, Randall, Gregory
Abstract
In this article, we analyze and propose a Python implementation of the method "Pith Estimation on Rough Log End images using Local Fourier Spectrum Analysis", by Rudolf Schraml and Andreas Uhl. The algorithm is tested on two datasets.
Chinese Translation
在本文中,我们分析并提出了“基于局部傅里叶谱分析的粗糙木材端面图像心材估计”方法的Python实现,该方法由Rudolf Schraml和Andreas Uhl提出。该算法在两个数据集上进行了测试。
cs.CV / 150 / 2603.14128

Diffusion Reinforcement Learning via Centered Reward Distillation

通过中心化奖励蒸馏的扩散强化学习
Zhu, Yuanzhi, Wang, Xi, Lathuilière, Stéphane, Kalogeiton, Vicky
Abstract
Diffusion and flow models achieve State-Of-The-Art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
Chinese Translation
扩散和流模型在生成性能上达到了最先进水平(SOTA),然而许多在实践中重要的行为,如细粒度的提示保真度、组合正确性和文本渲染,往往在得分或流匹配的预训练目标中被弱化指定。使用外部黑箱奖励进行强化学习(RL)微调是一个自然的解决方案,但扩散强化学习往往不够稳健。基于轨迹的方法会产生高内存成本和高方差的梯度估计;前向过程方法收敛较快,但可能会遭遇分布漂移,从而导致奖励操控。在本研究中,我们提出了中心化奖励蒸馏(Centered Reward Distillation, CRD),这是一个基于前向过程微调的KL正则化奖励最大化推导出的扩散强化学习框架。关键的见解是,在提示内中心化下,难以处理的归一化常数会相互抵消,从而产生一个良定义的奖励匹配目标。为了实现可靠的文本到图像微调,我们引入了明确控制分布漂移的技术:(i) 将采样器与移动参考解耦,以防止比率信号崩溃;(ii) KL锚定到一个基于CFG的预训练模型,以控制长期漂移并与预训练模型的推理时语义对齐;(iii) 奖励自适应的KL强度,以在大KL正则化下加速早期学习,同时减少后期对奖励模型漏洞的利用。基于GenEval和OCR奖励的文本到图像后训练实验表明,CRD在奖励优化结果上达到了具有竞争力的SOTA水平,且收敛速度快,减少了奖励操控,这在未见的偏好指标上得到了验证。
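The within-prompt centering argument fits in a few lines: at the KL-regularized optimum, log(p*/p_ref) = r/beta - log Z(prompt), and log Z is constant within a prompt. A sketch of just the centering step (the full CRD objective is not reproduced here):

```python
import torch

def within_prompt_centered(rewards):
    """rewards: (num_prompts, K) rewards for K samples drawn per prompt.
    Since log Z depends only on the prompt, subtracting the per-prompt mean
    removes the intractable normalizing constant, making a reward-matching
    objective against centered log-ratios well posed."""
    return rewards - rewards.mean(dim=1, keepdim=True)
```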
cs.CV / 151 / 2603.14132

DualSwinFusionSeg: Multimodal Martian Landslide Segmentation via Dual Swin Transformer with Multi-Scale Fusion and UNet++

DualSwinFusionSeg:通过双重Swin Transformer与多尺度融合和UNet++进行多模态火星滑坡分割
Kabir, Shahriar, Ehsan, Abdullah Muhammed Amimul, Rifti, Istiak Ahmmed, Reza, Md Kaykobad
Abstract
Automated segmentation of Martian landslides, particularly in tectonically active regions such as Valles Marineris, is important for planetary geology, hazard assessment, and future robotic exploration. However, detecting landslides from planetary imagery is challenging due to the heterogeneous nature of available sensing modalities and the limited number of labeled samples. Each observation combines RGB imagery with geophysical measurements such as digital elevation models, slope maps, thermal inertia, and contextual grayscale imagery, which differ significantly in resolution and statistical properties. To address these challenges, we propose DualSwinFusionSeg, a multimodal segmentation architecture that separates modality-specific feature extraction and performs multi-scale cross-modal fusion. The model employs two parallel Swin Transformer V2 encoders to independently process RGB and auxiliary geophysical inputs, producing hierarchical feature representations. Corresponding features from the two streams are fused at multiple scales and decoded using a UNet++ decoder with dense nested skip connections to preserve fine boundary details. Extensive ablation studies evaluate modality contributions, loss functions, decoder architectures, and fusion strategies. Experiments on the MMLSv2 dataset from the PBVS 2026 Mars-LS Challenge show that modality-specific encoders and simple concatenation-based fusion improve segmentation accuracy under limited training data. The final model achieves 0.867 mIoU and 0.905 F1 on the development benchmark and 0.783 mIoU on the held-out test set, demonstrating strong performance for multimodal planetary surface segmentation.
Chinese Translation
火星滑坡的自动分割,特别是在如瓦列斯·马里内里斯等构造活动区域,对于行星地质、灾害评估和未来的机器人探索至关重要。然而,由于可用传感模态的异质性和标记样本数量有限,从行星影像中检测滑坡具有挑战性。每次观测结合了RGB影像与地球物理测量数据,如数字高程模型、坡度图、热惯量和上下文灰度影像,这些数据在分辨率和统计特性上存在显著差异。为了解决这些挑战,我们提出了DualSwinFusionSeg,这是一种多模态分割架构,能够分离模态特定的特征提取,并进行多尺度跨模态融合。该模型采用两个并行的Swin Transformer V2编码器,独立处理RGB和辅助地球物理输入,生成分层特征表示。来自两个流的对应特征在多个尺度上进行融合,并使用具有密集嵌套跳跃连接的UNet++解码器进行解码,以保留细微的边界细节。广泛的消融研究评估了模态贡献、损失函数、解码器架构和融合策略。在PBVS 2026 Mars-LS Challenge的MMLSv2数据集上的实验表明,模态特定的编码器和基于简单拼接的融合在有限的训练数据下提高了分割精度。最终模型在开发基准上达到0.867 mIoU和0.905 F1,在保留的测试集上达到0.783 mIoU,展示了在多模态行星表面分割方面的强大性能。
cs.CV / 152 / 2603.14150

CIPHER: Culvert Inspection through Pairwise Frame Selection and High-Efficiency Reconstruction

CIPHER:通过成对帧选择和高效重建进行涵洞检查
Lee, Seoyoung, Wang, Zhangyang
Abstract
Automated culvert inspection systems can help increase the safety and efficiency of flood management operations. As a key step to this system, we present an efficient RGB-based 3D reconstruction pipeline for culvert-like structures in visually repetitive environments. Our approach first selects informative frame pairs to maximize viewpoint diversity while ensuring valid correspondence matching using a plug-and-play module, followed by a reconstruction model that simultaneously estimates RGB appearance, geometry, and semantics in real-time. Experiments demonstrate that our method effectively generates accurate 3D reconstructions and depth maps, enhancing culvert inspection efficiency with minimal human intervention.
Chinese Translation
自动化涵洞检查系统可以帮助提高洪水管理操作的安全性和效率。作为该系统的关键步骤,我们提出了一种高效的基于RGB的3D重建管道,用于视觉重复环境中的涵洞类结构。我们的方法首先选择信息丰富的帧对,以在确保有效对应匹配的同时最大化视角多样性,利用即插即用模块,随后通过重建模型实时估计RGB外观、几何和语义。实验表明,我们的方法有效生成准确的3D重建和深度图,以最小的人为干预提升涵洞检查效率。
cs.CV / 153 / 2603.14151

Seeing Through the PRISM: Compound & Controllable Restoration of Scientific Images

透视PRISM:科学图像的复合与可控恢复
Kurinchi-Vendhan, Rupa, Sharma, Pratyusha, Torralba, Antonio, Beery, Sara
Abstract
Scientific and environmental imagery often suffer from complex mixtures of noise related to the sensor and the environment. Existing restoration methods typically remove one degradation at a time, leading to cascading artifacts, overcorrection, or loss of meaningful signal. In scientific applications, restoration must be able to simultaneously handle compound degradations while allowing experts to selectively remove subsets of distortions without erasing important features. To address these challenges, we present PRISM (Precision Restoration with Interpretable Separation of Mixtures). PRISM is a prompted conditional diffusion framework which combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective that aligns primitives and their mixtures in the latent space. This compositional geometry enables high-fidelity joint removal of overlapping distortions while also allowing flexible, targeted fixes through natural language prompts. Across microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms state-of-the-art baselines on complex compound degradations, including zero-shot mixtures not seen during training. Importantly, we show that selective restoration significantly improves downstream scientific accuracy in several domains over standard "black-box" restoration. These results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.
Chinese Translation
科学和环境图像常常受到与传感器和环境相关的复杂噪声混合的影响。现有的恢复方法通常一次性去除一种退化,导致级联伪影、过度校正或有意义信号的丧失。在科学应用中,恢复必须能够同时处理复合退化,同时允许专家选择性地去除失真子集,而不抹去重要特征。为了解决这些挑战,我们提出了PRISM(Precision Restoration with Interpretable Separation of Mixtures)。PRISM是一个提示条件扩散框架,它结合了对混合退化的复合感知监督和一个加权对比解耦目标,该目标在潜在空间中对齐原始元素及其混合。这种组合几何使得高保真地联合去除重叠失真成为可能,同时也允许通过自然语言提示进行灵活、针对性的修复。在显微镜、野生动物监测、遥感和城市天气数据集上,PRISM在复杂的复合退化上超越了最先进的基准,包括训练期间未见的零样本混合。重要的是,我们展示了选择性恢复在多个领域显著提高了下游科学准确性,优于标准的“黑箱”恢复。这些结果确立了PRISM作为一个可推广和可控的高保真恢复框架,适用于科学效用优先的领域。
cs.CV / 154 / 2603.14152

SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation

SK-适配器:基于骨架的原生3D生成结构控制
Wang, Anbang, Ao, Yuzhuo, Wu, Shangzhe, Tang, Chi-Keung
Abstract
Native 3D generative models have achieved remarkable fidelity and speed, yet they suffer from a critical limitation: the inability to prescribe precise structural articulations, as precise structural control within the native 3D space remains underexplored. This paper proposes SK-Adapter, a simple yet highly efficient and effective framework that unlocks precise skeletal manipulation for native 3D generation. Moving beyond text or image prompts, which can be ambiguous for precise structure, we treat the 3D skeleton as a first-class control signal. SK-Adapter is a lightweight structural adapter network that encodes joint coordinates and topology into learnable tokens, which are injected into the frozen 3D generation backbone via cross-attention. This smart design allows the model to not only effectively "attend" to specific 3D structural constraints but also preserve its original generative priors. To bridge the data gap, we contribute the Objaverse-TMS dataset, a large-scale dataset of 24k text-mesh-skeleton pairs. Extensive experiments confirm that our method achieves robust structural control while preserving the geometry and texture quality of the foundation model, significantly outperforming existing baselines. Furthermore, we extend this capability to local 3D editing, enabling region-specific editing of existing assets with skeletal guidance, which is unattainable by previous methods. Project Page: https://sk-adapter.github.io/
Chinese Translation
原生3D生成模型已实现显著的逼真度和速度,但它们却面临一个关键限制:无法精确规定结构关节,而在原生3D空间内的精确结构控制仍然未得到充分探索。本文提出SK-适配器,一个简单且高效的框架,能够解锁原生3D生成中的精确骨骼操作。我们将3D骨架视为一种主要的控制信号,超越了可能模糊的文本或图像提示。SK-适配器是一个轻量级的结构适配网络,它将关节坐标和拓扑编码成可学习的标记,并通过交叉注意机制注入到冻结的3D生成主干中。这种智能设计不仅使模型能够有效“关注”特定的3D结构约束,还能保持其原有的生成先验。为了填补数据差距,我们提供了Objaverse-TMS数据集,这是一个包含24k文本-网格-骨架对的大规模数据集。广泛的实验证明,我们的方法在保持基础模型的几何形状和纹理质量的同时,能够实现强大的结构控制,显著优于现有基准。此外,我们将这一能力扩展到局部3D编辑,使得在骨骼引导下对现有资产进行区域特定编辑成为可能,而这一点是之前的方法无法实现的。项目页面:https://sk-adapter.github.io/
cs.CV / 155 / 2603.14153

Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories

Garments2Look:用于高保真服装级虚拟试穿的多参考数据集
Hu, Junyao, Cheng, Zhongwei, Wong, Waikeung, Zou, Xingxing
Abstract
Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, which remain beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3-12 reference garment images (average 4.48), a model image wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline. It involves heuristically constructing outfit lists before generating try-on results, with the entire process subjected to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editing models to establish baselines. Results show current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.
Chinese Translation
虚拟试穿(VTON)在单件服装可视化方面取得了进展,但现实世界的时尚中心在于包含多件服装、配饰、细粒度类别、分层和多样化风格的完整服装,这仍然超出了当前VTON系统的能力。现有数据集在类别上有限,缺乏服装多样性。我们推出了Garments2Look,这是第一个大规模多模态数据集,专注于服装级VTON,包含80,000对多件服装对应单一造型的配对,涵盖40个主要类别和300多个细粒度子类别。每对配对包括一套服装及其3-12张参考服装图像(平均4.48),一张穿着该服装的模特图像,以及详细的物品和试穿文本注释。为了平衡真实性和多样性,我们提出了一种合成管道。该管道在生成试穿结果之前,通过启发式构建服装列表,并对整个过程进行严格的自动过滤和人工验证,以确保数据质量。为了探测任务难度,我们调整了最先进的VTON方法和通用图像编辑模型,以建立基线。结果表明,当前方法在无缝试穿完整服装和推断正确的分层和造型方面存在困难,导致错位和伪影。
cs.CV / 156 / 2603.14176

BluRef: Unsupervised Image Deblurring with Dense-Matching References

BluRef:一种基于密集匹配参考的无监督图像去模糊方法
Pham, Bang-Dang, Tran, Anh, Pham, Cuong, Hoai, Minh
Abstract
This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.
Chinese Translation
本文介绍了一种新颖的无监督图像去模糊方法,该方法利用简单的训练数据收集过程,从而增强了去模糊方法的适用性和有效性。我们的技术不需要精确配对的模糊图像和对应的清晰图像;相反,它使用相似场景的未配对模糊图像和清晰图像,通过利用密集匹配模型识别模糊图像与参考清晰图像之间的对应关系,生成伪真实数据。由于训练数据收集过程的简单性,我们的方法不依赖于现有的配对训练数据或预训练网络,使其更适应各种场景,并适用于不同规模的网络,包括为低资源设备设计的网络。我们证明了这一新方法实现了最先进的性能,标志着图像去模糊领域的重大进展。
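The pseudo-ground-truth step boils down to warping a sharp reference into the blurry view using dense correspondences. A hypothetical sketch follows, assuming the dense matching model's output is already expressed as a backward flow field of shape (B, 2, H, W):

```python
import torch
import torch.nn.functional as F

def warp_reference(sharp_ref, flow):
    """Warp a sharp reference image (B, C, H, W) to the blurry view using a
    backward flow field, producing a pseudo ground truth for training."""
    B, _, H, W = sharp_ref.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xx, yy], dim=-1).float().to(sharp_ref)  # (H, W, 2)
    coords = grid + flow.permute(0, 2, 3, 1)                    # (B, H, W, 2)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects
    coords[..., 0] = coords[..., 0] / (W - 1) * 2 - 1
    coords[..., 1] = coords[..., 1] / (H - 1) * 2 - 1
    return F.grid_sample(sharp_ref, coords, align_corners=True)
```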
cs.CV / 157 / 2603.14184

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

更深的思考,更弱的目标:理解和缓解多模态大型语言模型在推理过程中感知障碍
Peng, Ruiying, Wu, Xueyu, Lei, Jing, Hou, Lu, Ma, Yuanzheng, Li, Xiaohui
Abstract
Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.
Chinese Translation
多模态大型语言模型(MLLMs)在延长推理模式下常常遭遇感知障碍,尤其是在视觉问答(VQA)任务中。我们识别出注意力分散是其根本原因:在多步骤推理过程中,模型的视觉注意力变得分散,偏离与问题相关的区域,实际上“失去了对视觉输入的关注”。为了更好地理解这一现象,我们分析了MLLMs的注意力图,观察到推理提示显著减少了对回答问题至关重要区域的关注。我们进一步发现,模型对图像标记的整体注意力与其在图像内的注意力空间分散性之间存在强相关性。基于这一洞察,我们提出了一种无训练的视觉区域引导注意力(Visual Region-Guided Attention, VRGA)框架,该框架基于熵-聚焦标准选择视觉头并重新加权其注意力,有效引导模型在推理过程中关注与问题相关的区域。在视觉-语言基准上的大量实验表明,我们的方法有效缓解了感知退化,提高了视觉定位和推理准确性,同时提供了对MLLMs如何处理视觉信息的可解释性见解。
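The entropy-focus criterion for picking visual heads is simple enough to sketch. This is our hypothetical reading (attention rows assumed normalized over image tokens; the subsequent reweighting step is omitted):

```python
import torch

def select_visual_heads(attn, k=8):
    """Score each head by the mean entropy of its attention over image
    tokens and keep the k most focused (lowest-entropy) heads.
    attn: (heads, queries, image_tokens), each row summing to 1."""
    p = attn.clamp_min(1e-9)
    entropy = -(p * p.log()).sum(dim=-1).mean(dim=-1)   # (heads,)
    return entropy.topk(k, largest=False).indices        # low entropy = focused
```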
cs.CV / 158 / 2603.14186

Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

新兴单步生成模型与多步扩散及流模型的公平基准测试
Ravishankar, Advaith, Liu, Serena, Wang, Mingyang, Zhou, Todd, Zhou, Jeffrey, Sharma, Arnav, Hu, Ziling, Das, Léopold, Sobirov, Abdulaziz, Siddique, Faizaan, Yu, Freddy, Baek, Seungjoo, Luo, Yan, Wang, Mengyu
Abstract
State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
Chinese Translation
最先进的文本到图像模型能够生成高质量的图像,但推理仍然昂贵,因为生成过程需要多个连续的常微分方程(ODE)或去噪步骤。原生单步模型旨在通过在单个步骤中将噪声映射到图像来降低这一成本,然而,由于研究使用了不匹配的采样步骤和不同的无分类器引导(CFG)设置,进行公平比较变得困难,其中CFG可能会在相反的方向上影响FID、Inception Score和基于CLIP的对齐效果。此外,目前尚不清楚单步模型在多步推理中的扩展能力如何,并且在ImageNet之外,针对标签ID条件生成器的标准化分布外评估有限。为了解决这一问题,我们在ImageNet验证集、ImageNetV2和我们的新校对的分布外数据集reLAIONet(与ImageNet标签ID对齐)上,采用受控的类别条件协议对八个模型进行了基准测试,这些模型涵盖了单步流(MeanFlow、Improved MeanFlow、SoFlow)、多步基线(RAE、Scale-RAE)和已建立系统(SiT、Stable Diffusion 3.5、FLUX.1)。通过使用FID、Inception Score、CLIP Score和Pick Score,我们表明,专注于FID的模型开发和CFG选择在少步推理中可能具有误导性,其中引导变化可能会改善FID,同时降低文本与图像的对齐效果和人类偏好信号,并恶化感知质量。我们进一步表明,领先的单步模型在步骤扩展中受益,并在多步推理下变得更具竞争力,尽管它们仍然表现出特征性局部失真。为了捕捉这些权衡,我们引入了最小最大调和平均数(MinMax Harmonic Mean, MMHM),这是一个综合代理,能够在所有四个指标上稳定超参数选择,适用于引导和步骤的变化。
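MMHM is described only by name, so the following is one plausible reading offered as a sketch: min-max normalize each metric across the guidance/step sweep, then take a harmonic mean per configuration. This recipe is our assumption, not the authors' definition:

```python
import numpy as np

def mmhm(scores):
    """scores: (n_configs, n_metrics), pre-oriented so that higher is better
    for every column (e.g. pass -FID). Min-max normalize each metric over
    the sweep, then return the harmonic mean per configuration; the harmonic
    mean punishes configurations that sacrifice any single metric."""
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(axis=0), s.max(axis=0)
    norm = (s - lo) / np.where(hi > lo, hi - lo, 1.0)
    norm = np.clip(norm, 1e-6, None)  # harmonic mean needs positive values
    return s.shape[1] / (1.0 / norm).sum(axis=1)
```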
cs.CV / 159 / 2603.14187

Deep Learning From Routine Histology Improves Risk Stratification for Biochemical Recurrence in Prostate Cancer

常规组织学中的深度学习提高了前列腺癌生化复发的风险分层
Grisi, Clément, Faryna, Khrystyna, Uysal, Nefise, Agosti, Vittorio, Munari, Enrico, Kammerer-Jacquet, Solène-Florence, Salles, Paulo Guilherme de Oliveira, Tolkach, Yuri, Büttner, Reinhard, Semko, Sofiya, Pikul, Maksym, Heidenreich, Axel, van der Laak, Jeroen, Litjens, Geert
Abstract
Accurate prediction of biochemical recurrence (BCR) after radical prostatectomy is critical for guiding adjuvant treatment and surveillance decisions in prostate cancer. However, existing clinicopathological risk models reduce complex morphology to relatively coarse descriptors, leaving substantial prognostic information embedded in routine histopathology underexplored. We present a deep learning-based biomarker that predicts continuous, patient-specific risk of BCR directly from H&E-stained whole-slide prostatectomy specimens. Trained end-to-end on time-to-event outcomes and evaluated across four independent international cohorts, our model demonstrates robust generalization across institutions and patient populations. When integrated with the CAPRA-S clinical risk score, the deep learning risk score consistently improved discrimination for BCR, increasing concordance indices from 0.725-0.772 to 0.749-0.788 across cohorts. To support clinical interpretability, outcome-grounded analyses revealed subtle histomorphological patterns associated with recurrence risk that are not captured by conventional clinicopathological risk scores. This multicohort study demonstrates that deep learning applied to routine prostate histopathology can deliver reproducible and clinically generalizable biomarkers that augment postoperative risk stratification, with potential to support personalized management of prostate cancer in real-world clinical settings.
Chinese Translation
在根治性前列腺切除术后,准确预测生化复发(BCR)对指导前列腺癌的辅助治疗和监测决策至关重要。然而,现有的临床病理风险模型将复杂的形态学简化为相对粗糙的描述符,导致常规组织病理学中潜在的预后信息未得到充分探索。我们提出了一种基于深度学习的生物标志物,能够直接从H&E染色的全切片前列腺切除标本中预测连续的、特定患者的BCR风险。该模型在时间到事件的结果上进行了端到端训练,并在四个独立的国际队列中进行了评估,显示出在不同机构和患者群体中的稳健泛化。当与CAPRA-S临床风险评分结合时,深度学习风险评分始终改善了对BCR的区分能力,使各队列的一致性指数从0.725-0.772提高到0.749-0.788。为了支持临床可解释性,基于结果的分析揭示了与复发风险相关的细微组织形态学模式,而这些模式并未被传统的临床病理风险评分所捕捉。这项多队列研究表明,应用于常规前列腺组织病理学的深度学习可以提供可重复且具有临床普适性的生物标志物,从而增强术后风险分层,并有潜力支持在实际临床环境中个性化管理前列腺癌。
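"Trained end-to-end on time-to-event outcomes" typically means a survival objective. As an illustrative stand-in only (the abstract does not name its exact loss), here is a negative Cox partial log-likelihood in PyTorch:

```python
import torch

def cox_partial_nll(risk, time, event):
    """Negative Cox partial log-likelihood, the standard objective for
    training a network end-to-end on censored time-to-event data.
    risk: (N,) model scores; time: (N,) follow-up times;
    event: (N,) 1 if BCR was observed, 0 if censored."""
    order = torch.argsort(time, descending=True)  # prefix = risk set
    r, e = risk[order], event[order].float()
    log_risk_set = torch.logcumsumexp(r, dim=0)
    return -((r - log_risk_set) * e).sum() / e.sum().clamp_min(1.0)
```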
cs.CV / 160 / 2603.14188

Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis

基于迭代优化的多模态青光眼诊断联合分割与分级
Wang, Zhiwei, Li, Yuxing, Zhu, Meilu, He, Defeng, Lam, Edmund Y.
Abstract
Accurate diagnosis of glaucoma is challenging, as early-stage changes are subtle and often lack clear structural or appearance cues. Most existing approaches rely on a single modality, such as fundus or optical coherence tomography (OCT), capturing only partial pathological information and often missing early disease progression. In this paper, we propose an iterative multimodal optimization model (IMO) for joint segmentation and grading. IMO integrates fundus and OCT features through a mid-level fusion strategy, enhanced by a cross-modal feature alignment (CMFA) module to reduce modality discrepancies. An iterative refinement decoder progressively optimizes the multimodal features through a denoising diffusion mechanism, enabling fine-grained segmentation of the optic disc and cup while supporting accurate glaucoma grading. Extensive experiments show that our method effectively integrates multimodal features, providing a comprehensive and clinically significant approach to glaucoma assessment. Source codes are available at https://github.com/warren-wzw/IMO.git.
Chinese Translation
青光眼的准确诊断具有挑战性,因为早期阶段的变化微妙,通常缺乏明确的结构或外观线索。大多数现有方法依赖于单一模态,例如眼底图像或光学相干断层扫描(OCT),仅捕获部分病理信息,常常错过早期疾病进展。本文提出了一种迭代多模态优化模型(IMO),用于联合分割和分级。IMO通过中层融合策略整合眼底和OCT特征,并通过跨模态特征对齐(CMFA)模块减少模态差异。迭代细化解码器通过去噪扩散机制逐步优化多模态特征,使得能够对视盘和杯的细粒度分割,同时支持准确的青光眼分级。大量实验表明,我们的方法有效整合了多模态特征,为青光眼评估提供了一种全面且具有临床意义的方法。源代码可在 https://github.com/warren-wzw/IMO.git 获取。
cs.CV / 161 / 2603.14189

Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions

进一步行走:长距离条件下的语义感知多模态步态识别
Lu, Zhiyang, Jiang, Wen, Wu, Tianren, Wang, Zhichao, Zhang, Changwang, Shen, Siqi, Cheng, Ming
Abstract
Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present LRGait, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose EMGaitNet, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.
Chinese Translation
步态识别是一种新兴的生物识别技术,能够实现非侵入式且难以伪造的人类身份识别。然而,大多数现有方法仅限于短距离、单一模态的设置,无法在现实条件下推广到长距离和跨距离场景。为了解决这一问题,我们提出了LRGait,这是第一个针对多样化户外距离和环境的强健长距离步态识别的LiDAR-相机多模态基准。此外,我们进一步提出了EMGaitNet,这是一个为长距离多模态步态识别量身定制的端到端框架。为了弥合RGB图像与点云之间的模态差距,我们引入了一种语义引导的融合管道。基于CLIP的语义挖掘(SeMi)模块首先提取与人体部位相关的语义线索,然后通过统一嵌入空间内的语义引导对齐(SGA)模块对2D和3D特征进行对齐。对称交叉注意力融合(SCAF)模块分层整合视觉轮廓和3D几何特征,而时空(ST)模块则捕捉全局步态动态。在多个步态数据集上的广泛实验验证了我们方法的有效性。
cs.CV / 162 / 2603.14203

Selective Noise Suppression and Discriminative Mutual Interaction for Robust Audio-Visual Segmentation

选择性噪声抑制与区分性互交用于鲁棒音视频分割
Peng, Kai, Shen, Yunzhe, Zhang, Miao, Liu, Leiye, Han, Yidong, Ji, Wei, Li, Jingjing, Piao, Yongri, Lu, Huchuan
Abstract
The ability to capture and segment sounding objects in dynamic visual scenes is crucial for the development of Audio-Visual Segmentation (AVS) tasks. While significant progress has been made in this area, the interaction between audio and visual modalities still requires further exploration. In this work, we aim to answer the following questions: How can a model effectively suppress audio noise while enhancing relevant audio information? How can we achieve discriminative interaction between the audio and visual modalities? To this end, we propose SDAVS, equipped with the Selective Noise-Resilient Processor (SNRP) module and the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. The proposed SNRP mitigates audio noise interference by selectively emphasizing relevant auditory cues, while DAMF ensures more consistent audio-visual representations. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on benchmark AVS datasets, especially in multi-source and complex scenes. The code and model are available at https://github.com/happylife-pk/SDAVS.
Chinese Translation
在动态视觉场景中捕捉和分割发声物体的能力对于音视频分割(Audio-Visual Segmentation, AVS)任务的发展至关重要。尽管在这一领域已经取得了显著进展,但音频与视觉模态之间的交互仍需进一步探索。在本研究中,我们旨在回答以下问题:模型如何有效抑制音频噪声,同时增强相关的音频信息?我们如何实现音频与视觉模态之间的区分性交互?为此,我们提出了SDAVS,配备选择性抗噪处理器(Selective Noise-Resilient Processor, SNRP)模块和区分性音视频互融(Discriminative Audio-Visual Mutual Fusion, DAMF)策略。所提出的SNRP通过选择性地强调相关的听觉线索来减轻音频噪声干扰,而DAMF则确保音视频表示的一致性。实验结果表明,我们提出的方法在基准AVS数据集上达到了最先进的性能,尤其是在多源和复杂场景中。代码和模型可在 https://github.com/happylife-pk/SDAVS 获取。
cs.CV / 163 / 2603.14207

DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution

DualTSR:统一的双重扩散变换器用于场景文本图像超分辨率
Niu, Axi, Zhang, Kang, Yan, Qingsen, Jin, Hao, Sun, Jinqiu, Zhang, Yanning
Abstract
Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.
Chinese Translation
场景文本图像超分辨率(STISR)旨在恢复低分辨率文本图像中的高分辨率细节,这对于人类可读性和机器识别至关重要。然而,现有的方法往往依赖于外部光学字符识别(OCR)模型来获取文本先验或依赖复杂的多组件架构,这些架构难以训练和重现。本文介绍了DualTSR,一种统一的端到端框架,解决了这两个问题。DualTSR采用单一的多模态变换器骨干,通过双重扩散目标进行训练。它通过条件流匹配同时建模高分辨率图像的连续分布以及通过离散扩散建模文本内容的离散分布。这种共享设计使得视觉信息和文本信息能够在每一层进行交互,从而使模型能够内部推断文本先验,而不依赖于外部OCR模块。与先前的多分支扩散系统相比,DualTSR提供了更简单的端到端表述,并且手工设计的组件更少。在合成中文基准和经过精心策划的真实世界评估协议上的实验表明,DualTSR实现了强大的感知质量和文本保真度。
cs.CV / 164 / 2603.14209

ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

ChArtist:生成具有统一空间和主题控制的图示图表
Xiao, Shishi, Zhou, Tongyu, Laidlaw, David, Chan, Gromit Yeuk-Yin
Abstract
A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. The dense structural cues that current methods extract from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.
Chinese Translation
图示图表是一种有效的视觉叙事媒介,能够将视觉元素与数据图表无缝结合。然而,创建这样的图像具有挑战性,因为视觉元素的灵活性往往与图表结构的刚性相冲突。因此,这一过程需要一种创造性的变形,既保持数据的真实性,又兼顾视觉美感。目前,从自然图像中提取密集结构线索(例如,边缘或深度图)的现有方法不适合作为图示图表生成的条件信号。我们提出了ChArtist,一种用于自动生成图示图表的领域特定扩散模型,提供两种不同类型的控制:1)与图表结构良好对齐的空间控制,和2)尊重参考图像视觉特征的主题驱动控制。为此,我们引入了一种基于骨架的空间控制表示。这种表示仅编码图表的数据编码信息,允许在没有刚性轮廓约束的情况下轻松融入参考视觉。我们基于扩散变换器(Diffusion Transformer, DiT)实现了我们的方法,并利用自适应位置编码机制来管理这两种控制。我们进一步引入了空间门控注意力,以调节空间控制与主题控制之间的互动。为了支持对预训练模型的微调,我们创建了一个包含30,000个三元组(骨架、参考图像、图示图表)的规模庞大的数据集。我们还提出了一种统一的数据准确性指标,以评估生成图表的数据真实性。我们相信这项工作表明,当前的生成模型可以通过超越通用条件,转向任务特定的表示,实现以数据驱动的视觉叙事。项目页面:https://chartist-ai.github.io/
cs.CV / 165 / 2603.14214

UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation

UniFusion:一种具有鲁棒表示和源感知保留的统一图像融合框架
Li, Xingyuan, Du, Songcheng, Zou, Yang, Xu, HaoYuan, Jiang, Zhiying, Liu, Jinyuan
Abstract
Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion's superior visual quality, generalization ability, and adaptability to real-world scenarios. Code is available at https://github.com/dusongcheng/UniFusion.
Chinese Translation
图像融合旨在整合来自多个源图像的互补信息,以生成更具信息量和视觉一致性的表示,从而有利于人类感知和下游视觉任务。尽管最近取得了一些进展,但大多数现有的融合方法都是针对特定任务(即多模态、多曝光或多聚焦融合)设计的,且在融合过程中难以有效保留源信息。这一限制主要源于任务特定的架构以及深层传播导致的源信息退化。为了解决这些问题,我们提出了UniFusion,一种旨在实现跨任务泛化的统一图像融合框架。首先,利用DINOv3进行模态一致的特征提取,UniFusion为多样化输入建立了共享的语义空间。其次,为了保留对每个源图像的理解,我们引入了一种重建对齐损失,以保持融合输出与输入之间的一致性。最后,我们采用双层优化策略,解耦并联合优化重建和融合目标,有效平衡它们之间的耦合关系,并确保平滑收敛。在多个融合任务上的广泛实验表明,UniFusion在视觉质量、泛化能力和适应真实场景方面表现优越。代码可在 https://github.com/dusongcheng/UniFusion 获取。
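A minimal sketch of the reconstruction-alignment idea in UniFusion, assuming an L1 penalty and one lightweight decoder per source image (the decoder design, loss form, and weight `lam` are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

# Sketch of a reconstruction-alignment objective: besides the fusion loss,
# the fused output is pushed to stay decodable back to each source image,
# so source information is not washed out in deep layers.
def recon_align_loss(fused, sources, decoders, lam=0.5):
    # fused: (B, C, H, W); sources: list of (B, C, H, W); one decoder each
    loss = torch.zeros(())
    for src, dec in zip(sources, decoders):
        loss = loss + F.l1_loss(dec(fused), src)
    return lam * loss

decoders = [torch.nn.Conv2d(3, 3, 3, padding=1) for _ in range(2)]
fused = torch.rand(2, 3, 64, 64)
sources = [torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)]
loss = recon_align_loss(fused, sources, decoders)
loss.backward()
```

In the paper this term is balanced against the fusion objective via bilevel optimization; the flat weighted sum above is only the simplest stand-in.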
cs.CV / 166 / 2603.14219

Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

安全潜力剪枝:增强安全提示以防止视觉语言模型(VLM)越狱攻击,无需重新训练
Li, Chongxin, Wang, Hanzhang, Duan, Lian
Abstract
Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
Chinese Translation
安全提示构成了对视觉语言模型(VLM)越狱攻击的一层可解释防御,但其有效性受到模型潜在结构响应性的限制。我们观察到,这些提示始终涉及一组稀疏的参数,这些参数在正常使用期间大部分保持静默。这个发现促使我们提出安全子网络假说:VLM嵌入了结构上独特的路径,能够执行安全性,但这些路径在没有明确刺激的情况下保持休眠。为了揭示并放大这些路径,我们引入了安全潜力剪枝(Safety-Potential Pruning),这是一种一次性剪枝框架,通过去除对安全提示反应较弱的权重,而无需额外的重新训练,从而放大与安全相关的激活。在三种代表性的VLM架构和三个越狱基准测试中,我们的方法使攻击成功率相较于单独提示降低了多达22%,同时保持了强大的正常性能。这些发现使剪枝不仅被视为一种模型压缩技术,还作为一种结构干预,以显现与对齐相关的子网络,为实现稳健的越狱抵抗提供了一条新路径。
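A rough sketch of one-shot pruning by safety responsiveness. The scoring rule below (weight magnitude times the positive activation gap between safety-prompted and benign inputs) is a simplified proxy we assume for illustration; the paper's actual criterion may differ:

```python
import torch
import torch.nn as nn

# One-shot pruning sketch: score weights by how much more active their
# inputs are under safety prompts than under benign prompts, then zero the
# least responsive fraction, with no retraining.
layer = nn.Linear(128, 128)
act_safety = torch.rand(128)   # mean |activation| into this layer, safety prompts
act_benign = torch.rand(128)   # same statistic under benign prompts

responsiveness = layer.weight.abs() * (act_safety - act_benign).clamp_min(0)
k = int(0.1 * layer.weight.numel())              # prune 10% least responsive
thresh = responsiveness.flatten().kthvalue(k).values
mask = (responsiveness > thresh).float()
with torch.no_grad():
    layer.weight.mul_(mask)                      # one-shot, no retraining
```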
cs.CV / 167 / 2603.14220

FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection

FIND:一种简单而有效的扩散生成图像检测基线
Li, Jie, Feng, Yingying, Xie, Chi, Hu, Jie, Tan, Lei, Ji, Jiayi
Abstract
The remarkable realism of images generated by diffusion models poses critical detection challenges. Current methods utilize reconstruction error as a discriminative feature, exploiting the observation that real images exhibit higher reconstruction errors when processed through diffusion models. However, these approaches require costly reconstruction computations and depend on specific diffusion models, making their performance highly model-dependent. We identify a fundamental difference: real images are more difficult to fit with Gaussian distributions compared to synthetic ones. In this paper, we propose Forgery Identification via Noise Disturbance (FIND), a novel method that requires only a simple binary classifier. It eliminates reconstruction by directly targeting the core distributional difference between real and synthetic images. Our key operation is to add Gaussian noise to real images during training and label these noisy versions as synthetic. This step allows the classifier to focus on the statistical patterns that distinguish real from synthetic images. We theoretically prove that the noise-augmented real images resemble diffusion-generated images in their ease of Gaussian fitting. Furthermore, simply by adding noise, they still retain visual similarity to the original images, highlighting the most discriminative distribution-related features. The proposed FIND improves performance by 11.7% on the GenImage benchmark while running 126x faster than existing methods. By removing the need for auxiliary diffusion models and reconstruction, it offers a practical, efficient, and generalizable way to detect diffusion-generated content.
Chinese Translation
扩散模型生成图像的卓越现实主义带来了重大的检测挑战。当前的方法利用重构误差作为判别特征,利用真实图像在经过扩散模型处理时表现出更高的重构误差的观察结果。然而,这些方法需要昂贵的重构计算,并且依赖于特定的扩散模型,使得它们的性能高度依赖模型。我们识别出一个基本差异:与合成图像相比,真实图像更难以用高斯分布进行拟合。本文提出了一种新方法——通过噪声干扰进行伪造识别(Forgery Identification via Noise Disturbance,FIND),该方法仅需要一个简单的二元分类器。它通过直接针对真实图像与合成图像之间的核心分布差异,消除了重构的需求。我们的关键操作是在训练过程中向真实图像添加高斯噪声,并将这些带噪声的版本标记为合成图像。这一步骤使得分类器能够关注区分真实图像与合成图像的统计模式。我们从理论上证明了被噪声增强的真实图像在高斯拟合的便利性上类似于扩散生成图像。此外,仅通过添加噪声,它们仍然保留了与原始图像的视觉相似性,突显了最具判别性的分布相关特征。所提的FIND在GenImage基准测试中提高了11.7%的性能,同时运行速度比现有方法快126倍。通过消除对辅助扩散模型和重构的需求,它提供了一种实用、高效且具有可推广性的方式来检测扩散生成的内容。
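The core training trick of FIND is easy to sketch: perturb real images with Gaussian noise, relabel the noisy copies as synthetic, and train an ordinary binary classifier. The noise level, backbone, and batch construction below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Noise-disturbance training sketch: noisy real images are labeled
# "synthetic" so the classifier learns the distributional cue (ease of
# Gaussian fitting) rather than generator-specific artifacts.
def make_training_batch(real, fake, sigma=0.1):
    """real, fake: (B, C, H, W) tensors in [0, 1]."""
    noisy_real = (real + sigma * torch.randn_like(real)).clamp(0, 1)
    x = torch.cat([real, noisy_real, fake], dim=0)
    # label 0 = real, label 1 = synthetic (noisy real counts as synthetic)
    y = torch.cat([
        torch.zeros(real.size(0)),
        torch.ones(noisy_real.size(0)),
        torch.ones(fake.size(0)),
    ]).long()
    return x, y

classifier = nn.Sequential(  # stand-in backbone; the paper's choice may differ
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
)

real = torch.rand(4, 3, 64, 64)
fake = torch.rand(4, 3, 64, 64)
x, y = make_training_batch(real, fake)
loss = nn.CrossEntropyLoss()(classifier(x), y)
loss.backward()
```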
cs.CV / 168 / 2603.14228

Not All Directions Matter: Toward Structured and Task-Aware Low-Rank Adaptation

并非所有方向都重要:面向结构化和任务感知的低秩适应
Xiao, Xi, Ma, Chenrui, Zhang, Yunbei, Liu, Chen, Wang, Zhuxuanzi, Li, Yanshu, Zhao, Lin, Hu, Guosheng, Wang, Tianyang, Xu, Hao
Abstract
Low-Rank Adaptation (LoRA) has become a cornerstone of parameter-efficient fine-tuning (PEFT). Yet, its efficacy is hampered by two fundamental limitations: semantic drift, from treating all update directions as equally important, and structural incoherence, from adapting layers independently, resulting in suboptimal, uncoordinated updates. To remedy these, we propose StructLoRA, a framework that addresses both limitations through a principled, dual-component design: (1) an Information Bottleneck-guided filter that prunes task-irrelevant directions to mitigate semantic drift, and (2) a lightweight, training-only graph-based coordinator that enforces inter-layer consistency to resolve structural incoherence. Extensive experiments across large language, vision-language, and vision models (including LLaMA, LLaVA, and ViT) demonstrate that StructLoRA consistently establishes a new state-of-the-art, outperforming not only vanilla LoRA but also advanced dynamic rank allocation and sparsity-based methods. Notably, the benefits are particularly pronounced in challenging low-rank and low-data regimes. Crucially, since our proposed modules operate only during training, StructLoRA enhances performance with zero additional inference cost, advancing the focus of PEFT -- from mere parameter compression to a more holistic optimization of information quality and structural integrity.
Chinese Translation
低秩适应(Low-Rank Adaptation, LoRA)已成为参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的基石。然而,其有效性受到两个基本限制的制约:语义漂移,因将所有更新方向视为同等重要,以及结构不一致,因独立适应层而导致的次优、不协调的更新。为了解决这些问题,我们提出了StructLoRA,一个通过原则性、双组件设计来应对这两种限制的框架:(1)一个信息瓶颈引导的过滤器,修剪与任务无关的方向以减轻语义漂移;(2)一个轻量级的、仅在训练期间使用的基于图的协调器,强制执行层间一致性以解决结构不一致。针对大型语言模型、视觉语言模型和视觉模型(包括LLaMA、LLaVA和ViT)的广泛实验表明,StructLoRA始终建立了新的最先进水平,不仅超越了传统的LoRA,还超越了先进的动态秩分配和基于稀疏的方法。值得注意的是,这些好处在具有挑战性的低秩和低数据环境中尤为明显。关键的是,由于我们提出的模块仅在训练期间运行,StructLoRA在不增加推理成本的情况下提升了性能,推动了PEFT的重点从单纯的参数压缩转向信息质量和结构完整性的更全面优化。
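For reference, a minimal LoRA layer with a crude direction-pruning step standing in for StructLoRA's Information Bottleneck-guided filter (the magnitude-based score and gating mechanism are assumptions for illustration; the paper's IB objective and graph coordinator are not reproduced here):

```python
import torch
import torch.nn as nn

# LoRA sketch with direction pruning: the low-rank update B @ A is a sum of
# rank-1 directions; a gate keeps only the highest-scoring ones.
class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank
        self.register_buffer("gate", torch.ones(rank))  # 1 = keep direction

    def forward(self, x):
        delta = (self.B * self.gate) @ self.A  # (d_out, d_in), gated update
        return self.base(x) + self.scale * x @ delta.t()

    def prune_directions(self, keep=4):
        # score each rank-1 direction by a simple magnitude proxy; keep top-k
        scores = self.B.norm(dim=0) * self.A.norm(dim=1)
        idx = scores.topk(keep).indices
        gate = torch.zeros_like(self.gate)
        gate[idx] = 1.0
        self.gate.copy_(gate)

layer = LoRALinear(64, 64)
y = layer(torch.randn(2, 64))
layer.prune_directions(keep=4)
```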
cs.CV / 169 / 2603.14232

S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction

S2GS:在线场景理解与重建的流式语义高斯点云
Zhang, Renhe, Tan, Yuyang, Gong, Jingyu, Zhang, Zhizhong, Ma, Lizhuang, Xie, Yuan, Tan, Xin
Abstract
Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.
Chinese Translation
现有的离线前馈方法在长图像流上的联合场景理解与重建,通常需要对不断增长的历史观测集合反复执行全局计算,这导致运行时和GPU内存随着序列长度迅速增加,从而限制了可扩展性。我们提出了流式语义高斯点云(Streaming Semantic Gaussian Splatting, S2GS),这是一个严格因果的增量3D高斯语义场框架:它不利用未来帧,并持续更新场景几何、外观和实例级语义,而无需重新处理历史帧,从而实现可扩展的在线联合重建与理解。S2GS采用几何-语义解耦的双主干设计:几何分支执行因果建模以驱动增量高斯更新,而语义分支则利用2D基础视觉模型和查询驱动解码器来预测分割掩码和身份嵌入,进一步通过查询级对比对齐和与实例内存的轻量在线关联来稳定。实验表明,S2GS在联合重建与理解基准测试中与强大的离线基线相匹配或超越,同时显著提高了长时间可扩展性:它处理超过1000帧时,运行时和GPU内存的增长速度要慢得多,而离线全局处理基线在相同设置下通常在大约80帧时就会耗尽内存。
cs.CV / 170 / 2603.14240

FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains

FOCUS:跨领域细粒度识别与开放世界发现的桥梁
Rathore, Vaibhav, Gupta, Divyam, Abdar, Moloud, Chaudhuri, Subhasis, Banerjee, Biplab
Abstract
We introduce the first unified framework for Fine-Grained Domain-Generalized Generalized Category Discovery (FG-DG-GCD), bringing open-world recognition closer to real-world deployment under domain shift. Unlike conventional GCD, which assumes labeled and unlabeled data come from the same distribution, DG-GCD learns only from labeled source data and must both recognize known classes and discover novel ones in unseen, unlabeled target domains. This problem is especially challenging in fine-grained settings, where subtle inter-class differences and large intra-class variation make domain generalization significantly harder. To support systematic evaluation, we establish the first FG-DG-GCD benchmarks by creating identity-preserving painting and sketch domains for CUB-200-2011, Stanford Cars, and FGVC-Aircraft using controlled diffusion-adapter stylization. On top of this, we propose FoCUS, a single-stage framework that combines Domain-Consistent Parts Discovery (DCPD) for geometry-stable part reasoning with Uncertainty-Aware Feature Augmentation (UFA) for confidence-calibrated feature regularization through uncertainty-guided perturbations. Extensive experiments show that FoCUS outperforms strong GCD, FG-GCD, and DG-GCD baselines by 3.28%, 9.68%, and 2.07%, respectively, in clustering accuracy on the proposed benchmarks. It also remains competitive on coarse-grained DG-GCD tasks while achieving nearly 3x higher computational efficiency than the current state of the art. (Code and datasets will be released upon acceptance.)
Chinese Translation
我们提出了第一个统一框架用于细粒度领域泛化类别发现(FG-DG-GCD),使开放世界识别更接近于在领域转移下的实际部署。与传统的类别发现(GCD)不同,后者假设标记和未标记数据来自相同的分布,DG-GCD仅从标记的源数据中学习,必须在未见的未标记目标领域中识别已知类别并发现新类别。这个问题在细粒度设置中尤其具有挑战性,因为细微的类间差异和大的类内变异使得领域泛化变得更加困难。为了支持系统评估,我们通过创建保持身份的绘画和素描领域,为CUB-200-2011、斯坦福汽车和FGVC-飞机建立了第一个FG-DG-GCD基准,使用受控的扩散适配器风格化。在此基础上,我们提出了FoCUS,一个单阶段框架,结合了领域一致的部件发现(DCPD)用于几何稳定的部件推理,以及不确定性感知特征增强(UFA)通过不确定性引导的扰动进行置信度校准的特征正则化。大量实验表明,FoCUS在所提出的基准上在聚类准确性方面分别比强大的GCD、FG-GCD和DG-GCD基线提高了3.28%、9.68%和2.07%。它在粗粒度DG-GCD任务中也保持竞争力,同时实现了比当前最先进技术高出近3倍的计算效率。(代码和数据集将在接受后发布。)
cs.CV / 171 / 2603.14241

CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control

CamLit:具有明确相机和光照控制的统一视频扩散模型
Kuang, Zhiyi, He, Chengan, Zakharov, Egor, Xue, Yuxuan, Saito, Shunsuke, Maury, Olivier, Bagautdinov, Timur, Zheng, Youyi, Nam, Giljoo
Abstract
We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.
Chinese Translation
我们提出了CamLit,这是第一个统一的视频扩散模型,能够从单个输入图像中联合执行新视角合成(NVS)和重光照。给定一张参考图像、用户定义的相机轨迹和环境图,CamLit能够在指定的光照下,从新的视点合成场景的视频。在单一的生成过程中,我们的模型生成时间上连贯且空间上对齐的输出,包括重光照的新视角帧和相应的反照率帧,从而实现对相机姿态和光照的高质量控制。定性和定量实验表明,CamLit在新视角合成和重光照方面的高保真输出与最先进的方法相当,同时在两个任务中都没有牺牲视觉质量。我们展示了一个单一的生成模型能够有效整合相机和光照控制,简化视频生成流程,同时保持竞争力的性能和一致的真实感。
cs.CV / 172 / 2603.14243

BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification

BIT:基于匹配的双向交互转换网络用于可见光-红外行人重识别
Xu, Haoxuan, Niu, Guanglin
Abstract
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query-aware scoring to enhance cross-modality correspondence. To the best of our knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that our BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.
Chinese Translation
可见光-红外行人重识别(VI-ReID)是一项具有挑战性的检索任务,因为可见光图像和红外图像之间存在显著的模态差异。虽然现有方法试图通过在共享嵌入空间中学习模态不变特征来弥合这一差距,但它们往往忽视了模态之间复杂而隐含的相关性。在分布变化的情况下,这一限制变得更加严重,因为红外样本通常远少于可见光样本。为了解决这些挑战,我们提出了一种新颖的网络,称为双向交互转换(BIT)。BIT并不依赖于刚性的特征对齐,而是采用基于匹配的策略,明确建模可见光和红外图像对之间的交互。具体而言,BIT采用编码器-解码器架构,其中编码器提取初步特征表示,解码器则执行双向特征融合和查询感知评分,以增强跨模态对应关系。根据我们所知,BIT是首个在VI-ReID中引入这种成对匹配驱动的交互的方法。大量在多个基准上的实验表明,我们的BIT实现了最先进的性能,突显了其在VI-ReID任务中的有效性。
cs.CV / 173 / 2603.14249

OAHuman: Occlusion-Aware 3D Human Reconstruction from Monocular Images

OAHuman:基于单目图像的遮挡感知3D人类重建
Yang, Yuanwang, Liu, Hongliang, Zhang, Muxin, Ma, Nan, Yang, Jingyu, Lai, Yu-Kun, Li, Kun
Abstract
Monocular 3D human reconstruction in real-world scenarios remains highly challenging due to frequent occlusions from surrounding objects, people, or image truncation. Such occlusions lead to missing geometry and unreliable appearance cues, severely degrading the completeness and realism of reconstructed human models. Although recent neural implicit methods achieve impressive results on clean inputs, they struggle under occlusion due to entangled modeling of shape and texture. In this paper, we propose OAHuman, an occlusion-aware framework that explicitly decouples geometry reconstruction and texture synthesis for robust 3D human modeling from a single RGB image. The core innovation lies in the decoupling-perception paradigm, which addresses the fundamental issue of geometry-texture cross-contamination in occluded regions. Our framework ensures that geometry reconstruction is perceptually reinforced even in occluded areas, isolating it from texture interference. In parallel, texture synthesis is learned exclusively from visible regions, preventing texture errors from being transferred to the occluded areas. This decoupling approach enables OAHuman to achieve robust and high-fidelity reconstruction under occlusion, which has been a long-standing challenge in the field. Extensive experiments on occlusion-rich benchmarks demonstrate that OAHuman achieves superior performance in terms of structural completeness, surface detail, and texture realism, significantly improving monocular 3D human reconstruction under occlusion conditions.
Chinese Translation
在现实场景中,基于单目图像的3D人类重建仍然面临巨大挑战,因为周围物体、人物或图像截断经常导致遮挡。这些遮挡会导致几何信息缺失和外观线索不可靠,严重降低重建人类模型的完整性和真实感。尽管最近的神经隐式方法在干净输入上取得了令人印象深刻的结果,但在遮挡情况下,由于形状和纹理的交织建模,它们表现不佳。本文提出了OAHuman,一个遮挡感知框架,明确地将几何重建和纹理合成解耦,以实现从单个RGB图像进行稳健的3D人类建模。核心创新在于解耦感知范式,解决了遮挡区域中几何与纹理交叉污染的根本问题。我们的框架确保即使在遮挡区域,几何重建也能得到感知上的增强,使其与纹理干扰隔离。同时,纹理合成仅从可见区域学习,防止纹理错误转移到遮挡区域。这种解耦方法使OAHuman在遮挡情况下实现了稳健且高保真的重建,这在该领域一直是一个长期挑战。在遮挡丰富的基准测试上的大量实验表明,OAHuman在结构完整性、表面细节和纹理真实感方面表现优越,显著改善了在遮挡条件下的单目3D人类重建。
cs.CV / 174 / 2603.14252

MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos

MistExit:学习在程序视频中进行早期错误检测的退出策略
Majumder, Sagnik, Nethi, Anish, Al-Halah, Ziad, Grauman, Kristen
Abstract
We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.
Chinese Translation
我们提出了一种视频中的早期错误检测任务,其目标是在尽可能少观察流媒体视频的情况下,判断程序活动中的关键步骤是否正确执行。为了解决这个问题,我们提出了一种方法,包括一个错误检测器和一个强化学习策略。在每个时间步,检测器处理最近观察到的帧,以估计关键步骤的正确性,同时预测未来的视觉特征,从而实现可靠的早期错误估计。同时,策略聚合检测器的输出和视觉观察,并根据时间自适应地决定何时退出(即停止处理传入帧),同时生成最终预测。通过使用多样的真实世界程序视频数据集,我们证明了我们的MistExit模型在提高错误检测准确性的同时,相较于最先进的模型减少了观察到的视频比例。项目网址:https://vision.cs.utexas.edu/projects/mist_exit。
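The observe-then-exit loop of MistExit can be sketched as follows, with a recurrent detector scoring keystep correctness and a policy head deciding when to stop; both networks and the exit threshold are stand-ins, not the trained components from the paper:

```python
import torch

# Early-exit sketch: a detector scores keystep correctness from frames seen
# so far, and a policy decides at each step whether to keep watching or
# stop and commit to a prediction.
detector = torch.nn.GRU(input_size=64, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, 1)          # P(mistake) from detector state
exit_head = torch.nn.Linear(32, 1)     # policy: P(exit now)

frames = torch.randn(1, 40, 64)        # 40 timesteps of frame features
h = None
for t in range(frames.size(1)):
    out, h = detector(frames[:, t:t + 1], h)
    p_mistake = torch.sigmoid(head(out[:, -1]))
    if torch.sigmoid(exit_head(out[:, -1])) > 0.9:   # policy chooses to exit
        break
print(f"exited at t={t}, P(mistake)={p_mistake.item():.2f}")
```

In the paper the exit decision is learned with reinforcement learning against a reward trading off accuracy and observation cost; the fixed threshold above only illustrates the control flow.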
cs.CV / 175 / 2603.14254

ZOTTA: Test-Time Adaptation with Gradient-Free Zeroth-Order Optimization

ZOTTA:基于无梯度零阶优化的测试时适应
Zhang, Ronghao, Niu, Shuaicheng, Deng, Qi, Dong, Yanjie, Chen, Jian, Zeng, Runhao
Abstract
Test-time adaptation (TTA) aims to improve model robustness under distribution shifts by adapting to unlabeled test data, but most existing methods rely on backpropagation (BP), which is computationally costly and incompatible with non-differentiable models such as quantized models, limiting practical deployment on numerous edge devices. Recent BP-free approaches alleviate overhead but remain either architecture-specific or limited in optimization capacity to handle high-dimensional models. We propose ZOTTA, a fully BP-free TTA framework that performs efficient adaptation using only forward passes via Zeroth-Order Optimization (ZOO). While ZOO is theoretically appealing, naive application leads to slow convergence under high-dimensional parameter spaces and unstable optimization due to the lack of labels. ZOTTA overcomes these challenges through 1) Distribution-Robust Layer Selection, which automatically identifies and freezes layers that already extract distribution-invariant features, updating only domain-sensitive layers to reduce the optimization dimensionality and accelerate convergence; 2) Spatial Feature Aggregation Alignment, which stabilizes ZOO by aligning globally aggregated spatial features between source and target to reduce gradient variance. Together, these components enable architecture-agnostic and stable BP-free adaptation. Extensive experiments on ImageNet-C/R/Sketch/A show that ZOTTA outperforms or matches BP-based methods, e.g., it reduces memory usage by 84% and improves accuracy by 3.9% over SAR on ImageNet-C.
Chinese Translation
测试时适应(TTA)旨在通过适应未标记的测试数据来提高模型在分布变化下的鲁棒性,但大多数现有方法依赖于反向传播(BP),这在计算上成本高且与非可微模型(如量化模型)不兼容,限制了在众多边缘设备上的实际部署。近期的无BP方法减轻了开销,但仍然要么特定于架构,要么在优化能力上有限,无法处理高维模型。我们提出了ZOTTA,一个完全无BP的TTA框架,通过仅使用正向传播和零阶优化(ZOO)进行高效适应。虽然ZOO在理论上具有吸引力,但在高维参数空间下的简单应用会导致收敛缓慢和由于缺乏标签而导致的不稳定优化。ZOTTA通过以下方式克服了这些挑战:1)分布鲁棒层选择,自动识别并冻结已经提取分布不变特征的层,仅更新对领域敏感的层,从而减少优化维度并加速收敛;2)空间特征聚合对齐,通过对齐源和目标之间全局聚合的空间特征来稳定ZOO,从而减少梯度方差。综合这些组件,使得ZOTTA能够实现与架构无关且稳定的无BP适应。在ImageNet-C/R/Sketch/A上的大量实验表明,ZOTTA的性能优于或与基于BP的方法相匹配,例如,在ImageNet-C上,它将内存使用减少了84%,并将准确率提高了3.9%相较于SAR。
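The BP-free update at the heart of such methods can be illustrated with a two-point (SPSA-style) zeroth-order gradient estimate on an entropy objective; ZOTTA's layer selection and feature alignment are omitted, and the adapted layer, step sizes, and loss below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Forward-only adaptation sketch: perturb parameters along a random
# direction, evaluate the unsupervised loss twice, and form a directional
# gradient estimate without backpropagation.
def entropy_loss(model, x):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
        return -(probs * probs.clamp_min(1e-8).log()).sum(1).mean()

def zo_step(model, params, x, mu=1e-3, lr=1e-3):
    flat = torch.cat([p.detach().flatten() for p in params])
    u = torch.randn_like(flat)
    def set_flat(v):
        i = 0
        for p in params:
            n = p.numel()
            p.data.copy_(v[i:i + n].view_as(p)); i += n
    set_flat(flat + mu * u); l_plus = entropy_loss(model, x)
    set_flat(flat - mu * u); l_minus = entropy_loss(model, x)
    g = (l_plus - l_minus) / (2 * mu) * u  # directional gradient estimate
    set_flat(flat - lr * g)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
params = list(model[-1].parameters())  # adapt only a "domain-sensitive" layer
zo_step(model, params, torch.randn(8, 32))
```

Restricting `params` to a few layers mirrors the paper's motivation for layer selection: the variance of this estimator grows with the number of perturbed parameters.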
cs.CV / 176 / 2603.14267

DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

DiFlowDubber:通过跨模态对齐和同步实现自动视频配音的离散流匹配
Nguyen, Ngoc-Son, Tran, Thanh V. T., Choi, Jeongsoo, Huynh-Nguyen, Hieu-Nghia, Hy, Truong-Son, Nguyen, Van
Abstract
Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.
Chinese Translation
视频配音在电影制作、多媒体创作和辅助语音技术中具有广泛的应用。现有的方法要么直接在有限的配音数据集上进行训练,要么采用两阶段流程,适配预训练的文本到语音(TTS)模型,这些模型往往难以产生富有表现力的韵律、丰富的声学特征和精确的同步。为了解决这些问题,我们提出了DiFlowDubber,采用一种新颖的两阶段训练框架,有效地将知识从预训练的TTS模型转移到视频驱动的配音中,并使用离散流匹配生成骨干网络。具体而言,我们设计了一个FaPro模块,从面部表情中捕捉全局韵律和风格线索,并利用这些信息指导后续语音属性的建模。为了确保精确的语音与唇动同步,我们引入了一个Synchronizer模块,弥合文本、视频和语音之间的模态差距,从而改善跨模态对齐,并生成与唇部动作在时间上同步的语音。在两个主要基准数据集上的实验表明,DiFlowDubber在多个指标上优于之前的方法。
cs.CV / 177 / 2603.14271

Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs

朝着临床就绪的医学图像分析基础模型:适应机制与部署权衡
Phuntsho, Karma, Abdullah, Lee, Kyungmi, Lee, Ickjai, Ahn, Euijoon
Abstract
Foundation models (FMs) have demonstrated strong transferability across medical imaging tasks, yet their clinical utility depends critically on how pretrained representations are adapted to domain-specific data, supervision regimes, and deployment constraints. Prior surveys primarily emphasize architectural advances and application coverage, while the mechanisms of adaptation and their implications for robustness, calibration, and regulatory feasibility remain insufficiently structured. This review introduces a strategy-centric framework for FM adaptation in medical image analysis (MIA). We conceptualize adaptation as a post-pretraining intervention and organize existing approaches into five mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level adaptation. For each mechanism, we analyze trade-offs in adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden. We synthesize evidence across classification, segmentation, and detection tasks, highlighting how adaptation strategies influence clinically relevant failure modes rather than only aggregate benchmark performance. Finally, we examine how adaptation choices interact with validation protocols, calibration stability, multi-institutional deployment, and regulatory oversight. By reframing adaptation as a process of controlled representational change under clinical constraints, this review provides practical guidance for designing FM-based systems that are robust, auditable, and compatible with clinical deployment.
Chinese Translation
基础模型(FMs)在医学影像任务中展示了强大的迁移能力,但其临床实用性在很大程度上依赖于如何将预训练表示适应于特定领域的数据、监督机制和部署限制。以往的调查主要强调架构进展和应用覆盖,而适应机制及其对鲁棒性、校准和监管可行性的影响仍然缺乏系统性。本综述引入了一个以策略为中心的医学图像分析(MIA)基础模型适应框架。我们将适应概念化为一种后预训练干预,并将现有方法组织为五种机制:参数、表示、目标、数据中心和架构/序列级适应。对于每种机制,我们分析了适应深度、标签效率、领域鲁棒性、计算成本、可审计性和监管负担之间的权衡。我们综合了分类、分割和检测任务的证据,强调适应策略如何影响临床相关的失败模式,而不仅仅是整体基准性能。最后,我们考察了适应选择如何与验证协议、校准稳定性、多机构部署和监管监督相互作用。通过将适应重新框架为在临床限制下的受控表示变化过程,本综述为设计基于基础模型的系统提供了实用指导,使其在临床部署中具备鲁棒性、可审计性和兼容性。
cs.CV / 178 / 2603.14276

All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation

全天候多场景终身视觉与语言导航的塔克适应
Wang, Xudong, Li, Gan, Liu, Zhiyu, Wang, Yao, Liu, Lianqing, Han, Zhi
Abstract
Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.
Chinese Translation
部署视觉与语言导航(VLN)代理需要在多样化的场景和环境中进行适应,但在特定场景上的微调往往会导致在其他场景中的灾难性遗忘,这严重限制了灵活的长期部署。我们将这一挑战形式化为全天候多场景终身VLN(AML-VLN)问题。现有的参数高效适配器(如LoRA及其变体)受到其二维矩阵形式的限制,无法捕捉跨越多个场景和环境的多层次导航知识。为了解决这一问题,我们提出了塔克适应(Tucker Adaptation,TuKA),将多层次导航知识表示为高阶张量,并利用塔克分解将知识解耦为共享子空间和特定场景专家。我们进一步引入了一种解耦知识增量学习策略,以巩固共享子空间,同时约束特定专家以实现解耦的终身学习。在TuKA的基础上,我们还开发了一种名为AlldayWalker的VLN代理,它能够在多个导航场景中持续学习,实现全天候多场景导航。大量实验表明,AlldayWalker在各项指标上始终优于最先进的基线。
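For intuition, a Tucker-factored knowledge tensor can be written as a small core contracted with shared input/output factors and a scenario factor. The shapes and the per-scenario adapter readout below are our assumptions, not the paper's parameterization:

```python
import torch

# Tucker factorization sketch: a knowledge tensor W (d_out x d_in x S) with
# S scenarios is a core G contracted with factor matrices; U_o and U_i act
# as shared subspaces, U_s as scenario-specific experts.
d_out, d_in, S = 32, 32, 4       # output dim, input dim, scenarios
r_o, r_i, r_s = 8, 8, 2          # Tucker ranks per mode

G = torch.randn(r_o, r_i, r_s)   # core tensor
U_o = torch.randn(d_out, r_o)    # shared output subspace
U_i = torch.randn(d_in, r_i)     # shared input subspace
U_s = torch.randn(S, r_s)        # scenario-specific experts

# full reconstruction: W[a,b,s] = sum_{p,q,r} G[p,q,r] U_o[a,p] U_i[b,q] U_s[s,r]
W = torch.einsum("pqr,ap,bq,sr->abs", G, U_o, U_i, U_s)

# adapter update for one scenario: a single d_out x d_in slice
def scenario_delta(s):
    return torch.einsum("pqr,ap,bq,r->ab", G, U_o, U_i, U_s[s])

x = torch.randn(5, d_in)
y = x @ scenario_delta(2).t()    # apply scenario-2 expert to features
```

Under this factorization, consolidating the shared factors while constraining only `U_s` (and the core) per scenario is one natural reading of the decoupled incremental learning strategy.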
cs.CV / 179 / 2603.14281

DC-ViT: Modulating Spatial and Channel Interactions for Multi-Channel Images

DC-ViT:调节多通道图像的空间和通道交互
Marikkar, Umar, Husain, Syed Sameed, Awais, Muhammad, Atito, Sara
Abstract
Training and evaluation in multi-channel imaging (MCI) remains challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by enabling flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, reducing the ability to preserve channel-specific semantics that are critical in MCI data. To address this, we propose Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing using Decoupled Self-Attention (DSA), which decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while allowing selective inter-channel interaction. To further exploit these enhanced channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importances. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.
Chinese Translation
多通道成像(MCI)的训练和评估仍然面临挑战,这主要是由于不同染色协议、传感器类型和采集设置导致的异构通道配置。这种异质性限制了在一般计算机视觉中常用的固定通道编码器的适用性。近期的多通道视觉变换器(MC-ViTs)通过在统一的注意力空间中共同编码来自所有通道的补丁令牌,解决了这一问题,从而实现灵活的通道输入。然而,通道间不受限制的令牌交互可能导致特征稀释,降低了在MCI数据中保持通道特定语义的能力。为了解决这一问题,我们提出了解耦视觉变换器(DC-ViT),该模型通过解耦自注意力(DSA)明确调节信息共享,将令牌更新分解为两个互补路径:建模通道内结构的空间更新和自适应整合跨通道信息的通道更新。这种解耦减轻了信息崩溃,同时允许选择性地进行通道间交互。为了进一步利用这些增强的通道特定表示,我们引入了解耦聚合(DAG),使模型能够学习任务特定的通道重要性。在三个MCI基准测试中的广泛实验表明,相较于现有的MC-ViT方法,我们的方法在性能上有一致的提升。
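The decoupled-attention idea can be sketched with two standard attention passes, one over patches within each channel and one over channels at each spatial position; the residual-sum fusion and module sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Decoupled self-attention sketch: spatial updates model intra-channel
# structure, channel-wise updates integrate cross-channel information, and
# the two pathways are fused by a simple residual sum.
class DecoupledSelfAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, C, N, D) -- batch, channels, patches per channel, dim
        B, C, N, D = tokens.shape
        s = tokens.reshape(B * C, N, D)                      # attend over patches
        s, _ = self.spatial(s, s, s)
        c = tokens.permute(0, 2, 1, 3).reshape(B * N, C, D)  # attend over channels
        c, _ = self.channel(c, c, c)
        c = c.reshape(B, N, C, D).permute(0, 2, 1, 3)
        return tokens + s.reshape(B, C, N, D) + c            # residual fusion

x = torch.randn(2, 5, 16, 64)  # 5-channel image, 16 patches per channel
out = DecoupledSelfAttention(64)(x)
```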
cs.CV / 180 / 2603.14282

Multi-Period Texture Contrast Enhancement for Low-Contrast Wafer Defect Detection and Segmentation

低对比度晶圆缺陷检测与分割的多周期纹理对比度增强
Zhang, Zihan
Abstract
Wafer defect segmentation is pivotal for semiconductor yield optimization yet remains challenged by the intrinsic conflict between microscale anomalies and highly periodic, overwhelming background textures. Existing deep learning paradigms often falter due to feature dilution during downsampling and the lack of explicit mechanisms to disentangle low-contrast defects from process-induced noise. To transcend these limitations, we propose TexWDS, a texture-aware framework that harmonizes multi-scale feature retention with frequency-domain perturbation modeling. Our methodology incorporates three strategic innovations: (1) A Multi-scale Receptive Field Reweighting strategy is introduced to mitigate aliasing effects and preserve high-frequency details of micro-defects often lost in standard pyramidal architectures. (2) The Multi-scale Unified Semantic Enhancer (MUSE) integrates local appearance with global context encoding, effectively enhancing feature discriminability in low-visibility regions. (3) Crucially, we design a plug-and-play Multi-Periodic Texture Contrast Enhancement (MPTCE) module. By modeling texture disruptions in the frequency domain, MPTCE explicitly decouples non-periodic anomalies from structured backgrounds, boosting contrast for camouflaged defects. Extensive experiments on real-world industrial datasets demonstrate that TexWDS achieves a new state-of-the-art, surpassing the baseline by 8.3% in mAP50-95 and 7.7% in recall, while reducing the false positive rate by approximately 8.6%. These results underscore the framework's robustness in handling complex periodic patterns and its suitability for high-precision manufacturing inspection.
Chinese Translation
晶圆缺陷分割在半导体产量优化中至关重要,但仍面临微观尺度异常与高度周期性的背景纹理之间的内在冲突。现有的深度学习范式往往因下采样过程中的特征稀释以及缺乏明确机制来区分低对比度缺陷与工艺诱导噪声而不尽如人意。为超越这些局限性,我们提出了TexWDS,一个纹理感知框架,旨在将多尺度特征保留与频域扰动建模相结合。我们的方法包括三项战略创新:(1) 引入多尺度感受野重标定策略,以减轻混叠效应并保持在标准金字塔架构中常常丢失的微缺陷的高频细节。(2) 多尺度统一语义增强器(MUSE)将局部外观与全局上下文编码相结合,有效提高低可视性区域中特征的可分辨性。(3) 关键的是,我们设计了一个即插即用的多周期纹理对比度增强(MPTCE)模块。通过在频域中建模纹理干扰,MPTCE 明确将非周期性异常与结构化背景解耦,从而提升隐蔽缺陷的对比度。在真实工业数据集上的广泛实验表明,TexWDS 实现了新的最先进水平,在 mAP50-95 上超越基线 8.3%,在召回率上超越 7.7%,同时将误报率降低约 8.6%。这些结果强调了该框架在处理复杂周期模式方面的鲁棒性以及其在高精度制造检测中的适用性。
cs.CV / 181 / 2603.14290

RegFormer++: An Efficient Large-Scale 3D LiDAR Point Registration Network with Projection-Aware 2D Transformer

RegFormer++:一种高效的大规模3D LiDAR点云配准网络,具有投影感知的2D变换器
Liu, Jiuming, Wang, Guangming, Liu, Zhe, Jiang, Chaokang, Li, Haoang, Liu, Mengmeng, Deng, Tianchen, Pollefeys, Marc, Yang, Michael Ying, Wang, Hesheng
Abstract
Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale LiDAR registration methods have rarely been explored. Challenges mainly arise from the huge point scale, complex point distribution, and numerous outliers within outdoor LiDAR scans. In addition, most existing registration works generally adopt a two-stage paradigm: They first find correspondences by extracting discriminative local descriptors and then leverage robust estimators (e.g. RANSAC) to filter outliers, which are highly dependent on well-designed descriptors and post-processing choices. To address these problems, we propose a novel end-to-end differential transformer network, termed RegFormer++, for large-scale point cloud alignment without requiring any further post-processing. Specifically, a hierarchical projection-aware 2D transformer with linear complexity is proposed to project raw LiDAR points onto a cylindrical surface and extract global point features, which can improve resilience to outliers due to long-range dependencies. Because we fill original 3D coordinates into 2D projected positions, our designed transformer can benefit from both high efficiency in 2D processing and accuracy from 3D geometric information. Furthermore, to effectively reduce wrong point matching, a Bijective Association Transformer (BAT) is designed, combining both cross attention and all-to-all point gathering. To improve training stability and robustness, a feature-transformed optimal transport module is also designed for regressing the final pose transformation. Extensive experiments on KITTI, NuScenes, and Argoverse datasets demonstrate that our model achieves state-of-the-art performance in terms of both accuracy and efficiency.
Chinese Translation
尽管点云配准在物体级别和室内场景中取得了显著进展,但大规模LiDAR配准方法仍然鲜有探索。主要挑战来自于户外LiDAR扫描中巨大的点云规模、复杂的点分布以及大量的离群点。此外,大多数现有的配准工作通常采用两阶段范式:首先通过提取具有区分性的局部描述符来寻找对应关系,然后利用鲁棒估计器(例如RANSAC)来过滤离群点,这高度依赖于精心设计的描述符和后处理选择。为了解决这些问题,我们提出了一种新颖的端到端差分变换器网络,称为RegFormer++,用于大规模点云对齐,无需任何进一步的后处理。具体而言,提出了一种具有线性复杂度的分层投影感知2D变换器,将原始LiDAR点投影到圆柱面上并提取全局点特征,这可以提高对离群点的鲁棒性,因其具有长距离依赖性。由于我们将原始3D坐标填充到2D投影位置,我们设计的变换器可以同时受益于2D处理的高效率和3D几何信息的准确性。此外,为了有效减少错误的点匹配,设计了一种双向关联变换器(Bijective Association Transformer, BAT),结合了交叉注意力和全对全的点聚合。为了提高训练的稳定性和鲁棒性,我们还设计了一个特征变换的最优传输模块,用于回归最终的姿态变换。在KITTI、NuScenes和Argoverse数据集上进行的广泛实验表明,我们的模型在准确性和效率方面均实现了最先进的性能。
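A minimal sketch of the projection step: each LiDAR point maps to a cylindrical-image pixel by azimuth and height, and the original 3D coordinates are stored at that pixel so downstream 2D processing retains 3D geometry. Resolution, height range, and collision handling (last point wins) are assumptions:

```python
import torch

# Cylindrical projection sketch: raw points become a (H, W, 3) image whose
# pixel values are the original 3D coordinates, combining 2D efficiency
# with 3D geometric information.
def cylindrical_project(points, H=64, W=512, z_min=-3.0, z_max=5.0):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    az = torch.atan2(y, x)                                        # [-pi, pi]
    col = ((az + torch.pi) / (2 * torch.pi) * (W - 1)).long().clamp(0, W - 1)
    row = (((z - z_min) / (z_max - z_min)).clamp(0, 1) * (H - 1)).long()
    image = torch.zeros(H, W, 3)
    image[row, col] = points                                      # keep 3D coords
    return image

pts = torch.randn(2048, 3) * torch.tensor([20.0, 20.0, 1.5])
img = cylindrical_project(pts)
```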
cs.CV / 182 / 2603.14294

Seeking Physics in Diffusion Noise

在扩散噪声中寻找物理规律
Tang, Chujun, Zhong, Lei, Ding, Fangqiang
Abstract
Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
Chinese Translation
视频扩散模型是否编码了与物理合理性相关的信号?我们探讨了预训练的扩散变换器(Diffusion Transformer, DiT)的中间去噪表示,发现物理合理和不合理的视频在不同噪声水平的中层特征空间中部分可分。这种可分性不能完全归因于视觉质量或生成器身份,表明在冻结的 DiT 特征中存在可恢复的物理相关线索。基于这一观察,我们引入了渐进轨迹选择,这是一种在推理时的策略,通过在几个中间检查点对并行去噪轨迹进行评分,使用在冻结特征上训练的轻量级物理验证器,并提前修剪低评分候选者。在 PhyGenBench 上的广泛实验表明,我们的方法在提高物理一致性的同时降低了推理成本,取得了与最佳 K 采样相当的结果,但去噪步骤显著减少。
cs.CV / 183 / 2603.14297

RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment

RL-ScanIQA:用于盲目360°图像质量评估的强化学习扫描路径
Wang, Yujia, Li, Yuyan, Liu, Jiuming, Zhang, Fang-Lue, Zheng, Xinhu, Dodgson, Neil A.
Abstract
Blind 360° image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360° content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view-then-rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL-ScanIQA, a reinforcement-learned framework for blind 360° IQA. RL-ScanIQA optimizes a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross-dataset robustness using distortion-space augmentation together with rank-consistent losses that preserve intra-image and inter-image quality orderings. Extensive experiments on three benchmarks show that RL-ScanIQA achieves superior in-dataset performance and cross-dataset generalization. Codes are available at https://github.com/wangyuji1/RLScanIQA.git.
Chinese Translation
盲目360°图像质量评估(IQA)旨在在没有原始参考的情况下预测全景图像的感知质量。与传统的平面图像不同,沉浸式环境中的360°内容在任何时刻都限制观众的视野,使得观看行为对质量感知至关重要。尽管现有的基于扫描路径的方法试图通过近似人类的观看-评分范式来建模观看行为,但它们将扫描路径生成和质量评估视为两个独立的步骤,从而阻碍了端到端的优化和任务对齐的探索。为了解决这一局限性,我们提出了RL-ScanIQA,这是一个用于盲目360°IQA的强化学习框架。RL-ScanIQA优化了一个经过PPO训练的扫描路径策略和一个质量评估器,其中策略接收基于质量的反馈,以学习与任务相关的观看策略。为了提高训练的稳定性并防止模式崩溃,我们设计了多层次的奖励,包括扫描路径多样性和赤道偏向先验。我们进一步通过失真空间增强和保持图像内部及图像间质量排序的一致性损失来增强跨数据集的鲁棒性。在三个基准上的大量实验表明,RL-ScanIQA在数据集内性能和跨数据集泛化方面均表现优越。代码可在 https://github.com/wangyuji1/RLScanIQA.git 获取。
cs.CV / 184 / 2603.14300

Show Me When and Where: Towards Referring Video Object Segmentation in the Wild

展示我何时何地:面向野外视频物体分割的研究
Gao, Mingqi, Yang, Jinyu, Luo, Jingnan, Zhen, Xiantong, Han, Jungong, Montana, Giovanni, Zheng, Feng
Abstract
Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting contains elaborately trimmed videos, with text-referred objects always appearing in all frames, which fails to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods to predict only where objects appear, with no need to show when they appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using Youtube Untrimmed videos for RVOS - YoURVOS, which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers (OMFormer) to tackle the challenges, which is characterized by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially with the increase of target-absent frames, while our OMFormer consistently performs well. Our YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.
Chinese Translation
近年来,指向视频物体分割(Referring Video Object Segmentation, RVOS)因其广泛的应用而在计算机视觉领域引起了极大的关注。现有的 RVOS 设置包含精心剪辑的视频,其中文本所指的物体总是在所有帧中出现,这并未能充分反映该任务的现实挑战。这种简化的设置要求 RVOS 方法仅需预测物体出现的位置,而无需显示物体出现的时间。在本研究中,我们引入了一种新的野外 RVOS 设置。为此,我们使用 YouTube 未剪辑视频收集了一个新的基准数据集——YoURVOS,该数据集包含 1,120 个野外视频,时长和场景数量是现有数据集的 7 倍。我们的新基准挑战 RVOS 方法不仅要展示物体出现的位置,还要展示物体出现的时间。为了设定基线,我们提出了物体级多模态变换器(Object-level Multimodal Transformers, OMFormer)来应对这些挑战,其特点是编码物体级多模态交互,以实现高效的全局时空定位。我们展示了以往的 VOS 方法在我们的 YoURVOS 基准上表现不佳,尤其是在目标缺失帧增加的情况下,而我们的 OMFormer 始终表现良好。我们的 YoURVOS 数据集提供了一个重要的基准,将推动 RVOS 方法在实际应用中的进展。
cs.CV / 185 / 2603.14301

4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding

4D同步场:用于时间场景理解的运动-语言高斯点云
Barhdadi, Mohamed Rayan, Abdaljalil, Samir, Khanbayov, Rasul, Serpedin, Erchin, Kurban, Hasan
Abstract
Current 4D representations decouple geometry, motion, and semantics: reconstruction methods discard interpretable motion structure; language-grounded methods attach semantics after motion is learned, blind to how objects move; and motion-aware methods encode dynamics as opaque per-point residuals without object-level organization. We propose 4D Synchronized Fields, a 4D Gaussian representation that learns object-factored motion in-loop during reconstruction and synchronizes language to the resulting kinematics through a per-object conditioned field. Each Gaussian trajectory is decomposed into shared object motion plus an implicit residual, and a kinematic-conditioned ridge map predicts temporal semantic variation, yielding a single representation in which reconstruction, motion, and semantics are structurally coupled and enabling open-vocabulary temporal queries that retrieve both objects and moments. On HyperNeRF, 4D Synchronized Fields achieves 28.52 dB mean PSNR, the highest among all language-grounded and motion-aware baselines, within 1.5 dB of reconstruction-only methods. On targeted temporal-state retrieval, the kinematic-conditioned field attains 0.884 mean accuracy, 0.815 mean vIoU, and 0.733 mean tIoU, surpassing 4D LangSplat (0.620, 0.433, and 0.439 respectively) and LangSplat (0.415, 0.304, and 0.262). Ablation confirms that kinematic conditioning is the primary driver, accounting for +0.45 tIoU over a static-embedding-only baseline. 4D Synchronized Fields is the only method that jointly exposes interpretable motion primitives and temporally grounded language fields from a single trained representation. Code will be released.
Chinese Translation
当前的4D表示方法将几何、运动和语义解耦:重建方法舍弃可解释的运动结构;基于语言的方法在学习运动后附加语义,忽视了物体的运动方式;而运动感知方法将动态编码为不透明的逐点残差,缺乏物体级别的组织。我们提出了4D同步场,这是一种4D高斯表示,在重建过程中学习物体分解的运动,并通过每个物体条件场将语言与生成的运动学同步。每个高斯轨迹被分解为共享的物体运动加上一个隐含残差,运动学条件的脊图预测时间语义变化,从而产生一个结构上耦合重建、运动和语义的单一表示,能够进行开放词汇的时间查询,检索物体和时刻。在HyperNeRF上,4D同步场实现了28.52 dB的平均PSNR,成为所有基于语言和运动感知基线中最高的,距离仅重建方法的结果相差1.5 dB。在针对时间状态检索中,运动学条件场达到了0.884的平均准确率、0.815的平均vIoU和0.733的平均tIoU,超越了4D LangSplat(分别为0.620、0.433和0.439)和LangSplat(分别为0.415、0.304和0.262)。消融实验确认运动学条件是主要驱动因素,相较于仅使用静态嵌入的基线提高了0.45的tIoU。4D同步场是唯一一种能够从单一训练表示中共同揭示可解释的运动原语和时间基础的语言场的方法。代码将会发布。
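The object-factored decomposition can be sketched as a shared per-frame rigid motion plus a per-point residual; the yaw-only rotation below is a deliberate simplification of whatever motion parameterization the paper actually uses:

```python
import torch

# Trajectory decomposition sketch: every Gaussian on an object follows a
# shared rigid motion (rotation + translation per frame) plus a small
# implicit per-point residual.
T, N = 16, 200                          # frames, Gaussians on one object
base = torch.randn(N, 3)                # canonical Gaussian centers
angles = torch.zeros(T, requires_grad=True)           # shared yaw per frame
trans = torch.zeros(T, 3, requires_grad=True)         # shared translation
residual = torch.zeros(T, N, 3, requires_grad=True)   # implicit residual

def positions(t):
    c, s = torch.cos(angles[t]), torch.sin(angles[t])
    R = torch.stack([
        torch.stack([c, -s, torch.zeros(())]),
        torch.stack([s,  c, torch.zeros(())]),
        torch.tensor([0.0, 0.0, 1.0]),
    ])
    return base @ R.t() + trans[t] + residual[t]

x_t = positions(5)                      # (N, 3) centers at frame 5
loss = x_t.pow(2).sum()                 # stand-in objective
loss.backward()                         # gradients reach all motion factors
```

The interpretable part of the motion lives in `angles` and `trans`, while `residual` absorbs non-rigid deviation, which is the structure the language field is then conditioned on.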
cs.CV / 186 / 2603.14304

A Physically-Grounded Attack and Adaptive Defense Framework for Real-World Low-Light Image Enhancement

基于物理的攻击与自适应防御框架用于现实世界低光照图像增强
Zhang, Tongshun, Liu, Pingping, Lei, Yuqing, Zhong, Zixuan, Zhou, Qiuzhan, Zha, Zhiyuan
Abstract
Limited illumination often causes severe physical noise and detail degradation in images. Existing Low-Light Image Enhancement (LLIE) methods frequently treat the enhancement process as a blind black-box mapping, overlooking the physical noise transformation during imaging, leading to suboptimal performance. To address this, we propose a novel LLIE approach, conceptually formulated as a physics-based attack and display-adaptive defense paradigm. Specifically, on the attack side, we establish a physics-based Degradation Synthesis (PDS) pipeline. Unlike standard data augmentation, PDS explicitly models Image Signal Processor (ISP) inversion to the RAW domain, injects physically plausible photon and read noise, and re-projects the data to the sRGB domain. This generates high-fidelity training pairs with explicitly parameterized degradation vectors, effectively simulating realistic attacks on clean signals. On the defense side, we construct a dual-layer fortified system. A noise predictor estimates degradation parameters from the input sRGB image. These estimates guide a degradation-aware Mixture of Experts (DA-MoE), which dynamically routes features to experts specialized in handling specific noise intensities. Furthermore, we introduce an Adaptive Metric Defense (AMD) mechanism, dynamically calibrating the feature embedding space based on noise severity, ensuring robust representation learning under severe degradation. Extensive experiments demonstrate that our approach offers significant plug-and-play performance enhancement for existing benchmark LLIE methods, effectively suppressing real-world noise while preserving structural fidelity. The source code is available at https://github.com/bywlzts/Attack-defense-llie.
Chinese Translation
有限的光照常常导致图像中严重的物理噪声和细节退化。现有的低光照图像增强(LLIE)方法通常将增强过程视为盲目的黑箱映射,忽视了成像过程中物理噪声的转化,从而导致次优的性能。为了解决这一问题,我们提出了一种新颖的LLIE方法,概念上构建为基于物理的攻击与显示自适应防御范式。具体而言,在攻击方面,我们建立了一个基于物理的退化合成(PDS)管道。与标准数据增强不同,PDS明确建模图像信号处理器(ISP)向RAW域的反演,注入物理上合理的光子噪声和读取噪声,并将数据重新投影到sRGB域。这生成了具有明确参数化退化向量的高保真训练对,有效模拟了对干净信号的真实攻击。在防御方面,我们构建了一个双层强化系统。噪声预测器从输入的sRGB图像中估计退化参数。这些估计值指导一个退化感知的专家混合模型(DA-MoE),该模型动态地将特征路由到专门处理特定噪声强度的专家。此外,我们引入了一种自适应度量防御(AMD)机制,根据噪声严重程度动态校准特征嵌入空间,确保在严重退化下的鲁棒表示学习。大量实验表明,我们的方法为现有基准LLIE方法提供了显著的即插即用性能提升,有效抑制现实世界中的噪声,同时保持结构保真度。源代码可在 https://github.com/bywlzts/Attack-defense-llie 获取。
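A toy version of physics-based degradation synthesis, assuming a crude gamma-only ISP inversion with shot (Poisson) and read (Gaussian) noise injected in the pseudo-RAW domain; a real ISP inversion and the paper's calibrated noise parameters are considerably more involved:

```python
import torch

# Degradation synthesis sketch: invert to a pseudo-RAW domain, inject
# physically plausible photon and read noise, then re-project to sRGB to
# form a (degraded, clean) training pair.
def degrade(srgb, photons=200.0, read_sigma=0.01, gamma=2.2):
    raw = srgb.clamp(0, 1) ** gamma                        # crude inverse gamma
    shot = torch.poisson(raw * photons) / photons          # photon (shot) noise
    noisy_raw = shot + read_sigma * torch.randn_like(raw)  # read noise
    return noisy_raw.clamp(0, 1) ** (1 / gamma)            # back to sRGB

clean = torch.rand(1, 3, 64, 64)
low_light_pair = (degrade(clean), clean)                   # training pair
```

The degradation parameters (`photons`, `read_sigma`) play the role of the explicitly parameterized degradation vectors that the defense-side noise predictor is trained to recover.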
cs.CV / 187 / 2603.14309

In-Field 3D Wheat Head Instance Segmentation From TLS Point Clouds Using Deep Learning Without Manual Labels

基于深度学习的无手动标签的地面激光扫描点云中的3D小麦穗实例分割
Medic, Tomislav, Nan, Liangliang
Abstract
3D instance segmentation for laser scanning (LiDAR) point clouds remains a challenge in many remote sensing-related domains. Successful solutions typically rely on supervised deep learning and manual annotations, and consequently focus on objects that can be well delineated through visual inspection and manual labeling of point clouds. However, for tasks with more complex and cluttered scenes, such as in-field plant phenotyping in agriculture, such approaches are often infeasible. In this study, we tackle the task of in-field wheat head instance segmentation directly from terrestrial laser scanning (TLS) point clouds. To address the problem and circumvent the need for manual annotations, we propose a novel two-stage pipeline. To obtain the initial 3D instance proposals, the first stage uses 3D-to-2D multi-view projections, the Grounded SAM pipeline for zero-shot 2D object-centric segmentation, and multi-view label fusion. The second stage uses these initial proposals as noisy pseudo-labels to train a supervised 3D panoptic-style segmentation neural network. Our results demonstrate the feasibility of the proposed approach and show performance improvements relative to Wheat3DGS, a recent alternative solution for in-field wheat head instance segmentation without manual 3D annotations based on multi-view RGB images and 3D Gaussian Splatting, showcasing TLS as a competitive sensing alternative. Moreover, the results show that both stages of the proposed pipeline can deliver usable 3D instance segmentation without manual annotations, indicating promising, low-effort transferability to other comparable TLS-based point cloud segmentation tasks.
Chinese Translation
激光扫描(LiDAR)点云的3D实例分割在许多遥感相关领域仍然是一个挑战。成功的解决方案通常依赖于监督深度学习和手动注释,因此重点关注那些可以通过视觉检查和手动标记点云进行良好划分的对象。然而,对于更复杂和杂乱的场景任务,例如农业中的田间植物表型分析,这种方法往往不可行。在本研究中,我们直接从地面激光扫描(TLS)点云中解决田间小麦穗实例分割的任务。为了解决这一问题并规避手动注释的需求,我们提出了一种新颖的两阶段管道。第一阶段使用3D到2D的多视图投影、用于零样本2D对象中心分割的Grounded SAM管道以及多视图标签融合来获取初始的3D实例提议。第二阶段使用这些初始提议作为噪声伪标签来训练一个监督的3D全景风格分割神经网络。我们的结果证明了所提方法的可行性,并显示出相对于Wheat3DGS的性能提升,后者是基于多视图RGB图像和3D高斯散点的无手动3D注释的田间小麦穗实例分割的最新替代方案,展示了TLS作为一种具有竞争力的传感替代方案。此外,结果表明,所提管道的两个阶段均能在没有手动注释的情况下提供可用的3D实例分割,表明其在其他可比的基于TLS的点云分割任务中具有良好的低努力可转移性。
cs.CV / 188 / 2603.14316

Direct Object-Level Reconstruction via Probabilistic Gaussian Splatting

通过概率高斯喷溅实现直接物体级重建
Guo, Shuai, Guo, Ao, Zhao, Junchao, Chen, Qi, Qi, Yuxiang, Li, Zechuan, Chen, Dong, Shao, Tianjia, Xu, Mingliang
Abstract
Object-level 3D reconstruction plays an important role across domains such as cultural heritage digitization, industrial manufacturing, and virtual reality. However, existing Gaussian Splatting-based approaches generally rely on full-scene reconstruction, in which substantial redundant background information is introduced, leading to increased computational and storage overhead. To address this limitation, we propose an efficient single-object 3D reconstruction method based on 2D Gaussian Splatting. By directly integrating foreground-background probability cues into Gaussian primitives and dynamically pruning low-probability Gaussians during training, the proposed method fundamentally focuses on an object of interest and improves memory and computational efficiency. Our pipeline leverages probability masks generated by YOLO and SAM to supervise probabilistic Gaussian attributes, replacing binary masks with continuous probability values to mitigate boundary ambiguity. Additionally, we propose a dual-stage filtering strategy at the start of training to suppress background Gaussians. During training, rendered probability masks are in turn employed to refine supervision and enhance boundary consistency across views. Experiments conducted on the MIP-360, T&T, and NVOS datasets demonstrate that our method exhibits strong self-correction capability in the presence of mask errors and achieves reconstruction quality comparable to standard 3DGS approaches, while requiring only approximately 1/10 of their Gaussian count. These results validate the efficiency and robustness of our method for single-object reconstruction and highlight its potential for applications requiring both high fidelity and computational efficiency.
Chinese Translation
物体级3D重建在文化遗产数字化、工业制造和虚拟现实等领域发挥着重要作用。然而,现有基于高斯喷溅的方法通常依赖于全场景重建,这引入了大量冗余的背景信息,导致计算和存储开销增加。为了解决这一限制,我们提出了一种基于2D高斯喷溅的高效单物体3D重建方法。通过将前景-背景概率线索直接整合到高斯原语中,并在训练过程中动态修剪低概率高斯,所提方法根本上聚焦于感兴趣的物体,从而提高了内存和计算效率。我们的流程利用YOLO和SAM生成的概率掩码来监督概率高斯属性,用连续概率值替代二进制掩码,以减轻边界模糊。此外,我们提出了一种双阶段过滤策略用于训练的启动,以抑制背景高斯。在训练过程中,渲染的概率掩码反过来被用来细化监督并增强视图间的边界一致性。在MIP-360、T&T和NVOS数据集上进行的实验表明,我们的方法在存在掩码错误的情况下表现出强大的自我修正能力,并且实现的重建质量可与标准3DGS方法相媲美,同时仅需约1/10的高斯数量。这些结果验证了我们的方法在单物体重建中的效率和鲁棒性,并突显了其在需要高保真度和计算效率的应用中的潜力。
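The probability-gated pruning loop can be sketched as follows: each Gaussian carries a foreground logit supervised by soft mask values, and confident background primitives are dropped during training. The threshold, parameterization, and supervision target here are assumptions:

```python
import torch

# Probability-gated pruning sketch: Gaussians with low learned foreground
# probability are removed during training so optimization concentrates on
# the object of interest.
n = 10_000
means = torch.randn(n, 3)
logit_fg = torch.zeros(n, requires_grad=True)    # learnable foreground logit

def prune(means, logit_fg, threshold=0.05):
    with torch.no_grad():
        keep = torch.sigmoid(logit_fg) > threshold   # drop confident background
    return means[keep], logit_fg.detach()[keep].requires_grad_()

# soft-mask supervision: BCE against per-Gaussian probability targets in [0,1]
target = torch.rand(n)                           # stand-in projected mask values
loss = torch.nn.functional.binary_cross_entropy_with_logits(logit_fg, target)
loss.backward()
means, logit_fg = prune(means, logit_fg)
```

Using continuous targets rather than hard 0/1 labels is what lets boundary-ambiguous Gaussians stay soft until evidence accumulates across views.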
cs.CV / 189 / 2603.14320

Early Failure Detection and Intervention in Video Diffusion Models

视频扩散模型中的早期故障检测与干预
Byung-Ki, Kwon, Lim, Sohwi, Hyeon-Woo, Nam, Ye-Bin, Moon, Oh, Tae-Hyun
Abstract
Text-to-video (T2V) diffusion models have rapidly advanced, yet generations still occasionally fail in practice, such as low text-video alignment or low perceptual quality. Since diffusion sampling is non-deterministic, it is difficult to know during inference whether a generation will succeed or fail, incurring high computational cost due to trial-and-error regeneration. To address this, we propose an early failure detection and diagnostic intervention pipeline for latent T2V diffusion models. For detection, we design a Real-time Inspection (RI) module that converts latents into intermediate video previews, enabling the use of established text-video alignment scorers for inspection in the RGB space. The RI module completes the conversion and inspection process in just 39.2ms. This is highly efficient considering that CogVideoX-5B requires 4.3s per denoising step when generating a 480p, 49-frame video on an NVIDIA A100 GPU. Subsequently, we trigger a hierarchical and early-exit intervention pipeline only when failure is predicted. Experiments on CogVideoX-5B and Wan2.1-1.3B demonstrate consistency gains on VBench with up to 2.64 times less time overhead compared to post-hoc regeneration. Our method also generalizes to a higher-capacity setting, remaining effective on Wan2.1-14B with 720p resolution and 81-frame generation. Furthermore, our pipeline is plug-and-play and orthogonal to existing techniques, showing seamless compatibility with prompt refinement and sampling guidance methods. We also provide evidence that failure signals emerge early in the denoising process and are detectable within intermediate video previews using standard vision-language evaluators.
Chinese Translation
文本到视频(T2V)扩散模型迅速发展,但在实际应用中生成仍偶尔会失败,例如文本与视频的对齐度低或感知质量差。由于扩散采样是非确定性的,因此在推理过程中很难判断生成是否会成功或失败,这导致由于反复试验而产生高昂的计算成本。为了解决这个问题,我们提出了一种针对潜在 T2V 扩散模型的早期故障检测与诊断干预流程。为了进行检测,我们设计了一个实时检查(Real-time Inspection, RI)模块,该模块将潜在表示转换为中间视频预览,从而能够利用已建立的文本与视频对齐评分器在 RGB 空间中进行检查。RI 模块在仅 39.2 毫秒内完成转换和检查过程。考虑到 CogVideoX-5B 在 NVIDIA A100 GPU 上生成 480p、49 帧视频时每个去噪步骤需要 4.3 秒,这一效率非常高。随后,仅在预测到故障时,我们会触发一个分层的早期退出干预流程。在 CogVideoX-5B 和 Wan2.1-1.3B 上的实验表明,与事后再生相比,我们的方法在 VBench 上的时间开销减少了多达 2.64 倍,且一致性得到了提升。我们的方法还可以推广到更高容量的设置,在 Wan2.1-14B 上以 720p 分辨率和 81 帧生成时仍然有效。此外,我们的流程是即插即用的,并且与现有技术正交,显示出与提示优化和采样指导方法的无缝兼容性。我们还提供了证据表明,故障信号在去噪过程中早期出现,并且可以通过标准视觉-语言评估器在中间视频预览中检测到。
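The checkpointed trajectory-pruning control flow can be sketched generically; `denoise_step`, `preview`, and `score` below are placeholders for the diffusion model, an RI-style latent-to-RGB conversion, and a text-video alignment scorer, none of which are reproduced here:

```python
import torch

# Trajectory-pruning sketch: run K denoising trajectories in parallel,
# score intermediate previews at a few checkpoints, and drop the weakest
# candidates early instead of regenerating after the fact.
def denoise_step(latents, t):            # stand-in for one diffusion step
    return latents - 0.01 * latents

def preview(latents):                    # stand-in latent -> RGB conversion
    return latents

def score(videos):                       # stand-in text-video alignment score
    return videos.flatten(1).mean(dim=1)

K, steps, checkpoints = 8, 50, {15, 30}
latents = torch.randn(K, 4, 8, 32, 32)   # K parallel trajectories

for t in range(steps):
    latents = denoise_step(latents, t)
    if t in checkpoints and latents.size(0) > 1:
        s = score(preview(latents))
        keep = s.topk(max(1, latents.size(0) // 2)).indices
        latents = latents[keep]           # early-exit the weak half

best = latents[score(preview(latents)).argmax()]
```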
cs.CV / 190 / 2603.14321

Personalized Cell Segmentation: Benchmark and Framework for Reference-Guided Cell Type Segmentation

个性化细胞分割:参考引导细胞类型分割的基准与框架
Wang, Bisheng, Cardoso, Jaime S., Wu, Lin
Abstract
Accurate cell segmentation is critical for biological and medical imaging studies. Although recent deep learning models have advanced this task, most methods are limited to generic cell segmentation, lacking the ability to differentiate specific cell types. In this work, we introduce the Personalized Cell Segmentation (PerCS) task, which aims to segment all cells of a specific type given a reference cell. To support this task, we establish a benchmark by reorganizing publicly available datasets, yielding 1,372 images and over 110,000 annotated cells. As a pioneering solution, we propose PerCS-DINO, a framework built on the DINOv2 backbone. By integrating image features and reference embeddings via a cross-attention transformer and contrastive learning, PerCS-DINO effectively segments cells matching the reference. Extensive experiments demonstrate the effectiveness of the proposed PerCS-DINO and highlight the challenges of this new task. We expect PerCS to serve as a useful testbed for advancing research in cell-based applications.
Chinese Translation
准确的细胞分割对于生物学和医学成像研究至关重要。尽管最近的深度学习模型在这一任务上取得了进展,但大多数方法仅限于通用细胞分割,缺乏区分特定细胞类型的能力。在本研究中,我们引入了个性化细胞分割(Personalized Cell Segmentation, PerCS)任务,旨在根据参考细胞对特定类型的所有细胞进行分割。为了支持这一任务,我们通过重新组织公开可用的数据集建立了一个基准,生成了1372幅图像和超过110,000个标注细胞。作为一种开创性解决方案,我们提出了PerCS-DINO框架,该框架基于DINOv2主干。通过跨注意力变换器和对比学习集成图像特征和参考嵌入,PerCS-DINO有效地分割出与参考相匹配的细胞。大量实验证明了所提出的PerCS-DINO的有效性,并突显了这一新任务的挑战。我们期望PerCS能作为推动基于细胞的应用研究的有用测试平台。
cs.CV / 191 / 2603.14323

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

医疗多模态大语言模型的失败原因是什么?对医疗图像视觉定位的研究
Liu, Guimeng, Yu, Tianze, Ebrahimkhani, Somayeh, Shawn, Lin Zhi Zheng, Ng, Kok Pin, Cheung, Ngai-Man
Abstract
Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images. Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors for medical MLLMs' under-performance. Additional experiments are included in the supplementary material.
Chinese Translation
通用多模态大语言模型(MLLMs)在广泛的视觉-语言任务中取得了令人瞩目的表现。然而,它们在医疗任务上的表现,特别是在零样本设置中(此时泛化能力至关重要),仍然不尽如人意。一个关键的研究空白是对医疗MLLMs在医疗图像解读中表现不佳的原因理解有限。在本研究中,我们首次系统性地调查了最先进的医疗MLLMs的视觉定位能力。为了将视觉定位与语义定位区分开来,我们设计了VGMED,这是一个在专家临床指导下开发的新评估数据集,明确评估医疗MLLMs的视觉定位能力。我们引入了新的定量指标,并进行了详细的定性分析。我们的研究涵盖了八个最先进的医疗MLLMs,验证了它们在临床相关图像区域中常常无法将预测结果与之对应。我们注意到,这一发现特定于医疗图像分析;相比之下,先前的研究表明,当应用于自然场景图像时,MLLMs能够在正确的图像区域中定位其预测。基于这些发现,我们提出了VGRefine,这是一种简单而有效的推理时方法,通过优化注意力分布来改善医疗环境中的视觉定位。我们的方法在六个多样化的Med-VQA基准(来自八种成像模式的超过110K VQA样本)上达到了最先进的性能,而无需额外的训练或外部专家模型。总体而言,我们的工作首次系统性地验证了不足的视觉定位是医疗MLLMs表现不佳的关键因素之一。附录中包含了额外的实验。
cs.CV / 192 / 2603.14331

AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

AvatarForcing:通过局部未来滑动窗口去噪实现一步流式对话头像生成
Cui, Liyuan, Hu, Wentao, Zhang, Wenyuan, Yang, Zesong, Shi, Fan, Liu, Xiaoqiang
Abstract
Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for real-time streaming. Our page is available at: https://cuiliyuan121.github.io/AvatarForcing/
Chinese Translation
实时对话头像生成需要低延迟和分钟级的时间稳定性。自回归(AR)强制方法支持流式推理,但受到曝光偏差的影响,导致错误在长时间的推理中累积并变得不可逆。相比之下,完整序列扩散变换器可以减轻漂移,但在实时长篇合成中仍然计算开销巨大。我们提出了AvatarForcing,这是一种一步流式扩散框架,通过对具有不同噪声水平的固定局部未来窗口进行去噪,每一步输出一个干净的块,并保持每步的恒定成本。为了稳定无限流,该方法引入了双锚时间强制:一个样式锚点重新索引RoPE,以保持相对于活动窗口的固定相对位置,并应用锚音频零填充;一个时间锚点则重用最近发出的干净块,以确保平滑过渡。通过离线ODE回填和分布匹配的两阶段流式蒸馏,实现了实时一步推理。在标准基准和一个新的400视频长篇基准上的实验显示,使用一个13亿参数的学生模型进行实时流式处理时,视觉质量和唇同步性表现出色,延迟为34毫秒/帧。我们的页面可访问: https://cuiliyuan121.github.io/AvatarForcing/
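The sliding-window mechanics can be pictured as a simple generator. This is a rough sketch under assumed interfaces, not the authors' implementation: the window holds B latent blocks at ascending noise levels; each step denoises the whole window once, emits the now-clean oldest block, and appends a fresh noisy block.

```python
# Hedged sketch of local-future sliding-window denoising with heterogeneous
# noise levels; model.denoise_window is a hypothetical one-step denoiser.
import torch

def stream_blocks(model, window, noise_levels, audio_stream):
    """window: list of B latent blocks; noise_levels: ascending sigmas, length B."""
    for audio_chunk in audio_stream:
        window = model.denoise_window(window, noise_levels, audio_chunk)  # one step
        clean_block, window = window[0], list(window[1:])   # emit the finished block
        window.append(torch.randn_like(clean_block) * noise_levels[-1])  # new noisy block
        yield clean_block
```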
cs.CV / 193 / 2603.14336

UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

UAVBench 和 UAVIT-1M:针对低空无人机视觉-语言理解的多模态大语言模型基准测试与增强
Zhan, Yang, Yuan, Yuan
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature pure real-world visual images and rich weather conditions, and involve manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs using UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M significantly addresses this gap. Our contributions pave the way for bridging the gap between current MLLMs and low-altitude UAV real-world application demands. (Project page: https://UAVBench.github.io/)
Chinese Translation
多模态大语言模型(MLLMs)在自然图像和卫星遥感图像方面取得了显著进展。然而,理解低空无人机场景仍然是一个挑战。现有数据集主要集中在少数特定的低空视觉任务上,无法全面评估 MLLMs 在现实世界低空无人机应用中的能力。因此,我们引入了 UAVBench,一个综合基准,以及 UAVIT-1M,一个大规模指令调优数据集,旨在评估和提升 MLLMs 在低空视觉-语言任务中的能力。UAVBench 包含 43 个测试单元和 966k 高质量数据样本,涵盖 10 个图像级和区域级任务。UAVIT-1M 包含约 124 万条多样化指令,涵盖 789k 多场景图像和约 2000 种空间分辨率,涉及 11 个不同任务。UAVBench 和 UAVIT-1M 特征为纯真实世界视觉图像和丰富的天气条件,并涉及人工验证以确保高质量。我们对 11 个最先进的 MLLMs 使用 UAVBench 进行的深入分析表明,开源 MLLMs 无法生成关于低空视觉内容的准确对话,落后于闭源 MLLMs。大量实验表明,在 UAVIT-1M 上对开源 MLLMs 进行微调显著弥补了这一差距。我们的贡献为弥合当前 MLLMs 与低空无人机现实应用需求之间的差距铺平了道路。(项目页面:https://UAVBench.github.io/)
cs.CV / 194 / 2603.14337

On the Nature of Attention Sink that Shapes Decoding Strategy in MLLMs

关于塑造多模态大语言模型解码策略的注意力汇聚特性
Yoo, Suho, Jang, Youngjoon, Chung, Joon Son
Abstract
Large language models and their multimodal extensions have achieved remarkable success across diverse tasks, yet the internal mechanisms that govern their reasoning behaviour remain partially understood. In particular, the attention sink, a token that attracts disproportionate attention mass, has been observed in transformer architectures, but its role is still unclear. Our goal is to understand what attention sinks represent and how they shape model behaviour during inference, rather than considering them as incidental artifacts. Through our analysis, we find that attention sink representations encode structured global information that influences the decoding process. Building on our findings, we introduce OutRo, a lightweight inference-time strategy that leverages the sink token to enhance contextual representations: (i) non-sink token representations are aligned with the sink representation in the feature space; and (ii) the sink token is allowed to attend beyond the causal constraint, facilitating information exchange with non-sink tokens. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance across representative MLLMs on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.
Chinese Translation
大型语言模型及其多模态扩展在各种任务中取得了显著成功,但支配其推理行为的内部机制仍然部分不明。特别是,注意力汇聚(attention sink)作为一种吸引不成比例注意力的标记,在变换器架构中被观察到,但其作用仍不清晰。我们的目标是理解注意力汇聚所代表的意义,以及它们如何在推理过程中塑造模型行为,而不是将其视为偶然的伪影。通过我们的分析,我们发现注意力汇聚的表示编码了影响解码过程的结构化全局信息。基于我们的发现,我们引入了OutRo,这是一种轻量级的推理时策略,利用汇聚标记增强上下文表示:(i) 非汇聚标记的表示在特征空间中与汇聚表示对齐;(ii) 允许汇聚标记超越因果约束进行关注,从而促进与非汇聚标记的信息交换。该设计在不需要额外前向传递或访问注意力图的情况下增强了推理过程。基于大量实验,OutRo在七个视频问答基准测试中持续提升了代表性多模态大语言模型的性能,并展示了强大的泛化能力,同时仅增加了1.1倍的解码开销。
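Step (i) of OutRo, aligning non-sink features with the sink representation, admits a compact sketch. The interpolation coefficient and sink index below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: nudge non-sink token features toward the sink representation
# in feature space at inference time; alpha and sink_idx are assumptions.
import torch

def align_to_sink(hidden: torch.Tensor, sink_idx: int = 0, alpha: float = 0.1):
    """hidden: (batch, seq, dim). Returns features aligned toward the sink token."""
    sink = hidden[:, sink_idx:sink_idx + 1, :]          # (batch, 1, dim)
    aligned = (1 - alpha) * hidden + alpha * sink       # feature-space interpolation
    aligned[:, sink_idx] = hidden[:, sink_idx]          # leave the sink itself intact
    return aligned
```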
cs.CV / 195 / 2603.14342

AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models

AgroNVILA:多视角农业多模态大语言模型的感知-推理解耦
Zhang, Jiarui, Hu, Junqi, Mai, Zurong, Chen, Yuhang, Lou, Shuohong, Huang, Henglian, Zhao, Lingyuan, Huang, Jianxi, Lu, Yutong, Fu, Haohuan, Zheng, Juepeng
Abstract
Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture diverse spatial topologies and scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
Chinese Translation
农业多模态推理需要在不同尺度上具备稳健的空间理解能力,从地面特写到自上而下的无人机(UAV)和卫星影像。现有的多模态大语言模型(MLLMs)存在显著的“以地面为中心”的偏差,导致在复杂农业规划过程中出现尺度混淆和逻辑漂移。为了解决这一问题,我们首次引入大规模的AgroOmni(288K),这是一个旨在捕捉现代精准农业中多样空间拓扑和尺度的多视角训练语料库。在此数据集的基础上,我们提出了AgroNVILA,这是一种采用新颖的感知-推理解耦(PRD)架构的MLLM。在感知方面,我们引入了视图条件元网络(VCMN),该网络将宏观空间上下文注入视觉标记中,以最小的计算开销解决尺度歧义。在推理方面,农业感知相对策略优化(ARPO)利用强化学习将模型的决策与专家农业逻辑对齐,防止统计捷径。大量实验表明,AgroNVILA在多高度农业推理中优于最先进的MLLMs,取得了显著的提升(+15.18%),反映出其在整体农业空间规划中的强大能力。
cs.CV / 196 / 2603.14361

BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy

BROTHER:通过异构集成正则化优化的行为识别以应对矛盾和犹豫
Pereira, Alexandre, Fernandes, Bruno, Barros, Pablo
Abstract
Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require deep contextual and temporal understanding. In this paper, we propose a highly regularized, multimodal fusion pipeline to predict A/H at the video level. We extract robust unimodal features from visual, acoustic, and linguistic data, introducing a specialized statistical text modality explicitly designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT), selecting the best-calibrated models based on validation Binary Cross-Entropy (BCE) loss. Furthermore, to optimally fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function dynamically incorporates a train-validation gap penalty (lambda) to actively suppress redundant or overfitted classifiers. Our comprehensive evaluation demonstrates that while linguistic features serve as the strongest independent predictor of A/H, our heavily regularized PSO ensemble (lambda = 0.2) effectively harnesses multimodal synergies, achieving a peak Macro F1-score of 0.7465 on the unseen test set. These results emphasize that treating ambivalence and hesitancy as a multimodal conflict, evaluated through an intelligently weighted committee, provides a robust framework for in-the-wild behavioral analysis.
Chinese Translation
在自然视频环境中识别复杂的行为状态,如矛盾和犹豫(A/H),仍然是情感计算中的一项重大挑战。与基本的面部表情不同,A/H表现为微妙的多模态冲突,要求深入的上下文和时间理解。本文提出了一种高度正则化的多模态融合管道,以在视频层面预测A/H。我们从视觉、声学和语言数据中提取稳健的单模态特征,并引入一种专门设计的统计文本模态,旨在捕捉时间性语音变化和行为线索。为了识别最有效的表示,我们在一组机器学习分类器(多层感知器(MLP)、随机森林(Random Forest)和梯度提升决策树(GBDT))中评估了15种不同的模态组合,基于验证的二元交叉熵(BCE)损失选择最优的模型。此外,为了在不对训练分布过拟合的情况下最优融合这些异构模型,我们实施了一种粒子群优化(PSO)硬投票集成。PSO适应度函数动态地纳入了训练-验证间隙惩罚(lambda),以主动抑制冗余或过拟合的分类器。我们的综合评估表明,尽管语言特征作为A/H的最强独立预测因子,但我们高度正则化的PSO集成(lambda = 0.2)有效地利用了多模态协同效应,在未见测试集上达到了0.7465的峰值宏F1分数。这些结果强调,将矛盾和犹豫视为一种多模态冲突,通过智能加权的委员会进行评估,为野外行为分析提供了一个稳健的框架。
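The gap-penalized fitness idea is straightforward to write down. Here is a hedged sketch: a candidate voting-weight vector is scored by validation macro F1 minus a lambda-weighted train-validation gap; the weighted hard-voting helper is an illustrative assumption, not the paper's exact formulation.

```python
# Hedged sketch of the PSO fitness with a train-validation gap penalty.
import numpy as np
from sklearn.metrics import f1_score

def ensemble_f1(weights, preds, y):
    """Weighted hard voting. preds: (n_models, n_samples) array of 0/1 votes."""
    votes = np.tensordot(weights, preds, axes=1)          # weighted vote totals
    final = (votes >= 0.5 * np.sum(weights)).astype(int)  # majority decision
    return f1_score(y, final, average="macro")

def pso_fitness(weights, preds_tr, y_tr, preds_va, y_va, lam=0.2):
    f1_va = ensemble_f1(weights, preds_va, y_va)
    gap = max(0.0, ensemble_f1(weights, preds_tr, y_tr) - f1_va)
    return f1_va - lam * gap   # PSO maximizes this to suppress overfitted committees
```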
cs.CV / 197 / 2603.14363

AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

AerialVLA:一种通过极简端到端控制实现无人机导航的视觉-语言-动作模型
Xu, Peng, Deng, Zhengnan, Deng, Jiayan, Gu, Zonghua, Wan, Shaohua
Abstract
Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.
Chinese Translation
无人机(UAV)的视觉-语言导航(VLN)要求在动态三维环境中进行复杂的视觉解读和持续控制。现有的分层方法依赖于密集的神谕指导或辅助物体检测器,这造成了语义差距并限制了真正的自主性。我们提出了AerialVLA,一种极简的端到端视觉-语言-动作框架,将原始视觉观测和模糊语言指令直接映射到连续的物理控制信号。首先,我们引入了一种简化的双视角感知策略,减少了视觉冗余,同时保留了前进导航和精确定位所需的基本线索,这还促进了未来的仿真到现实的转移。为了恢复真正的自主性,我们部署了一种仅基于机载传感器的模糊方向提示机制,完全消除了对密集神谕指导的依赖。最终,我们构建了一个统一的控制空间,将连续的三自由度(3-DoF)运动命令与内在的着陆信号相结合,使得代理能够在不依赖外部物体检测器的情况下实现精确着陆。在TravelUAV基准上的大量实验表明,AerialVLA在已见环境中实现了最先进的性能。此外,在未见场景中,它展现出优越的泛化能力,成功率几乎是领先基线的三倍,验证了极简的以自主为中心的范式比复杂的模块化系统捕捉到更强健的视觉-运动表征。
cs.CV / 198 / 2603.14366

Representation Alignment for Just Image Transformers is not Easier than You Think

仅图像变换器的表示对齐并不像你想的那么简单
Shin, Jaeyo, Kim, Jiwook, Shim, Hyunjung
Abstract
Representation Alignment (REPA) has emerged as a simple way to accelerate Diffusion Transformers training in latent space. At the same time, pixel-space diffusion transformers such as Just image Transformers (JiT) have attracted growing attention because they remove a dependency on a pretrained tokenizer, and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B$/16$ and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while achieving $> 2\times$ faster convergence. Finally, PixelREPA-H$/16$ achieves FID$=1.81$ and IS$=317.2$. Our code is available at https://github.com/kaist-cvml/PixelREPA.
Chinese Translation
表示对齐(REPA)作为一种加速潜在空间中扩散变换器训练的简单方法应运而生。同时,像仅图像变换器(Just image Transformers, JiT)这样的像素空间扩散变换器因其消除了对预训练标记器的依赖,从而避免了潜在扩散的重建瓶颈,受到了越来越多的关注。本文表明,REPA在JiT中可能会失败。随着训练的进行,REPA在JiT上产生了更差的FID,并且在表示空间中紧密聚集的图像子集上导致了多样性的崩溃,这些图像子集是基于ImageNet的预训练语义编码器的结果。我们将这种失败追溯到信息不对称:去噪发生在高维图像空间,而语义目标则被强烈压缩,使得直接回归成为了一种捷径目标。我们提出了PixelREPA,它转变了对齐目标,并通过结合浅层变换器适配器与部分标记掩蔽的Masked Transformer Adapter来约束对齐。PixelREPA改善了训练收敛性和最终质量。对于JiT-B$/16$,PixelREPA将FID从3.66降低到3.17,并在ImageNet $256 \times 256$上将Inception Score(IS)从275.1提高到284.6,同时实现了超过2倍的更快收敛。最后,PixelREPA-H$/16$达到了FID$=1.81$和IS$=317.2$。我们的代码可在https://github.com/kaist-cvml/PixelREPA获取。
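The masked-adapter alignment objective can be sketched compactly. This is a simplified outline, not the released code: only unmasked tokens are projected through a shallow adapter and regressed onto frozen semantic-encoder features, which weakens the direct-regression shortcut the paper identifies.

```python
# Hedged sketch of alignment through a shallow adapter with partial token
# masking; the mask ratio and cosine objective are illustrative assumptions.
import torch
import torch.nn.functional as F

def masked_alignment_loss(diff_tokens, target_tokens, adapter, mask_ratio=0.5):
    """diff_tokens, target_tokens: (batch, n_tokens, dim)."""
    n = diff_tokens.shape[1]
    keep = torch.rand(n, device=diff_tokens.device) > mask_ratio  # random token mask
    pred = adapter(diff_tokens[:, keep])                # shallow transformer adapter
    tgt = target_tokens[:, keep].detach()               # frozen semantic targets
    return 1 - F.cosine_similarity(pred, tgt, dim=-1).mean()
```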
cs.CV / 199 / 2603.14367

HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

HomeGuard:基于视觉语言模型的具身安全保障,用于识别家庭任务中的情境风险
Lu, Xiaoya, Zhou, Yijin, Chen, Zeren, Wang, Ruocheng, Sima, Bingrui, Zhou, Enshen, Sheng, Lu, Liu, Dongrui, Shao, Jing
Abstract
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safety trajectory generation. Code and data are released under https://github.com/AI45Lab/HomeGuard
Chinese Translation
视觉语言模型(VLMs)使具身代理能够执行复杂指令,但它们仍然容易受到情境安全风险的影响,在这些情况下,良性的指令由于微妙的环境状态而变得危险。现有的安全保障措施往往显得不足。基于规则的方法在物体密集场景中缺乏可扩展性,而依赖提示工程的基于模型的方法则存在感知不集中,导致风险遗漏或幻觉。为了解决这个问题,我们提出了一种与架构无关的安全保障机制,采用上下文引导的思维链(Context-Guided Chain-of-Thought, CG-CoT)。该机制将风险评估分解为主动感知,依次将注意力锚定在交互目标和相关空间邻域上,随后基于这些视觉证据进行语义判断。我们通过一个精心策划的基础数据集和利用过程奖励的强化微调(Reinforcement Fine-Tuning, RFT)两阶段训练策略来支持这种方法,以确保精确的中间基础。实验表明,我们的模型HomeGuard显著提高了安全性,与基础模型相比,风险匹配率提高了超过30%,同时减少了过度安全性。除了危险检测,生成的视觉锚点还作为下游规划者的可操作空间约束,促进明确的碰撞避免和安全轨迹生成。代码和数据已发布在 https://github.com/AI45Lab/HomeGuard
cs.CV / 200 / 2603.14375

The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics

运动的脉动:从视觉动态中测量物理帧率
Gao, Xiangbo, Wu, Mingyang, Yang, Siyuan, Yu, Jiongze, Taghavi, Pardis, Lin, Fangzhou, Tu, Zhengzhong
Abstract
While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.
Chinese Translation
尽管近期的生成视频模型在视觉现实主义方面取得了显著进展,并被探索作为世界模型,但真正的物理模拟需要同时掌握空间和时间。目前的模型能够生成视觉上平滑的运动学,但缺乏可靠的内部运动脉冲来将这些运动固定在一致的现实时间尺度上。这种时间模糊性源于对具有截然不同现实速度的视频进行不加区分的训练,迫使它们进入标准化的帧率。这导致了我们所称的计时幻觉(chronometric hallucination):生成的序列表现出模糊、不稳定和不可控的物理运动速度。为了解决这个问题,我们提出了视觉计时器(Visual Chronometer),这是一种直接从输入视频的视觉动态中恢复物理帧每秒(Physical Frames Per Second, PhyFPS)的预测器。通过受控的时间重采样进行训练,我们的方法估计了运动本身所暗示的真实时间尺度,绕过了不可靠的元数据。为了系统地量化这个问题,我们建立了两个基准,PhyFPS-Bench-Real 和 PhyFPS-Bench-Gen。我们的评估揭示了一个严峻的现实:最先进的视频生成器在PhyFPS对齐和时间稳定性方面存在严重问题。最后,我们展示了应用PhyFPS修正显著提高了人类感知的AI生成视频的自然性。我们的项目页面是 https://xiangbogaobarry.github.io/Visual_Chronometer/。
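The controlled-resampling supervision admits a one-function illustration. A minimal sketch, assuming clips with known capture FPS: striding a source clip by a known factor while playback stays standardized makes the implied physical frame rate an exact regression label.

```python
# Hedged sketch of controlled temporal resampling for PhyFPS supervision.
import numpy as np

def resample_clip(frames: np.ndarray, src_fps: float, stride: int):
    """frames: (T, H, W, C). Keeping every stride-th frame means the visible
    motion now implies a physical capture rate of src_fps / stride."""
    clip = frames[::stride]
    phy_fps = src_fps / stride     # regression target for the PhyFPS predictor
    return clip, phy_fps
```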
cs.CV / 201 / 2603.14377

LoCAtion: Long-time Collaborative Attention Framework for High Dynamic Range Video Reconstruction

LoCAtion:用于高动态范围视频重建的长期协作注意力框架
Zhang, Qianyu, Zheng, Bolun, Zhu, Lingyu, Huang, Aiai, Li, Zongpeng, Wang, Shiqi
Abstract
Prevailing High Dynamic Range (HDR) video reconstruction methods are fundamentally trapped in a fragile alignment-and-fusion paradigm. While explicit spatial alignment can successfully recover fine details in controlled environments, it becomes a severe bottleneck in unconstrained dynamic scenes. By forcing rigid alignment across unpredictable motions and varying exposures, these methods inevitably translate registration errors into severe ghosting artifacts and temporal flickering. In this paper, we rethink this conventional prerequisite. Recognizing that explicit alignment is inherently vulnerable to real-world complexities, we propose LoCAtion, a Long-time Collaborative Attention framework that reformulates HDR video generation from a fragile spatial warping task into a robust, alignment-free collaborative feature routing problem. Guided by this new formulation, our architecture explicitly decouples the highly entangled reconstruction task. Rather than struggling to rigidly warp neighboring frames, we anchor the scene on a continuous medium-exposure backbone and utilize collaborative attention to dynamically harvest and inject reliable irradiance cues from unaligned exposures. Furthermore, we introduce a learned global sequence solver. By leveraging bidirectional context and long-range temporal modeling, it propagates corrective signals and structural features across the entire sequence, inherently enforcing whole-video coherence and eliminating jitter. Extensive experiments demonstrate that LoCAtion achieves state-of-the-art visual quality and temporal stability, offering a highly competitive balance between accuracy and computational efficiency.
Chinese Translation
现有的高动态范围(HDR)视频重建方法在根本上陷入了脆弱的对齐与融合范式。虽然显式的空间对齐可以在受控环境中成功恢复细节,但在不受限制的动态场景中,这成为了一个严重的瓶颈。通过强制在不可预测的运动和变化的曝光下进行刚性对齐,这些方法不可避免地将配准误差转化为严重的重影伪影和时间闪烁。在本文中,我们重新思考了这一传统前提。认识到显式对齐本质上容易受到现实世界复杂性的影响,我们提出了LoCAtion,一个长期协作注意力框架,将HDR视频生成从脆弱的空间扭曲任务重新构建为一个稳健的、无对齐的协作特征路由问题。在这一新框架的指导下,我们的架构显式地解耦了高度纠缠的重建任务。我们不再努力将相邻帧进行刚性扭曲,而是将场景锚定在一个连续的中等曝光主干上,并利用协作注意力动态地收集和注入来自未对齐曝光的可靠辐照度线索。此外,我们引入了一个学习的全局序列求解器。通过利用双向上下文和长范围时间建模,它在整个序列中传播校正信号和结构特征,内在地强制实现整个视频的一致性并消除抖动。大量实验表明,LoCAtion在视觉质量和时间稳定性方面达到了最先进的水平,提供了准确性与计算效率之间的高度竞争平衡。
cs.CV / 202 / 2603.14382

StAR: Segment Anything Reasoner

StAR:全景推理分割器
Yun, Seokju, Lee, Dongheon, Bae, Noori, Jun, Jaesung, Cho, Chanseul, Ro, Youngmin
Abstract
As AI systems are being integrated more rapidly into diverse and complex real-world environments, the ability to perform holistic reasoning over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base model. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives, including the parameter-tuning scheme, reward functions, learning strategies, and answer format, and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmarks, we also construct ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model's latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.
Chinese Translation
随着人工智能系统越来越快速地融入多样且复杂的现实环境,针对隐式查询和图像进行整体推理以定位目标的能力变得愈发重要。然而,近期的推理分割方法未能充分挖掘基础模型的视觉推理能力。在本研究中,我们提出了全景推理分割器(Segment Anything Reasoner, StAR),这是一个全面的框架,从多个角度优化设计空间,包括参数调优方案、奖励函数、学习策略和答案格式,并在近期基准上取得了显著的改进。此外,我们首次成功地将并行测试时间扩展引入到分割任务中,进一步推动了性能边界。为了扩展现有基准所涵盖的推理范围和深度,我们还构建了ReasonSeg-X,该数据集紧凑地定义了推理类型,并包含需要更深层次推理的样本。利用该数据集,我们采用扩展回滚选择性调优的方法训练StAR,以激活基础模型的潜在推理能力,并建立了一个严格的基准,以系统性、细致地评估先进方法。在仅使用5000个训练样本的情况下,StAR在广泛的基准测试中相较于其基础对应模型取得了显著的提升,证明了我们的方法有效地将潜在的推理能力激发出来。
cs.CV / 203 / 2603.14409

PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis

PGcGAN:用于人类步态合成的病理步态条件生成对抗网络
Chandrasekaran, Mritula, Kachole, Sanket, Francik, Jarek, Makris, Dimitrios
Abstract
Pathological gait analysis is constrained by limited and variable clinical datasets, which restrict the modeling of diverse gait impairments. To address this challenge, we propose a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that synthesises pathology-specific gait sequences directly from observed 3D pose keypoint trajectory data. The framework incorporates one-hot encoded pathology labels within both the generator and discriminator, enabling controlled synthesis across six gait categories. The generator adopts a conditional autoencoder architecture trained with adversarial and reconstruction objectives to preserve structural and temporal gait characteristics. Experiments on the Pathological Gait Dataset demonstrate strong alignment between real and synthetic sequences through PCA and t-SNE analyses, visual kinematic inspection, and downstream classification tasks. Augmenting real data with synthetic sequences improved pathological gait recognition across GRU, LSTM, and CNN models, indicating that pathology-conditioned gait synthesis can effectively support data augmentation in pathological gait analysis.
Chinese Translation
病理步态分析受到有限且变化多端的临床数据集的限制,这限制了对多样化步态障碍的建模。为了解决这一挑战,我们提出了一种病理步态条件生成对抗网络(PGcGAN),该网络直接从观察到的3D姿态关键点轨迹数据合成特定病理的步态序列。该框架在生成器和判别器中都引入了独热编码的病理标签,使得在六种步态类别之间进行受控合成成为可能。生成器采用条件自编码器架构,并通过对抗和重构目标进行训练,以保持步态的结构和时间特征。对病理步态数据集的实验通过主成分分析(PCA)和t-SNE分析、视觉运动学检查以及下游分类任务,展示了真实序列与合成序列之间的强对齐。用合成序列增强真实数据提高了GRU、LSTM和CNN模型的病理步态识别能力,表明病理条件下的步态合成可以有效支持病理步态分析中的数据增强。
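One-hot pathology conditioning is a standard conditional-GAN pattern. Below is a hedged sketch of a generator in that style; the layer widths, pose dimensionality (e.g. 17 joints times 3 coordinates), and MLP body are illustrative assumptions rather than the paper's architecture.

```python
# Hedged sketch: concatenate a one-hot pathology label with the pose input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    def __init__(self, pose_dim=51, n_classes=6, hidden=256):
        super().__init__()
        self.n_classes = n_classes
        self.net = nn.Sequential(
            nn.Linear(pose_dim + n_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),          # synthesized keypoint frame
        )

    def forward(self, pose, labels):
        onehot = F.one_hot(labels, num_classes=self.n_classes).float()
        return self.net(torch.cat([pose, onehot], dim=-1))
```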
cs.CV / 204 / 2603.14412

G-ZAP: A Generalizable Zero-Shot Framework for Arbitrary-Scale Pansharpening

G-ZAP:一种用于任意尺度全色锐化的可泛化零样本框架
Yang, Zhiqi, Yin, Shan, Liang, Jingze, Deng, Liang-Jian
Abstract
Pansharpening aims to fuse a high-resolution panchromatic (PAN) image and a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) image. Recent deep models have achieved strong performance, yet they typically rely on large-scale pretraining and often generalize poorly to unseen real-world image pairs. Prior zero-shot approaches improve real-scene generalization but require per-image optimization, hindering weight reuse, and the above methods are usually limited to a fixed scale. To address this issue, we propose G-ZAP, a generalizable zero-shot framework for arbitrary-scale pansharpening, designed to handle cross-resolution, cross-scene, and cross-sensor generalization. G-ZAP adopts a feature-based implicit neural representation (INR) fusion network as the backbone and introduces a multi-scale, semi-supervised training scheme to enable robust generalization. Extensive experiments on multiple real-world datasets show that G-ZAP achieves state-of-the-art results under PAN-scale fusion in both visual quality and quantitative metrics. Notably, G-ZAP supports weight reuse across image pairs while remaining competitive with per-pair retraining, demonstrating strong potential for efficient real-world deployment.
Chinese Translation
全色锐化旨在融合高分辨率的全色(PAN)图像和低分辨率的多光谱(LRMS)图像,以生成高分辨率的多光谱(HRMS)图像。近期的深度模型已取得了显著的性能,但它们通常依赖于大规模的预训练,并且在未见过的真实图像对上通常泛化能力较差。先前的零样本方法提高了真实场景的泛化能力,但需要对每个图像进行优化,限制了权重的重用,并且上述方法通常局限于固定的尺度。为了解决这个问题,我们提出了G-ZAP,一种可泛化的零样本框架,旨在处理任意尺度的全色锐化,能够应对跨分辨率、跨场景和跨传感器的泛化。G-ZAP采用基于特征的隐式神经表示(INR)融合网络作为骨干,并引入多尺度的半监督训练方案,以实现稳健的泛化。在多个真实世界数据集上的广泛实验表明,G-ZAP在PAN尺度融合下的视觉质量和定量指标上均达到了最先进的结果。值得注意的是,G-ZAP支持跨图像对的权重重用,同时保持与逐对重新训练相当的竞争力,展示了在真实世界中高效部署的强大潜力。
cs.CV / 205 / 2603.14416

Histo-MExNet: A Unified Framework for Real-World, Cross-Magnification, and Trustworthy Breast Cancer Histopathology

Histo-MExNet:一个统一框架用于真实世界、跨放大倍数和可信赖的乳腺癌组织病理学
Taufika, Enam Ahmed, Arafatha, Md Ahasanul, Ghoshb, Abhijit Kumar, Rezab, Md. Tanzim, Alamc, Md Ashad
Abstract
Accurate and reliable histopathological image classification is essential for breast cancer diagnosis. However, many deep learning models remain sensitive to magnification variability and lack interpretability. To address these challenges, we propose Histo-MExNet, a unified framework designed for scale-invariant and uncertainty-aware classification. The model integrates DenseNet, ConvNeXt, and EfficientNet backbones within a gated multi-expert architecture, incorporates a prototype learning module for example-driven interpretability, and applies physics-informed regularization to enforce morphology preservation and spatial coherence during feature learning. Monte Carlo Dropout is used to quantify predictive uncertainty. On the BreaKHis dataset, Histo-MExNet achieves 96.97% accuracy under multi-magnification training and demonstrates improved generalization to unseen magnification levels compared to single-expert models, while uncertainty estimation helps identify out-of-distribution samples and reduce overconfident errors, supporting a balanced combination of accuracy, robustness, and interpretability for clinical decision support.
Chinese Translation
准确可靠的组织病理图像分类对于乳腺癌诊断至关重要。然而,许多深度学习模型对放大倍数的变化仍然敏感,并且缺乏可解释性。为了解决这些挑战,我们提出了Histo-MExNet,一个旨在实现尺度不变和不确定性感知分类的统一框架。该模型在一个门控多专家架构中集成了DenseNet、ConvNeXt和EfficientNet主干,结合了原型学习模块以实现示例驱动的可解释性,并应用物理信息正则化以在特征学习过程中强制保持形态和空间一致性。蒙特卡洛Dropout被用来量化预测不确定性。在BreaKHis数据集上,Histo-MExNet在多放大倍数训练下实现了96.97%的准确率,并且相比单专家模型在未见放大倍数水平上表现出更好的泛化能力,而不确定性估计有助于识别分布外样本并减少过度自信的错误,从而支持临床决策支持中的准确性、鲁棒性和可解释性的平衡组合。
cs.CV / 206 / 2603.14418

Deep EM with Hierarchical Latent Label Modelling for Multi-Site Prostate Lesion Segmentation

基于层次潜在标签建模的深度期望最大化方法用于多站点前列腺病变分割
Yan, Wen, Wang, Yipei, Huang, Shiqi, Thorley, Natasha, Emberton, Mark, Stavrinides, Vasilis, Hu, Yipeng, Barratt, Dean
Abstract
Label variability is a major challenge for prostate lesion segmentation. In multi-site datasets, annotations often reflect centre-specific contouring protocols, causing segmentation networks to overfit to local styles and generalise poorly to unseen sites at inference. We treat each observed annotation as a noisy observation of an underlying latent 'clean' lesion mask, and propose a hierarchical expectation-maximisation (HierEM) framework that alternates between: (1) inferring a voxel-wise posterior distribution over the latent mask, and (2) training a CNN using this posterior as a soft target while estimating site-specific sensitivity and specificity under a hierarchical prior. This hierarchical prior decomposes label quality into a global mean with site- and case-level deviations, reducing site-specific bias by penalising the likelihood term contributed only by site deviations. Experiments on three cohorts demonstrate that the proposed hierarchical EM framework enhances cross-site generalisation compared to state-of-the-art methods. For pooled-dataset evaluation, the per-site mean DSC ranges from 29.50% to 39.69%; for leave-one-site-out generalisation, it ranges from 27.91% to 32.67%, yielding statistically significant improvements over comparison methods (p<0.039). The method also produces interpretable per-site latent label-quality estimates (sensitivity alpha ranges from 31.5% to 47.3% at specificity beta approximately 0.99), supporting post-hoc analyses of cross-site annotation variability. These results indicate that explicitly modelling site-dependent annotation can improve cross-site generalisation.
Chinese Translation
标签变异性是前列腺病变分割的一大挑战。在多站点数据集中,标注通常反映特定中心的轮廓协议,导致分割网络过拟合于局部风格,并在推断时对未见站点的泛化能力较差。我们将每个观察到的标注视为潜在“干净”病变掩膜的噪声观测,并提出了一种层次期望最大化(HierEM)框架,该框架在以下两个步骤之间交替进行:(1) 推断潜在掩膜的体素级后验分布,(2) 使用该后验作为软目标训练卷积神经网络(CNN),并在层次先验下估计特定站点的灵敏度和特异性。该层次先验将标签质量分解为一个全局均值及站点和病例级的偏差,通过惩罚仅由站点偏差贡献的似然项来减少站点特定的偏倚。对三个队列的实验表明,与最先进的方法相比,所提出的层次EM框架增强了跨站点的泛化能力。对于合并数据集的评估,每个站点的平均Dice相似系数(DSC)范围为29.50%至39.69%;对于留一站点外的泛化,其范围为27.91%至32.67%,在统计上显著优于比较方法(p<0.039)。该方法还生成可解释的每个站点潜在标签质量估计(灵敏度α范围为31.5%至47.3%,特异性β接近0.99),支持对跨站点标注变异性的事后分析。这些结果表明,明确建模站点依赖的标注可以改善跨站点的泛化能力。
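The E-step has a closed form worth spelling out. A compact sketch with a single annotator model (the hierarchical site/case decomposition is omitted): combine the network's foreground probability p with the observed label y under sensitivity alpha and specificity beta to obtain the posterior soft target.

```python
# Hedged sketch of the E-step posterior over the latent clean label z.
import numpy as np

def posterior_clean_label(p, y, alpha, beta, eps=1e-8):
    """p, y: per-voxel arrays; alpha = P(y=1|z=1), beta = P(y=0|z=0)."""
    like_z1 = np.where(y == 1, alpha, 1 - alpha)   # P(y | latent z = 1)
    like_z0 = np.where(y == 1, 1 - beta, beta)     # P(y | latent z = 0)
    num = like_z1 * p
    return num / (num + like_z0 * (1 - p) + eps)   # soft target for the M-step
```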
cs.CV / 207 / 2603.14426

GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos

GenState-AI:面向AI生成视频的文本到视频检索的状态感知数据集
Li, Minghan, Chen, Tongna, Lv, Tianrui, Zhang, Yishuai, An, Suchao, Zhou, Guodong
Abstract
Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on huggingface.co.
Chinese Translation
现有的文本到视频检索基准主要基于真实世界的视频,这些视频中许多语义信息可以通过单帧推断出来,因此时序推理和明确的最终状态定位的评估不够充分。我们引入了GenState-AI,这是一个以受控状态转换为中心的AI生成基准,其中每个查询与一个主视频、一个仅在决定性最终状态上有所不同的时序硬负样本以及一个内容替换的语义硬负样本配对,能够细致诊断超越外观匹配的时序与语义混淆。使用Wan2.2-TI2V-5B,我们生成了短小的剪辑,其意义依赖于位置、数量和物体关系的精确变化,为状态感知检索提供了可控的评估条件。我们评估了两个具有代表性的基于MLLM的基准,并观察到一致且可解释的失败模式:这两者常常将主视频与时序硬负样本混淆,并过于偏好时序上合理但最终状态不正确的剪辑,表明它们对决定性最终状态证据的基础性依赖不足,同时对语义替代的敏感性相对较低。我们进一步引入基于三元组的诊断分析,包括相对顺序统计和跨转换类别的分解,以明确时序与语义失败的来源。GenState-AI为状态感知、对时间和语义敏感的文本到视频检索提供了一个针对性的测试平台,并将发布在huggingface.co上。
cs.CV / 208 / 2603.14435

End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

端到端时空变换器用于实时4D人机交互重建
Zhang, Haoyu, Zhai, Wei, Yang, Yuhang, Cao, Yang, Zha, Zheng-Jun
Abstract
Monocular 4D human-object interaction (HOI) reconstruction - recovering a moving human and a manipulated object from a single RGB video - remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, leading to high inference latency, failing to meet real-time requirements, and susceptibility to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO operates at an inference speed of 31.5 FPS on a single RTX 4090 GPU, achieving a >600x speedup over prior optimization-based methods while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/
Chinese Translation
单目4D人机交互(HOI)重建——从单个RGB视频中恢复移动的人和被操控的物体——由于深度模糊和频繁遮挡而仍然具有挑战性。现有方法通常依赖于多阶段管道或迭代优化,导致高推理延迟,无法满足实时要求,并且容易出现误差累积。为了解决这些局限性,我们提出了THO,一种端到端的时空变换器,能够从给定的视频和3D模板中以前向方式预测人类运动和协调的物体运动。THO通过利用时空HOI元组先验来实现这一目标。空间先验利用接触区域的接近性,从人类线索中推断被遮挡物体的特征,而时间先验则捕捉跨帧的运动学相关性,以细化物体表示并强制物理一致性。大量实验表明,THO在单个RTX 4090 GPU上以31.5 FPS的推理速度运行,相比于先前基于优化的方法实现了超过600倍的加速,同时提高了重建精度和时间一致性。项目页面可访问:https://nianheng.github.io/THO-project/
cs.CV / 209 / 2603.14452

Uni-MDTrack: Learning Decoupled Memory and Dynamic States for Parameter-Efficient Visual Tracking in All Modality

Uni-MDTrack:学习解耦记忆和动态状态以实现各模态的参数高效视觉跟踪
Cai, Wenrui, Lu, Zhenyi, Li, Yuzhe, Feng, Yongchao, Zhang, Jinqing, Liu, Qingjie, Wang, Yunhong
Abstract
With the advent of Transformer-based one-stream trackers that possess strong capability in inter-frame relation modeling, recent research has increasingly focused on how to introduce spatio-temporal context. However, most existing methods rely on a limited number of historical frames, which not only leads to insufficient utilization of the context, but also inevitably increases the length of input and incurs prohibitive computational overhead. Methods that query an external memory bank, on the other hand, suffer from inadequate fusion between the retrieved spatio-temporal features and the backbone. Moreover, using discrete historical frames as context overlooks the rich dynamics of the target. To address the issues, we propose Uni-MDTrack, which consists of two core components: Memory-Aware Compression Prompt (MCP) module and Dynamic State Fusion (DSF) module. MCP effectively compresses rich memory features into memory-aware prompt tokens, which deeply interact with the input throughout the entire backbone, significantly enhancing the performance while maintaining a stable computational load. DSF complements the discrete memory by capturing the continuous dynamic, progressively introducing the updated dynamic state features from shallow to deep layers, while also preserving high efficiency. Uni-MDTrack also supports unified tracking across RGB, RGB-D/T/E, and RGB-Language modalities. Experiments show that in Uni-MDTrack, training only the MCP, DSF, and prediction head, keeping the proportion of trainable parameters around 30%, yields substantial performance gains, achieves state-of-the-art results on 10 datasets spanning five modalities. Furthermore, both MCP and DSF exhibit excellent generality, functioning as plug-and-play components that can boost the performance of various baseline trackers, while significantly outperforming existing parameter-efficient training approaches.
Chinese Translation
随着基于Transformer的一流跟踪器的出现,它们在帧间关系建模方面具备强大的能力,近期的研究越来越关注如何引入时空上下文。然而,大多数现有方法依赖于有限数量的历史帧,这不仅导致上下文利用不足,还不可避免地增加了输入长度并带来高昂的计算开销。另一方面,查询外部记忆库的方法在检索的时空特征与主干网络之间的融合不足。此外,使用离散历史帧作为上下文忽视了目标的丰富动态。为了解决这些问题,我们提出了Uni-MDTrack,该方法由两个核心组件组成:记忆感知压缩提示(Memory-Aware Compression Prompt, MCP)模块和动态状态融合(Dynamic State Fusion, DSF)模块。MCP有效地将丰富的记忆特征压缩为记忆感知提示令牌,这些令牌在整个主干网络中与输入深度交互,显著提升了性能,同时保持稳定的计算负载。DSF通过捕捉连续动态来补充离散记忆,逐步将更新的动态状态特征从浅层引入到深层,同时保持高效率。Uni-MDTrack还支持RGB、RGB-D/T/E和RGB-语言模态的统一跟踪。实验表明,在Uni-MDTrack中,仅训练MCP、DSF和预测头,保持可训练参数比例在30%左右,便可获得显著的性能提升,在涵盖五种模态的10个数据集上实现了最先进的结果。此外,MCP和DSF均表现出优异的通用性,作为即插即用组件能够提升各种基线跟踪器的性能,同时显著超越现有的参数高效训练方法。
cs.CV / 210 / 2603.14468

LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

LongVidSearch:用于长视频中的多跳证据检索规划的代理基准
Yu, Rongyi, Duan, Chenyuan, Zhang, Wentao
Abstract
Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent's ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet remaining below 50 %, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
Chinese Translation
长视频问答(Long-Video QA)越来越依赖于代理工具的使用,以从长视频中检索证据。在现实环境中,这一过程通常需要多跳检索,代理必须迭代地收集多个不连续的证据片段。然而,现有的长视频基准大多是静态的:它们很少强制执行严格的多跳检索,并且通常缺乏标准化的证据访问接口,这使得难以区分检索规划中的失败与答案生成中的失败。为了解决这一问题,我们引入了LongVidSearch,这是一个用于在标准化访问约束下评估长视频中代理多跳证据检索规划的基准。LongVidSearch 强制执行检索必要性:一个 Hop-k 问题需要恰好 k 个必要的证据片段,移除任何单个片段都会使问题无法解决。该基准包含 3,000 个问题,涵盖 447 个长视频(平均长度 26 分钟),涉及四个推理类别:状态变更、因果推理、全局总结和视觉跟踪,具有 2 跳、3 跳和 4 跳的证据要求。为了确保公平和可控的评估,所有代理通过统一的工具接口与 LongVidSearch 进行交互,该接口固定了检索后端,并隔离了代理制定查询和规划迭代检索的能力。除了答案准确性外,我们还测量工具调用成本,以分析在相同访问条件下的准确性与效率的权衡。我们使用三位评审的多数投票评估 VideoAgent 风格的 QA 代理,结合多个主干大型语言模型(LLMs)。GPT-5 达到了最高准确率(42.43),超越了 Gemini 3 Pro(30.97)和 GPT-4o(19.20),但仍未超过 50%,突显了多跳检索规划的困难。使用黄金证据片段时,性能接近完美,确认了检索规划是主要瓶颈。
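The standardized evaluation loop is simple to outline. A hedged sketch under illustrative interfaces (`plan_query`, `search`, `answer` are hypothetical names): the agent iteratively formulates queries against the fixed retrieval backend, and tool calls are counted to expose the accuracy-efficiency trade-off.

```python
# Hedged sketch of agentic multi-hop evidence retrieval under a fixed tool API.
def run_agent(agent, tool, question, max_calls=10):
    evidence, calls = [], 0
    while calls < max_calls:
        query = agent.plan_query(question, evidence)   # formulate the next hop
        if query is None:                              # agent judges evidence complete
            break
        evidence.append(tool.search(query))            # fixed retrieval backend
        calls += 1
    return agent.answer(question, evidence), calls
```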
cs.CV / 211 / 2603.14475

Wi-Spike: A Low-power WiFi Human Multi-action Recognition Model with Spiking Neural Networks

Wi-Spike:一种基于脉冲神经网络的低功耗WiFi人类多动作识别模型
Zhang, Nengbo, Ying, Yao, Wang, Lu, Wu, Kaishun, Ma, Jieming, Luo, Fei
Abstract
WiFi-based human action recognition (HAR) has gained significant attention due to its non-intrusive and privacy-preserving nature. However, most existing WiFi sensing models predominantly focus on improving recognition accuracy, while issues of power consumption and energy efficiency remain insufficiently discussed. In this work, we present Wi-Spike, a bio-inspired spiking neural network (SNN) framework for efficient and accurate action recognition using WiFi channel state information (CSI) signals. Specifically, leveraging the event-driven and low-power characteristics of SNNs, Wi-Spike introduces spiking convolutional layers for spatio-temporal feature extraction and a novel temporal attention mechanism to enhance discriminative representation. The extracted features are subsequently encoded and classified through spiking fully connected layers and a voting layer. Comprehensive experiments on three benchmark datasets (NTU-Fi-HAR, NTU-Fi-HumanID, and UT-HAR) demonstrate that Wi-Spike achieves competitive accuracy in single-action recognition and superior performance in multi-action recognition tasks. As for energy consumption, Wi-Spike reduces the energy cost by at least half compared with other methods, while still achieving 95.83% recognition accuracy in human activity recognition. More importantly, Wi-Spike establishes a new state-of-the-art in WiFi-based multi-action HAR, offering a promising solution for real-time, energy-efficient edge sensing applications.
Chinese Translation
基于WiFi的人类动作识别(HAR)因其非侵入性和隐私保护特性而受到广泛关注。然而,现有的大多数WiFi感知模型主要集中在提高识别准确性上,而对功耗和能效问题的讨论则相对不足。在本研究中,我们提出了Wi-Spike,一种生物启发的脉冲神经网络(SNN)框架,用于通过WiFi信道状态信息(CSI)信号进行高效且准确的动作识别。具体而言,Wi-Spike利用SNN的事件驱动和低功耗特性,引入了脉冲卷积层用于时空特征提取,并采用了一种新颖的时间注意机制以增强区分性表示。提取的特征随后通过脉冲全连接层和投票层进行编码和分类。在三个基准数据集(NTU-Fi-HAR、NTU-Fi-HumanID和UT-HAR)上的全面实验表明,Wi-Spike在单动作识别中实现了具有竞争力的准确性,并在多动作识别任务中表现优越。就能耗而言,Wi-Spike的能耗成本至少减少了一半,与其他方法相比,在人类活动识别中仍实现了95.83%的识别准确率。更重要的是,Wi-Spike在基于WiFi的多动作HAR中建立了新的最先进水平,为实时、节能的边缘感知应用提供了有前景的解决方案。
cs.CV / 212 / 2603.14482

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

V-JEPA 2.1:解锁视频自监督学习中的密集特征
Mur-Labadia, Lorenzo, Muckley, Matthew, Bar, Amir, Assran, Mido, Sinha, Koustuv, Rabbat, Mike, LeCun, Yann, Ballas, Nicolas, Bardes, Adrien
Abstract
We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
Chinese Translation
我们提出了 V-JEPA 2.1,这是一系列自监督模型,旨在为图像和视频学习密集、高质量的视觉表征,同时保持强大的全局场景理解。该方法结合了四个关键组件。首先,密集预测损失使用基于掩码的目标,其中可见和被掩码的标记都对训练信号产生贡献,从而鼓励显式的空间和时间基础。其次,深度自监督在多个中间编码器层上分层应用自监督目标,以提高表征质量。第三,多模态标记器使得图像和视频的统一训练成为可能。最后,该模型在模型容量和训练数据方面都受益于有效的扩展。这些设计选择共同产生了空间结构化、语义连贯和时间一致的表征。从实证上看,V-JEPA 2.1 在多个具有挑战性的基准测试中实现了最先进的性能,包括在 Ego4D 上的短期物体交互预测达到 7.71 mAP,以及在 EPIC-KITCHENS 上的高层次动作预测达到 40.8 Recall@5,此外,相较于 V-JEPA-2 AC,在真实机器人抓取成功率上提高了 20 个百分点。该模型在机器人导航(TartanDrive 上的 5.687 ATE)、深度估计(在 NYUv2 上使用线性探针的 0.307 RMSE)和全局识别(在 Something-Something-V2 上的 77.7)方面也表现出色。这些结果表明,V-JEPA 2.1 在密集视觉理解和世界建模方面显著推动了技术的进步。
cs.CV / 213 / 2603.14493

Fine-tuning MLLMs Without Forgetting Is Easier Than You Think

不遗忘的微调多模态大型语言模型比你想象的更简单
Li, He, Zhang, Yuhui, Wang, Xiaohan, Lyu, Kaifeng, Yeung-Levy, Serena
Abstract
This paper demonstrates that simple adjustments to the fine-tuning recipes of multimodal large language models (MLLMs) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting to task-specific overfitting and address this issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. In general, our findings challenge the prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.
Chinese Translation
本文展示了对多模态大型语言模型(MLLM)微调方案的简单调整足以减轻灾难性遗忘。在视觉问答任务中,我们设计了一个2x2实验框架,以评估模型在分布内和分布外图像及文本输入上的表现。我们的结果表明,适当的正则化措施,例如限制可训练参数的数量或采用较低的学习率,在处理分布外图像时有效防止遗忘。然而,我们发现,在分布内图像和分布外文本的设置中存在一种明显的遗忘形式。我们将这种遗忘归因于任务特定的过拟合,并通过引入一种数据混合训练策略来解决这一问题,该策略结合了不同的数据集和任务。最后,我们证明这种方法自然扩展到持续学习,超越了现有的复杂辅助机制的方法。总体而言,我们的发现挑战了普遍的假设,突显了MLLM的内在鲁棒性,并提供了在保持其通用能力的同时进行适应的实用指南。
cs.CV / 214 / 2603.14496

Refining 3D Medical Segmentation with Verbal Instruction

通过语言指令优化三维医学分割
Xie, Kangxian, Yang, Jiancheng, Pinter, Nandor, Wu, Chao, Bozorgtabar, Behzad, Gao, Mingchen
Abstract
Accurate 3D anatomical segmentation is essential for clinical diagnosis and surgical planning. However, automated models frequently generate suboptimal shape predictions due to factors such as limited and imbalanced training data, inadequate labeling quality, and distribution shifts between training and deployment settings. A natural solution is to iteratively refine the predicted shape based on the radiologists' verbal instructions. However, this is hindered by the scarcity of paired data that explicitly links erroneous shapes to corresponding corrective instructions. As an initial step toward addressing this limitation, we introduce CoWTalk, a benchmark comprising 3D arterial anatomies with controllable synthesized anatomical errors and their corresponding repairing instructions. Building on this benchmark, we further propose an iterative refinement model that represents 3D shapes as vector sets and interacts with textual instructions to progressively update the target shape. Experimental results demonstrate that our method achieves significant improvements over corrupted inputs and competitive baselines, highlighting the feasibility of language-driven clinician-in-the-loop refinement for 3D medical shapes modeling.
Chinese Translation
准确的三维解剖分割对于临床诊断和手术规划至关重要。然而,自动化模型由于训练数据有限且不平衡、标注质量不足以及训练与部署环境之间的分布变化等因素,常常生成次优的形状预测。一个自然的解决方案是根据放射科医生的语言指令迭代优化预测的形状。然而,这一过程受到缺乏明确将错误形状与相应修正指令关联的配对数据的限制。作为解决这一限制的初步步骤,我们引入了CoWTalk,一个基准数据集,包含可控合成解剖错误的三维动脉解剖结构及其对应的修复指令。在此基准的基础上,我们进一步提出了一种迭代优化模型,该模型将三维形状表示为向量集,并与文本指令交互,以逐步更新目标形状。实验结果表明,我们的方法在处理损坏输入和竞争基线方面取得了显著改善,突显了基于语言的临床医生参与的三维医学形状建模优化的可行性。
cs.CV / 215 / 2603.14497

WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

WorldVLM:结合世界模型预测与视觉-语言推理
Englmeier, Stefan, Winter, Katharina, Flohr, Fabian B.
Abstract
Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WM) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.
Chinese Translation
自主驾驶系统依赖于能够推理高层场景上下文并准确预测周围环境动态的模型。视觉-语言模型(VLMs)最近作为决策制定和场景理解的有希望工具出现,提供了强大的上下文推理能力。然而,它们有限的空间理解能力限制了其作为端到端驾驶模型的有效性。世界模型(WM)内化环境动态以预测未来场景演变。最近被探索作为自我运动预测器和自主驾驶的基础模型,它们代表了应对该领域关键挑战的有希望方向,特别是在保持动态预测的同时增强泛化能力。为了利用基于上下文的决策制定和预测的互补优势,我们提出了WorldVLM:一种统一VLM和WM的混合架构。在我们的设计中,高层VLM生成行为指令以指导驾驶WM,从而实现可解释和上下文感知的动作。我们评估了条件策略,并提供了对混合设计挑战的见解。
cs.CV / 216 / 2603.14503

Mapping Dark-Matter Clusters via Physics-Guided Diffusion Models

通过物理引导的扩散模型映射暗物质团簇
Royo, Diego, Zhao, Brandon, Muñoz, Adolfo, Gutierrez, Diego, Bouman, Katherine L.
Abstract
Galaxy clusters are powerful probes of astrophysics and cosmology through gravitational lensing: the clusters' mass, dominated by 85% dark matter, distorts background light. Yet, mass reconstruction lacks the scalability and large-scale benchmarks to process the hundreds of thousands of clusters expected from forthcoming wide-field surveys. We introduce a fully automated method to reconstruct cluster surface mass density from photometry and gravitational lensing observables. Central to our approach is DarkClusters-15k, our new dataset of 15,000 simulated clusters with paired mass and photometry maps, the largest benchmark to date, spanning multiple redshifts and simulation frameworks. We train a plug-and-play diffusion prior on DarkClusters-15k that learns the statistical relationship between mass and light, and draw posterior samples constrained by weak- and strong-lensing observables; this yields principled reconstructions driven by explicit physics, alongside well-calibrated uncertainties. Our approach requires no expert tuning, runs in minutes rather than hours, achieves higher accuracy, and matches expertly-tuned reconstructions of the MACS 1206 cluster. We release our method and DarkClusters-15k to support development and benchmarking for upcoming wide-field cosmological surveys.
Chinese Translation
星系团是通过引力透镜效应研究天体物理学和宇宙学的强有力探针:星系团的质量中85%由暗物质主导,扭曲了背景光。然而,质量重建缺乏可扩展性和大规模基准,无法处理预计来自即将到来的广域调查的数十万个星系团。我们提出了一种完全自动化的方法,从光度和引力透镜观测中重建星系团的表面质量密度。我们方法的核心是DarkClusters-15k,这是我们新创建的包含15,000个模拟星系团的数据库,配有质量和光度图,是迄今为止最大的基准数据集,涵盖多个红移和模拟框架。我们在DarkClusters-15k上训练了一个即插即用的扩散先验,学习质量与光之间的统计关系,并根据弱透镜和强透镜观测绘制后验样本;这产生了由明确物理驱动的原则性重建,并伴随良好校准的不确定性。我们的方法无需专家调优,运行时间为几分钟而非几小时,达到了更高的准确性,并与经过专家调优的MACS 1206星系团的重建结果相匹配。我们发布了我们的方法和DarkClusters-15k,以支持即将到来的广域宇宙学调查的开发和基准测试。
cs.CV / 217 / 2603.14505

Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs

解锁潜在的画布:在大型语言模型中引发和基准化符号视觉表达
Zheng, Yiren, Li, Shibo, Liu, Jiaming, Wang, Haofan, Song, Yiren
Abstract
Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel "Seed-and-Evolve" pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.
Chinese Translation
当前的多模态方法主要将视觉生成视为一个外部过程,依赖于像素渲染或代码执行,因而忽视了大型语言模型(LLMs)中潜在的原生视觉表征能力。在本研究中,我们通过ASCII艺术这一紧凑、高效且以文本为基础的视觉格式来解锁这一潜力。我们提出了SVE-ASCII,一个统一框架,旨在直接在纯文本空间中引发和基准化符号视觉表达。为了应对系统性资源的匮乏,我们构建了ASCIIArt-7K,这是一个通过新颖的“种子与进化”流程合成的高质量数据集,该流程通过上下文中的风格编辑增强人类策划的锚点。我们进一步实施了一种统一的指令调优策略,联合优化生成(文本到ASCII)与理解(ASCII到文本)。重要的是,我们的实验揭示了任务双重性相关的关键现象:虽然已建立感知有助于生成的观点,但我们提供了有力证据表明生成训练显著增强了视觉理解。这确认了符号视觉处理中的相互增强循环,这一关系以前被假设但在视觉领域中很少有实证研究加以证明。我们发布了我们的数据集、ASCIIArt-Bench基准和SVE-ASCII模型,为基于文本的原生视觉智能建立了一个稳健的基线。
cs.CV / 218 / 2603.14507

Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets

利用未标记数据和LiDAR数据集扩展毫米波数据集以进行人体姿态估计
Peng, Zhuoxuan, Zhu, Boan, Zhang, Xingjian, Li, Wenying, Chan, S. -H. Gary
Abstract
Current mmWave datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, severely hampering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. EMDUL trains a pseudo-label estimator to annotate the unlabeled mmWave data and is able to convert, or translate, a given annotated LiDAR PC to its mmWave counterpart. Expanded with both LiDAR-converted and pseudo-labeled mmWave PCs, our mmWave dataset significantly boosts the performance and generalization ability of all our HPE models, with substantial 15.1% and 18.9% error reductions for in-domain and out-of-domain settings, respectively.
Chinese Translation
当前用于人体姿态估计(HPE)的毫米波数据集稀缺,且在点云(PC)属性和人体姿态方面缺乏多样性,严重阻碍了其训练模型的泛化能力。另一方面,未标记的毫米波HPE数据和多样化的LiDAR HPE数据集随处可得。我们提出了一种新方法EMDUL,通过利用未标记的毫米波数据和LiDAR数据集来扩展现有毫米波数据集的数量和多样性。EMDUL训练一个伪标签估计器来标注未标记的毫米波数据,并能够将给定的标注LiDAR点云转换或翻译为其毫米波对应物。通过结合LiDAR转换的点云和伪标记的毫米波点云扩展,我们的毫米波数据集显著提升了所有HPE模型的性能和泛化能力,在领域内和领域外设置中分别实现了15.1%和18.9%的误差降低。
cs.CV / 219 / 2603.14523

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

VLA-Thinker:通过图像推理增强视觉-语言-行动模型
Wang, Chaoyang, Bao, Wenrui, Gao, Sicheng, Xu, Bingxin, Tian, Yu, Rawat, Yogesh S., Ge, Yunhao, Shang, Yuzhang
Abstract
Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: https://cywang735.github.io/VLA-Thinker/ .
Chinese Translation
视觉-语言-行动(VLA)模型在具身智能方面展现出良好的潜力,但目前大多数方法依赖于基于文本的思维链推理,在这些方法中,视觉输入被视为静态上下文。这限制了模型在长时程任务中主动回访环境和解决歧义的能力。我们提出了VLA-Thinker,一个以图像思考(thinking-with-image)的推理框架,将感知建模为可动态调用的推理动作。为了训练这样一个系统,我们引入了一种两阶段的训练管道,包括(1)一个SFT(Supervised Fine-Tuning)冷启动阶段,使用策划的视觉思维链数据以激活结构化推理和工具使用行为,以及(2)基于GRPO的强化学习,以使完整的推理-动作轨迹与任务级成功对齐。在LIBERO和RoboTwin 2.0基准上的大量实验表明,VLA-Thinker显著提高了操作性能,LIBERO上的成功率达97.5%,并在长时程机器人任务上取得了显著提升。项目及代码链接:https://cywang735.github.io/VLA-Thinker/ 。
cs.CV / 220 / 2603.14526

LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

LatSearch:基于潜在奖励引导的搜索以加速视频扩散中的推理时间缩放
Zhao, Zengqun, Liu, Ziquan, Cao, Yu, Gong, Shaogang, Zhang, Zhensong, Song, Jifei, Deng, Jiankang, Patras, Ioannis
Abstract
The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
Chinese Translation
大型语言模型在推理时间缩放方面的近期成功激发了对视频扩散领域的类似探索。特别是,受到增强视频质量的“黄金噪声”存在的启发,先前的研究尝试通过优化或搜索更好的初始噪声来改善推理。然而,这些方法存在显著的局限性:它们要么依赖于在噪声采样开始时施加的先验,要么依赖于仅在去噪和解码后的视频上评估的奖励。这导致了误差累积、延迟和稀疏的奖励信号,以及高昂的计算成本,阻碍了更强大搜索算法的使用。至关重要的是,更强大的搜索算法正是能够为视频扩散解锁可控性、样本效率和生成质量的显著提升,前提是可以降低其计算成本。为填补这一空白,我们通过潜在奖励引导实现了视频扩散的高效推理时间缩放,该方法在去噪轨迹中提供中间的、信息丰富的和高效的反馈。我们引入了一种潜在奖励模型,该模型根据视觉质量、运动质量和文本对齐度对任意时间步的部分去噪潜在变量进行评分。在此模型的基础上,我们提出了LatSearch,一种新颖的推理时间搜索机制,执行奖励引导重采样和修剪(Reward-Guided Resampling and Pruning,RGRP)。在重采样阶段,候选者根据奖励归一化概率进行采样,以减少对奖励模型的过度依赖。在修剪阶段,在最终调度步骤中,仅保留累积奖励最高的候选者,从而提高质量和效率。我们在VBench-2.0基准上评估了LatSearch,并证明与基线Wan2.1模型相比,它在多个评估维度上始终改善视频生成。
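To make the search mechanics above concrete, here is a minimal sketch of Reward-Guided Resampling and Pruning (RGRP): candidates are carried along the denoising trajectory, resampled with reward-normalised probabilities at intermediate steps, and pruned to the highest cumulative reward at the final step. The `denoise_step` and `latent_reward` callables and the softmax temperature are illustrative assumptions, not the paper's implementation.

```python
import torch

def rgrp_search(candidates, denoise_step, latent_reward, timesteps, temperature=1.0):
    """Reward-guided resampling over a batch of N candidate latents."""
    xs = candidates                          # (N, C, H, W) initial noise latents
    cum_reward = torch.zeros(xs.shape[0])
    for i, t in enumerate(timesteps):
        xs = denoise_step(xs, t)             # one denoising update per candidate
        r = latent_reward(xs, t)             # (N,) scores for partially denoised latents
        cum_reward = cum_reward + r
        if i < len(timesteps) - 1:
            # Resampling: reward-normalised probabilities avoid greedy top-1
            # selection and reduce over-reliance on the reward model.
            probs = torch.softmax(r / temperature, dim=0)
            idx = torch.multinomial(probs, num_samples=xs.shape[0], replacement=True)
            xs, cum_reward = xs[idx], cum_reward[idx]
        else:
            # Pruning at the final scheduled step: keep the best candidate only.
            best = cum_reward.argmax()
            xs = xs[best : best + 1]
    return xs
```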
cs.CV / 221 / 2603.14528

Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events

Interp3R:基于帧和事件的连续时间三维几何估计
Guo, Shuang, Febryanto, Filbert, Sun, Lei, Gallego, Guillermo
Abstract
In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.
Chinese Translation
近年来,由基于点图的方法(如DUSt3R)开创的三维视觉基础模型引起了广泛关注,在多样场景中实现了令人印象深刻的准确性和强大的泛化能力。然而,这些方法本质上仅限于在图像捕获的离散时间点恢复场景几何,导致在连续帧之间的盲时间内场景演变大部分未被探索。我们提出了Interp3R,尽我们所知,这是第一个增强基于点图模型的方法,能够在任意时间点估计深度和相机姿态。Interp3R利用异步事件数据对基于帧的模型生成的点图进行插值,从而实现时间上连续的几何表示。然后,通过将插值后的点图与基础基于帧的模型预测的点图对齐到一个一致的空间框架中,联合恢复深度和相机姿态。我们仅在一个合成数据集上训练Interp3R,但在广泛的合成和真实世界基准测试中展示了强大的泛化能力。大量实验表明,Interp3R在很大程度上超越了采用二维视频帧插值后进行三维几何估计的两阶段流程的最先进基线。
cs.CV / 222 / 2603.14536

Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders

提炼潜在流形:通过变分自编码器进行分辨率外推
Chu, Jiaming, Wang, Tao, Jin, Lei
Abstract
Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates the use of knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that models perform better on samples close to their training data distribution than on unseen distributions. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolutions exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond $256^2$ resolution, the distilled encoder generalizes effectively to $512^2$ resolution inputs, partially inheriting the teacher model's resolution preference. We further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID metrics. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. This also means that high training costs in memory, time, and high-resolution data are not necessary conditions for distilling a VAE with high-resolution image reconstruction capabilities. Even on low-resolution datasets, the distilled model can still learn the teacher model's detailed knowledge of high-resolution image reconstruction.
Chinese Translation
变分自编码器(Variational Autoencoder, VAE)编码器在现代生成模型中扮演着关键角色,但其计算成本常常促使人们使用知识蒸馏或量化来获得紧凑的替代方案。现有研究通常认为,模型在接近其训练数据分布的样本上表现更好,而在未见数据分布上表现较差。在本研究中,我们报告了VAE编码器蒸馏中的一个反直觉现象:仅在低分辨率下蒸馏的紧凑编码器在其原生分辨率下表现出较差的重建性能,但在更高的、未见的输入分辨率下评估时却取得了显著改善。尽管从未在超过$256^2$的分辨率上训练,该蒸馏编码器仍能有效地推广到$512^2$的输入分辨率,部分继承了教师模型的分辨率偏好。我们进一步分析了不同分辨率下的潜在分布,发现高分辨率输入生成的潜在表示与教师的流形更加一致。通过在ImageNet-256上的大量实验,我们表明简单的分辨率重映射——在编码之前对输入进行上采样,并在评估时对重建结果进行下采样——在PSNR、MSE、SSIM、LPIPS和rFID指标上带来了显著的提升。这些发现表明,VAE编码器蒸馏学习的是分辨率一致的潜在流形,而不是特定于分辨率的像素映射。这也意味着,在内存、时间和高分辨率数据集上的高训练成本并不是蒸馏具有高分辨率图像重建能力的VAE的必要条件。在低分辨率数据集上,蒸馏模型仍然能够学习教师模型在高分辨率图像重建中的详细知识。
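The resolution-remapping trick described above fits in a few lines. A minimal sketch, assuming a distilled VAE that exposes `encode`/`decode` (interface names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def remapped_reconstruction(vae, images_256, scale=2):
    """Upsample before encoding, downsample the reconstruction for evaluation."""
    hi = F.interpolate(images_256, scale_factor=scale, mode="bilinear",
                       align_corners=False)              # 256^2 -> 512^2 inputs
    recon_hi = vae.decode(vae.encode(hi))                # encoder runs at its preferred scale
    return F.interpolate(recon_hi, size=images_256.shape[-2:],
                         mode="bilinear", align_corners=False)

def psnr(x, y, max_val=1.0):
    return 10 * torch.log10(max_val ** 2 / F.mse_loss(x, y))
```

Metrics such as PSNR are then computed between `images_256` and the remapped reconstruction.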
cs.CV / 223 / 2603.14549

ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

ASAP:关注转移感知剪枝用于高效的LVLM推理
Pathak, Surendra, Han, Bo
Abstract
While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.
Chinese Translation
尽管大型视觉语言模型(LVLMs)展现了卓越的多模态能力,但处理高分辨率视觉标记的二次计算成本仍然是一个关键瓶颈。尽管最近的标记减少策略试图加速推理,但这些方法未能充分利用注意力值,并未解决标记冗余问题。更为重要的是,它们忽视了LVLMs固有的“注意力转移”现象,这会扭曲标记的注意力得分。在本研究中,我们提出了ASAP,一种新颖的无训练、兼容KV缓存的剪枝方案,全面解决了这些局限性。首先,我们通过利用动态双向软注意力掩码来减轻注意力转移,确保选择真正有信息量的标记,而不是简单的基于注意力的选择。其次,我们认为标记集中的高语义冗余会降低性能。因此,我们引入了加权软合并组件,合并语义相似的标记,仅保留最具特征密度的视觉块供后续层使用。ASAP实现了视觉上下文的几乎无损压缩,保留了99.02%的原始LLaVA-NeXT-7B性能,同时将计算FLOPs大幅削减约80%。
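The abstract does not spell out the merging rule, so the following is only a rough sketch of one plausible reading of weighted soft merging: keep the highest-scoring tokens and fold each dropped token into its most similar kept token, weighted by an informativeness score. The keep ratio and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_merge_tokens(tokens, scores, keep_ratio=0.2):
    """tokens: (N, D) visual tokens; scores: (N,) informativeness weights."""
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = scores.topk(n_keep).indices
    drop_mask = torch.ones(tokens.shape[0], dtype=torch.bool)
    drop_mask[keep_idx] = False

    kept, dropped = tokens[keep_idx].clone(), tokens[drop_mask]
    if dropped.numel():
        # Cosine similarity routes each dropped token to its nearest kept token.
        sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
        target = sim.argmax(dim=-1)
        w = scores[drop_mask].unsqueeze(-1)
        kept.index_add_(0, target, dropped * w)           # score-weighted merge
        denom = torch.ones(n_keep, 1).index_add_(0, target, w)
        kept = kept / denom
    return kept
```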
cs.CV / 224 / 2603.14559

A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy

一种全面的多模态数据集及其在内镜下溃疡性结肠炎评分中的基准测试
Ghatwary, Noha, Yue, Jiangbei, Elgendy, Ahmed, Nagdy, Hanna, Galal, Ahmed, Fathy, Hayam, El-Amin, Hussein, Subramanian, Venkataraman, Mohammed, Noor, Ochoa-Ruiz, Gilberto, Ali, Sharib
Abstract
Ulcerative colitis (UC) is a chronic mucosal inflammatory condition that places patients at increased risk of colorectal cancer. Colonoscopic surveillance remains the gold standard for assessing disease activity, and reporting typically relies on standardised endoscopic scoring metrics. The most widely used is the Mayo Endoscopic Score (MES), with some centres also adopting the Ulcerative Colitis Endoscopic Index of Severity (UCEIS). Both are descriptive assessments of mucosal inflammation (MES: 0 to 3; UCEIS: 0 to 8), where higher values indicate more severe disease. However, computational methods for automatically predicting these scores remain limited, largely due to the lack of publicly available expert-annotated datasets and the absence of robust benchmarking. There is also a significant research gap in generating clinically meaningful descriptions of UC images, despite image captioning being a well-established computer vision task. Variability in endoscopic systems and procedural workflows across centres further highlights the need for multi-centre datasets to ensure algorithmic robustness and generalisability. In this work, we introduce a curated multi-centre, multi-resolution dataset that includes expert-validated MES and UCEIS labels, alongside detailed clinical descriptions. To our knowledge, this is the first comprehensive dataset that combines dual scoring metrics for classification tasks with expert-generated captions describing mucosal appearance and clinically accepted reasoning for image captioning. This resource opens new opportunities for developing clinically meaningful multimodal algorithms. In addition to the dataset, we also provide benchmarking using convolutional neural networks, vision transformers, hybrid models, and widely used multimodal vision-language captioning algorithms.
Chinese Translation
溃疡性结肠炎(UC)是一种慢性粘膜炎症性疾病,使患者面临更高的结直肠癌风险。结肠镜监测仍然是评估疾病活动性的金标准,报告通常依赖于标准化的内镜评分指标。最广泛使用的是梅奥内镜评分(Mayo Endoscopic Score, MES),一些中心还采用溃疡性结肠炎内镜严重指数(Ulcerative Colitis Endoscopic Index of Severity, UCEIS)。这两种评分都是对粘膜炎症的描述性评估(MES: 0到3;UCEIS: 0到8),较高的值表示疾病更为严重。然而,自动预测这些评分的计算方法仍然有限,主要是由于缺乏公开可用的专家注释数据集以及缺乏稳健的基准测试。尽管图像描述是一个成熟的计算机视觉任务,但在生成具有临床意义的UC图像描述方面仍存在显著的研究空白。不同中心的内镜系统和操作流程的差异进一步突显了多中心数据集的必要性,以确保算法的稳健性和普适性。在本研究中,我们介绍了一个经过精心策划的多中心、多分辨率数据集,其中包括经过专家验证的MES和UCEIS标签,以及详细的临床描述。据我们所知,这是第一个结合双重评分指标用于分类任务的全面数据集,并包含描述粘膜外观的专家生成的标题和临床认可的图像描述理由。该资源为开发具有临床意义的多模态算法开辟了新的机会。除了数据集外,我们还提供了基于卷积神经网络、视觉变换器、混合模型和广泛使用的多模态视觉-语言图像描述算法的基准测试。
cs.CV / 225 / 2603.14579

Medical Image Spatial Grounding with Semantic Sampling

医学图像空间定位与语义采样
Yu, Andrew Seohwan, Hariri, Mohsen, Nakamura, Kunio, Yang, Mingrui, Li, Xiaojuan, Chaudhary, Vipin
Abstract
Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce MIS-Ground, a benchmark that comprehensively tests a VLM for vulnerabilities against specific modes of Medical Image Spatial Grounding. We release MIS-Ground to the public at https://anonymous.4open.science/r/mis-ground. In addition, we present MIS-SemSam, a low-cost, inference-time, and model-agnostic optimization of VLMs that improves their spatial grounding ability with the use of Semantic Sampling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.
Chinese Translation
视觉语言模型(VLMs)在图像和视频的视觉定位方面显示出了显著的潜力。在医学成像研究中,VLMs是连接物体检测与分割、以及报告理解与生成的桥梁。然而,在医学图像的三维空间中对解剖结构进行空间定位面临许多独特的挑战。在本研究中,我们考察了图像模态、切片方向和坐标系统作为VLM视觉组件的区分因素,以及解剖、方向和关系术语的使用作为语言组件的因素。然后,我们展示了视觉和文本提示系统(如标签、边界框和掩膜叠加)对VLM空间定位能力的影响各不相同。为了实现测量和可重复性,我们引入了MIS-Ground,这是一个全面测试VLM在医学图像空间定位(Medical Image Spatial Grounding)特定模式下脆弱性的基准。我们将MIS-Ground公开发布,地址为 anonymous.4open.science/r/mis-ground。此外,我们提出了MIS-SemSam,这是一种低成本、推理时的、模型无关的VLM优化方法,通过使用语义采样(Semantic Sampling)来提高其空间定位能力。我们发现,MIS-SemSam使Qwen3-VL-32B在MIS-Ground上的准确率提高了13.06%。
cs.CV / 226 / 2603.14587

Texel Splatting: Perspective-Stable 3D Pixel Art

纹素喷溅:视角稳定的3D像素艺术
Ebert, Dylan
Abstract
Rendering 3D scenes as pixel art requires that discrete pixels remain stable as the camera moves. Existing methods snap the camera to a grid. Under orthographic projection, this works: every pixel shifts by the same amount, and a single snap corrects all of them. Perspective breaks this. Pixels at different depths drift at different rates, and no single snap corrects all depths. Texel splatting avoids this entirely. Scene geometry is rendered into a cubemap from a fixed point in the world, and each texel is splatted to the screen as a world-space quad. Cubemap indexing gives rotation invariance. Grid-snapping the origin gives translation invariance. The primary limitation is that a fixed origin cannot see all geometry; disocclusion at probe boundaries remains an open tradeoff.
Chinese Translation
将3D场景渲染为像素艺术要求在相机移动时离散像素保持稳定。现有方法将相机固定在网格上。在正交投影下,这种方法有效:每个像素以相同的量移动,单次固定可以校正所有像素。然而,透视投影打破了这一点。不同深度的像素以不同的速度漂移,无法通过单次固定校正所有深度。纹素喷溅完全避免了这个问题。场景几何体从世界中的固定点渲染到立方体贴图中,每个纹素作为世界空间四边形喷溅到屏幕上。立方体贴图索引提供了旋转不变性。将原点网格固定提供了平移不变性。主要限制在于固定原点无法看到所有几何体;探测边界处的遮挡解除仍然是一个开放的权衡问题。
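The translation-invariance claim comes down to quantising the probe origin to the texel grid, so that camera drift smaller than one texel never moves any texel. A tiny illustrative sketch, with `texel_world_size` as an assumed parameter:

```python
import numpy as np

def snap_origin(camera_pos, texel_world_size):
    """Quantise the cubemap probe origin to a world-space texel grid."""
    return np.round(np.asarray(camera_pos) / texel_world_size) * texel_world_size

# A camera drifting by less than one texel keeps the same probe origin:
print(snap_origin([1.23, 0.04, -2.71], texel_world_size=0.1))  # [ 1.2  0.  -2.7]
```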
cs.CV / 227 / 2603.14609

GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data

GroundSet:一个基于地籍数据的空间理解数据集
Ferrod, Roger, Lecene, Maël, Sapkota, Krishna, Leifman, George, Silverman, Vered, Beryozkin, Genady, Lobry, Sylvain
Abstract
Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for critical applications like urban planning, environmental monitoring and disaster management. However, Multimodal Large Language Models exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing, primarily due to a reliance on limited or repurposed legacy datasets. To bridge this gap, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision effectively bridges this gap, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.
Chinese Translation
在地球观测中,精确的空间理解对于将原始航空影像转化为可用于城市规划、环境监测和灾害管理等关键应用的可操作洞察至关重要。然而,多模态大型语言模型在遥感中的细粒度空间理解方面存在严重不足,主要是由于依赖于有限或重新利用的遗留数据集。为了解决这一问题,我们引入了一个基于可验证地籍矢量数据的大规模数据集,包含380万个标注对象,覆盖510,000张高分辨率图像和135个细粒度语义类别。我们通过涵盖七个空间推理任务的全面指令调优基准来验证这一资源。我们的评估使用标准的LLaVA架构建立了一个稳健的基线。我们表明,尽管当前的遥感专用模型和商业模型(例如Gemini)在零样本设置中表现不佳,但高保真监督有效地弥补了这一差距,使得标准架构能够在不进行复杂架构修改的情况下掌握细粒度的空间定位。
cs.CV / 228 / 2603.14610

Make it SING: Analyzing Semantic Invariants in Classifiers

使其具有语义:分析分类器中的语义不变性
Yadid, Harel, Levi, Meir Yossef, Betser, Roy, Gilboa, Guy
Abstract
All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.
Chinese Translation
所有分类器,包括最先进的视觉模型,都具有不变性,这部分源于其线性映射的几何特性。这些不变性存在于分类器的零空间中,诱导出一组等效的输入,这些输入映射到相同的输出。这些不变性的语义内容仍然模糊,因为现有方法难以提供人类可解释的信息。为了解决这一问题,我们提出了零空间几何的语义解释(Semantic Interpretation of the Null-space Geometry, SING),该方法构建与网络相关的等效图像,并为可用的变体分配语义解释。我们使用从网络特征到多模态视觉语言模型的映射。这使我们能够获得自然语言描述和诱导的语义变化的视觉示例。SING可以应用于单个图像,揭示局部不变性,或应用于图像集,从而在类别和模型层面进行广泛的统计分析。例如,我们的方法揭示了ResNet50将相关的语义属性泄漏到零空间,而DinoViT(一个经过自监督DINO预训练的ViT)在保持不变空间中的类别语义方面表现更优。
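The linear-map invariants the paper starts from can be verified directly: any feature perturbation lying in the null space of a classifier head leaves the logits, and hence the prediction, unchanged. A small worked example with random weights (not a trained model):

```python
import torch

torch.manual_seed(0)
W = torch.randn(10, 512)                    # head mapping 512-d features to 10 logits
feat = torch.randn(512)

# Null-space basis: right-singular vectors beyond the rank of W.
U, S, Vh = torch.linalg.svd(W)
null_basis = Vh[10:]                        # (502, 512) since rank(W) <= 10

delta = null_basis.T @ torch.randn(502)     # arbitrary null-space perturbation
print(torch.allclose(W @ feat, W @ (feat + delta), atol=1e-3))  # True: identical logits
```

SING's contribution is then to make such `delta` directions interpretable by mapping the induced image variations into a vision-language model.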
cs.CV / 229 / 2603.14621

A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans

基于异构集成的多中心 COVID-19 胸部 CT 扫描分类
Nilay, Aadit, Thapar, Bhavesh, Agrawal, Anant, Teli, Mohammad Nayeem
Abstract
The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.
Chinese Translation
COVID-19 大流行暴露了诊断工作流程中的关键局限性:RT-PCR 检测存在周转时间慢和假阴性率高的问题,而基于 CT 的筛查提供了更快的补充诊断,但需要专家的放射学解读。在多个医院中心部署自动化 CT 分析带来了进一步的挑战,因为扫描仪硬件、采集协议和患者群体的差异导致显著的领域转移,从而降低了单一模型的性能。为了解决这些挑战,我们提出了一种包含九个模型的异构集成,涵盖三种推理范式:(1)具有切片级 sigmoid 聚合的自监督 DINOv2 视觉变换器,(2)具有切片级 sigmoid 平均的 RadImageNet 预训练 DenseNet-121,以及(3)使用 EfficientNet-B3、ConvNeXt-Tiny 和 EfficientNetV2-S 主干的七个门控注意力多实例学习模型,采用扫描级 softmax 分类。通过随机种子变化和随机权重平均,进一步增强了集成的多样性。我们通过结合焦点损失、嵌入级 Mixup 和领域感知增强来解决严重的过拟合问题,将验证与训练损失比从 35 倍降低到不到 3 倍。模型输出通过得分加权概率平均进行融合,并通过每个来源的阈值优化进行校准。最终集成在四个医院中心实现了 0.9280 的平均宏 F1 值,超越了最佳单一模型(F1=0.8969)0.031,证明了结合源感知校准的异构架构对于稳健的多站点医学图像分类至关重要。
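The fusion and calibration steps are easy to sketch. A condensed version, where the validation-score weights and the threshold grid are illustrative assumptions rather than the paper's exact values:

```python
import numpy as np

def fuse(probs, weights):
    """probs: (n_models, n_scans, n_classes); weights: validation score per model."""
    w = np.asarray(weights, dtype=float)
    return np.einsum("m,msc->sc", w / w.sum(), probs)   # score-weighted averaging

def calibrate_threshold(pos_probs, labels, grid=np.linspace(0.1, 0.9, 81)):
    """Per-source calibration: pick the threshold maximising F1 on one centre."""
    def f1(t):
        pred = pos_probs >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        return 2 * tp / max(2 * tp + fp + fn, 1)
    return max(grid, key=f1)
```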
cs.CV / 230 / 2603.14632

Continual Few-shot Adaptation for Synthetic Fingerprint Detection

合成指纹检测的持续少样本适应
Benjamin, Joseph Geo, Jain, Anil K., Nandakumar, Karthik
Abstract
The quality and realism of synthetically generated fingerprint images have increased significantly over the past decade fueled by advancements in generative artificial intelligence (GenAI). This has exacerbated the vulnerability of fingerprint recognition systems to data injection attacks, where synthetic fingerprints are maliciously inserted during enrollment or authentication. Hence, there is an urgent need for methods to detect if a fingerprint image is real or synthetic. While it is straightforward to train deep neural network (DNN) models to classify images as real or synthetic, often such DNN models overfit the training data and fail to generalize well when applied to synthetic fingerprints generated using unseen GenAI models. In this work, we formulate synthetic fingerprint detection as a continual few-shot adaptation problem, where the objective is to rapidly evolve a base detector to identify new types of synthetic data. To enable continual few-shot adaptation, we employ a combination of binary cross-entropy and supervised contrastive (applied to the feature representation) losses and replay a few samples from previously known styles during fine-tuning to mitigate catastrophic forgetting. Experiments based on several DNN backbones (as feature extractors) and a variety of real and synthetic fingerprint datasets indicate that the proposed approach achieves a good trade-off between fast adaptation for detecting unseen synthetic styles and forgetting of known styles.
Chinese Translation
在过去十年中,合成生成指纹图像的质量和真实感显著提高,这得益于生成性人工智能(GenAI)的进步。这加剧了指纹识别系统对数据注入攻击的脆弱性,即在注册或认证过程中恶意插入合成指纹。因此,迫切需要检测指纹图像是真实的还是合成的方法。虽然训练深度神经网络(DNN)模型以将图像分类为真实或合成相对简单,但这些DNN模型往往会过拟合训练数据,并在应用于使用未见过的GenAI模型生成的合成指纹时,无法很好地泛化。在本研究中,我们将合成指纹检测表述为一个持续少样本适应问题,其目标是快速演变基础检测器,以识别新类型的合成数据。为了实现持续少样本适应,我们采用了二元交叉熵和监督对比(应用于特征表示)损失的组合,并在微调过程中重放一些来自先前已知风格的样本,以减轻灾难性遗忘。基于多种DNN骨干网络(作为特征提取器)和多种真实及合成指纹数据集的实验表明,所提出的方法在检测未见合成风格的快速适应性与已知风格的遗忘之间取得了良好的平衡。
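A loose sketch of the adaptation objective described above: binary cross-entropy on the logits plus a supervised contrastive term on the feature representation, with replayed samples from known styles assumed to be mixed into each batch upstream. The temperature and mixing weight are assumptions.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss on L2-normalised features (real vs. synthetic)."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos.fill_diagonal_(False)
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -log_prob[pos].mean()

def adaptation_loss(logits, feats, labels, lam=0.5):
    bce = F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
    return bce + lam * sup_con_loss(feats, labels)
```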
cs.CV / 231 / 2603.14645

Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion

谱匹配:潜在扩散中优越扩散性的统一视角
Ning, Mang, Li, Mingxiao, Zhang, Le, Liu, Lanmiao, Blaschko, Matthew B., Salah, Albert Ali, Ertugrul, Itir Onal
Abstract
In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods as special cases (e.g., VA-VAE, EQ-VAE). Experiments suggest that Spectrum Matching yields superior diffusion generation on CelebA and ImageNet datasets, and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at https://github.com/forever208/SpectrumMatching.
Chinese Translation
在本文中,我们研究了变分自编码器(VAE)在潜在扩散中的扩散性(可学习性)。首先,我们展示了使用均方误差(MSE)目标训练的像素空间扩散在本质上偏向于学习低频和中频空间特征,并且自然图像的幂律功率谱密度(PSD)使得这种偏向在感知上是有利的。受到这一结果的启发,我们提出了谱匹配假说:具有优越扩散性的潜变量应当(i)遵循平坦的幂律PSD(编码谱匹配,ESM),并且(ii)通过解码器保持频率到频率的语义对应关系(解码谱匹配,DSM)。在实践中,我们通过在图像和潜变量之间匹配PSD来应用ESM,并通过频率对齐重建的共享谱掩蔽来实现DSM。重要的是,谱匹配提供了一个统一的视角,澄清了对过于噪声或过于平滑的潜变量的先前观察,并将几种最近的方法解释为特例(例如,VA-VAE,EQ-VAE)。实验表明,谱匹配在CelebA和ImageNet数据集上产生了优越的扩散生成效果,并且优于先前的方法。最后,我们将谱视角扩展到表示对齐(REPA):我们展示了目标表示的方向谱能量对REPA至关重要,并提出了一种基于差分高斯(DoG)的方法,以进一步提高REPA的性能。我们的代码可在 https://github.com/forever208/SpectrumMatching 获取。
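Encoding Spectrum Matching amounts to comparing radially averaged power spectra. A minimal sketch; the binning scheme and the choice to resize images to the latent resolution so the frequency bins align are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def radial_psd(x):
    """x: (B, C, H, W) -> (n_bins,) radially averaged log power spectral density."""
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    power = (f.abs() ** 2).mean(dim=(0, 1))                  # (H, W)
    h, w = power.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    r = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2).long()
    n_bins = min(h, w) // 2
    idx = r.clamp(max=n_bins - 1).flatten()
    psd = torch.zeros(n_bins).index_add_(0, idx, power.flatten())
    cnt = torch.zeros(n_bins).index_add_(0, idx, torch.ones(h * w))
    return torch.log(psd / cnt + 1e-8)

def esm_loss(images, latents):
    # Resize so the PSD bins of images and latents align (a sketch-level choice).
    images = F.interpolate(images, size=latents.shape[-2:],
                           mode="bilinear", align_corners=False)
    return torch.mean((radial_psd(images) - radial_psd(latents)) ** 2)
```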
cs.CV / 232 / 2603.14647

TopoCL: Topological Contrastive Learning for Medical Imaging

TopoCL:用于医学影像的拓扑对比学习
Meng, Guangyu, Gu, Pengfei, Liang, Peixian, Lalor, John P., Chambers, Erin Wolf, Chen, Danny Z.
Abstract
Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.
Chinese Translation
对比学习(Contrastive Learning, CL)已成为从未标记图像中学习表示的一种强大方法。然而,现有的 CL 方法主要集中于视觉外观特征,而忽视了提供医学影像分析重要线索的拓扑特征(例如,连通模式、边界配置、腔体形成)。为了解决这一局限性,我们提出了一种新的拓扑对比学习框架(TopoCL),该框架在医学影像的对比学习过程中明确利用拓扑结构。具体而言,我们首先引入了拓扑感知增强方法,通过持久性图之间的相对瓶颈距离控制拓扑扰动,保留医学相关的拓扑属性,同时实现受控的结构变化。然后,我们设计了一个层次拓扑编码器,通过自注意力和交叉注意力机制捕捉拓扑特征。最后,我们开发了一个自适应专家混合(Mixture-of-Experts, MoE)模块,以动态整合视觉和拓扑表示。TopoCL 可以与现有的 CL 方法无缝集成。我们在五种代表性的 CL 方法(SimCLR、MoCo-v3、BYOL、DINO 和 Barlow Twins)以及五个多样化的医学影像分类数据集上评估了 TopoCL。实验结果表明,TopoCL 在线性探测分类准确率上实现了一致的提升:平均提高 +3.26%,且具有显著的统计学意义,验证了其有效性。
cs.CV / 233 / 2603.14658

Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos

人类与人工智能集成提高低至中等质量视频中的深度伪造检测
Postiglione, Marco, Gortner, Isabel, Subrahmanian, V. S.
Abstract
Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones leading to low/moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening in the case of CharadesDF where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially in non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.
Chinese Translation
深度伪造检测通常被视为一个机器学习问题,但在现实条件下人类与人工智能检测器的比较仍然不够清晰。我们评估了200名参与者和95个最先进的人工智能检测器,使用了两个数据集:DF40,一个标准基准,以及CharadesDF,一个新颖的日常活动视频数据集。CharadesDF是使用手机录制的,相较于更专业拍摄的DF40,其视频质量较低/中等。人类在两个数据集上的表现均优于人工智能检测器,尤其在CharadesDF中,差距更为明显,人工智能的准确率几乎降至偶然水平(0.537),而人类则保持了稳健的表现(0.784)。人类和人工智能的错误是互补的:人类错过高质量的深度伪造,而人工智能检测器则将真实视频标记为伪造,混合的人类与人工智能集成减少了高置信度错误。这些发现表明,尤其是在非专业制作的视频中,有效的现实世界深度伪造检测需要人类与人工智能的协作,而非仅依赖人工智能算法。
cs.CV / 234 / 2603.14659

VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

VisionCoach:通过视觉感知提示增强基于视频的推理
Lee, Daeun, Yu, Shoubin, Zhang, Yue, Bansal, Mohit
Abstract
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation cost or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings, across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
Chinese Translation
视频推理要求模型在帧间定位和跟踪与问题相关的证据。尽管使用可验证奖励的强化学习(RL)提高了准确性,但在推理过程中仍然难以实现可靠的时空定位(grounding)。此外,改善定位能力通常依赖于大规模训练数据或推理时的感知工具,这会增加标注成本或计算成本。为了解决这一挑战,我们提出了VisionCoach,一种输入自适应的RL框架,通过将视觉提示作为训练时的指导来改善时空定位。在RL训练过程中,视觉提示被选择性地应用于具有挑战性的输入,以增强与问题相关的证据并抑制干扰项。模型随后通过自蒸馏内化这些改进,使其能够在推理时直接对原始视频进行有定位依据的推理,而无需视觉提示。VisionCoach由两个组件组成:(1)视觉提示选择器,根据视频和问题预测适当的提示类型;(2)时空推理器,在视觉提示指导和对象感知定位奖励(强制对象身份一致性和多区域边界框重叠)下通过RL优化。大量实验表明,VisionCoach在可比设置下,在多种视频推理、视频理解和时序定位基准(V-STAR、VideoMME、World-Sense、VideoMMMU、PerceptionTest和Charades-STA)中实现了最先进的性能,同时保持单一高效的推理路径,无需外部工具。我们的结果表明,训练期间的视觉提示改善了有定位依据的视频推理,而自蒸馏使模型能够内化这一能力,在推理时无需提示。
cs.CV / 235 / 2603.14666

EviATTA: Evidential Active Test-Time Adaptation for Medical Segment Anything Models

EviATTA:针对医学Segment Anything模型的证据主动测试时间适应
Chen, Jiayi, George, Yasmeen, Chong, Winston, Cai, Jianfei
Abstract
Deploying foundational medical Segment Anything Models (SAMs) via test-time adaptation (TTA) is challenging under large distribution shifts, where test-time supervision is often unreliable. While active test-time adaptation (ATTA) introduces limited expert feedback to improve reliability, existing ATTA methods still suffer from unreliable uncertainty estimation and inefficient utilization of sparse annotations. To address these issues, we propose Evidential Active Test-Time Adaptation (EviATTA), which is, to our knowledge, the first ATTA framework tailored for medical SAMs. Specifically, we adopt the Dirichlet-based Evidential Modeling to decompose overall predictive uncertainty into distribution uncertainty and data uncertainty. Building on this decomposition, we design a Hierarchical Evidential Sampling strategy, where image-wise distribution uncertainty is used to select informative shifted samples, while distance-aware data uncertainty guides sparse pixel annotations to resolve data ambiguities. We further introduce Dual Consistency Regularization, which enforces progressive prompt consistency on sparsely labeled samples to better exploit sparse supervision and applies variational feature consistency on unlabeled samples to stabilize adaptation. Extensive experiments on six medical image segmentation datasets demonstrate that EviATTA consistently improves adaptation reliability with minimal expert feedback under both batch-wise and instance-wise test-time adaptation settings.
Chinese Translation
在大规模分布变化下,通过测试时间适应(TTA)部署基础医学Segment Anything模型(SAMs)面临挑战,此时测试时间监督往往不可靠。尽管主动测试时间适应(ATTA)引入了有限的专家反馈以提高可靠性,但现有的ATTA方法仍然存在不可靠的不确定性估计和稀疏注释的低效利用等问题。为了解决这些问题,我们提出了证据主动测试时间适应(EviATTA),据我们所知,这是第一个专为医学SAMs量身定制的ATTA框架。具体而言,我们采用基于Dirichlet的证据建模,将整体预测不确定性分解为分布不确定性和数据不确定性。在此分解的基础上,我们设计了一种层次证据采样策略,其中图像级分布不确定性用于选择信息丰富的偏移样本,而基于距离的数据不确定性则指导稀疏像素注释以解决数据歧义。我们进一步引入了双重一致性正则化,强制在稀疏标记样本上逐步保持提示一致性,以更好地利用稀疏监督,并在未标记样本上应用变分特征一致性以稳定适应。在六个医学图像分割数据集上的大量实验表明,EviATTA在批量和实例测试时间适应设置下,均能在最小专家反馈的情况下持续提高适应可靠性。
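The Dirichlet decomposition used here has a standard closed form: total predictive entropy splits into the expected data (aleatoric) entropy plus a distributional (epistemic) remainder. A small sketch following common evidential deep-learning formulations, not necessarily the authors' exact parameterisation:

```python
import torch

def dirichlet_uncertainties(evidence, eps=1e-8):
    """evidence: (B, K) non-negative per-class evidence; alpha = evidence + 1."""
    alpha = evidence + 1.0
    s = alpha.sum(dim=-1, keepdim=True)
    p = alpha / s                                        # expected class probabilities
    total = -(p * (p + eps).log()).sum(dim=-1)           # entropy of the mean
    # Expected entropy of categorical distributions drawn from the Dirichlet:
    data = -(p * (torch.digamma(alpha + 1) - torch.digamma(s + 1))).sum(dim=-1)
    return total, data, total - data                     # total, data, distribution
```

Image-wise distribution uncertainty would then drive sample selection, while pixel-wise data uncertainty guides where the sparse annotations are spent.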
cs.CV / 236 / 2603.14667

Comparative Analysis of 3D Convolutional and 2.5D Slice-Conditioned U-Net Architectures for MRI Super-Resolution via Elucidated Diffusion Models

基于阐明扩散模型的MRI超分辨率的3D卷积与2.5D切片条件U-Net架构的比较分析
Chiche, Hendrik, Corcos, Ludovic, Rouge, Logan
Abstract
Magnetic resonance imaging (MRI) super-resolution (SR) methods that computationally enhance low-resolution acquisitions to approximate high-resolution quality offer a compelling alternative to expensive high-field scanners. In this work we investigate an elucidated diffusion model (EDM) framework for brain MRI SR and compare two U-Net backbone architectures: (i) a full 3D convolutional U-Net that processes volumetric patches with 3D convolutions and multi-head self-attention, and (ii) a 2.5D slice-conditioned U-Net that super-resolves each slice independently while conditioning on an adjacent slice for inter-slice context. Both models employ continuous-sigma noise conditioning following Karras et al. and are trained on the NKI cohort of the FOMO60K dataset. On a held-out test set of 5 subjects (6 volumes, 993 slices), the 3D model achieves 37.75 dB PSNR, 0.997 SSIM, and 0.020 LPIPS, improving on the off-the-shelf pretrained EDSR baseline (35.57 dB / 0.024 LPIPS) and the 2.5D variant (35.82 dB) across all three metrics under the same test data and degradation pipeline.
Chinese Translation
磁共振成像(MRI)超分辨率(SR)方法通过计算增强低分辨率采集数据以接近高分辨率质量,为昂贵的高场扫描仪提供了一个引人注目的替代方案。在本研究中,我们探讨了一种用于脑部MRI超分辨率的阐明扩散模型(EDM)框架,并比较了两种U-Net主干架构:(i)一个完整的3D卷积U-Net,该模型使用3D卷积和多头自注意力处理体积补丁;(ii)一个2.5D切片条件U-Net,该模型独立超分辨率每个切片,同时依赖于相邻切片以提供切片间的上下文。两个模型均采用Karras等人提出的连续σ噪声条件,并在FOMO60K数据集的NKI队列上进行训练。在一个包含5个受试者(6个体积,993个切片)的保留测试集上,3D模型达到了37.75 dB的PSNR,0.997的SSIM和0.020的LPIPS,在相同的测试数据和降级管道下,超越了现成的预训练EDSR基线(35.57 dB / 0.024 LPIPS)和2.5D变体(35.82 dB)的所有三个指标。
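The continuous-sigma conditioning shared by both backbones follows the EDM recipe of Karras et al.: log-normal sigma sampling plus the standard preconditioned denoiser. A compact training-step sketch; the `net` call signature is an assumption:

```python
import torch

def edm_loss(net, clean, sigma_data=0.5, p_mean=-1.2, p_std=1.2):
    # Log-normal continuous sigma sampling.
    sigma = (torch.randn(clean.shape[0], 1, 1, 1) * p_std + p_mean).exp()
    noisy = clean + sigma * torch.randn_like(clean)

    # EDM preconditioning coefficients (Karras et al.).
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4

    denoised = c_skip * noisy + c_out * net(c_in * noisy, c_noise.flatten())
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    return (weight * (denoised - clean) ** 2).mean()
```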
cs.CV / 237 / 2603.14684

E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction

E2EGS:无姿态3D重建的事件到边缘高斯喷溅
Kim, Yunsoo, Sung, Changki, Hong, Dasol, Myung, Hyun
Abstract
The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods either assume known poses or rely on depth estimation models that are bounded by their initial observations, failing to generalize as the camera traverses previously unseen regions. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.
Chinese Translation
神经辐射场(NeRF)和3D高斯喷溅(3DGS)的出现推动了新颖视图合成(NVS)的发展。然而,这些方法需要高质量的RGB输入和准确的对应姿态,在快速相机运动或不利光照等现实条件下限制了其鲁棒性。事件相机以高时间分辨率和宽动态范围捕捉每个像素的亮度变化,能够精确感知动态场景,提供了一种有前景的解决方案。然而,现有的基于事件的NVS方法要么假设已知姿态,要么依赖于受初始观测限制的深度估计模型,无法在相机遍历以前未见区域时进行泛化。我们提出了E2EGS,一个仅基于事件流操作的无姿态框架。我们的关键见解是边缘信息提供了丰富的结构线索,对于准确的轨迹估计和高质量的NVS至关重要。为了从嘈杂的事件流中提取边缘,我们利用了边缘和非边缘区域的独特时空特征。事件相机的运动在边缘上诱导出一致的事件,而非边缘区域则产生稀疏噪声。我们通过基于补丁的时间一致性分析来利用这一点,测量局部方差以提取边缘,同时稳健地抑制噪声。提取的边缘引导结构感知的高斯初始化,并在初始化、跟踪和束调整过程中实现边缘加权损失。对合成和真实数据集的广泛实验表明,E2EGS实现了卓越的重建质量和轨迹准确性,建立了一个完全无姿态的基于事件的3D重建范式。
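The patch-based temporal coherence test admits a compact sketch: edges produce temporally consistent event counts across sub-windows, while noise does not. The window count, patch size, and variance criterion below are illustrative assumptions:

```python
import numpy as np

def edge_patch_mask(events, h, w, n_bins=8, patch=8, var_ratio_max=0.5):
    """events: (N, 3) array of (x, y, t); returns a per-patch boolean edge mask."""
    t = events[:, 2]
    bins = np.minimum(((t - t.min()) / (np.ptp(t) + 1e-9) * n_bins).astype(int),
                      n_bins - 1)
    counts = np.zeros((n_bins, h // patch, w // patch))
    px = (events[:, 0] // patch).astype(int)
    py = (events[:, 1] // patch).astype(int)
    np.add.at(counts, (bins, py, px), 1)                 # events per patch per window

    mean, var = counts.mean(axis=0), counts.var(axis=0)
    # Edge patches: consistently active with low variance relative to their mean.
    return (mean > mean.mean()) & (var < var_ratio_max * (mean + 1e-9))
```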
cs.CV / 238 / 2603.14686

MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

MVHOI:通过3D基础模型桥接多视角条件与复杂人机交互视频重现
Tong, Jinguang, Wu, Jinbo, Wang, Kaisiyuan, Shen, Zhelun, Huang, Xuan, Xiang, Mochu, Li, Xuesong, Li, Yingying, Feng, Haocheng, Zhao, Chen, Zhou, Hang, He, Wei, Nguyen, Chuong, Wang, Jingdong, Li, Hongdong
Abstract
Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
Chinese Translation
人机交互(HOI)视频重现以逼真的运动效果仍然是表现性数字人类创作的前沿领域。现有的方法主要处理简单的图像平面运动(例如,平面内平移),在处理复杂的非平面操作(如平面外重新定向)时面临挑战。本文提出了MVHOI,一个两阶段的HOI视频重现框架,通过3D基础模型(3DFM)将多视角参考条件与视频基础模型连接起来。3DFM首先生成基于隐式运动动态的视图一致性对象先验,适用于新视角。随后,一个可控的视频生成模型通过结合多视角参考图像合成高保真度的对象纹理,通过合理的检索机制确保外观一致性。通过使这两个阶段在推理阶段相互强化,我们的框架在生成具有复杂对象操作的长时长HOI视频方面表现出优越的性能。大量实验表明,与之前的方法相比,尤其是在处理复杂3D对象操作的HOI时,取得了显著的改进。
cs.CV / 239 / 2603.14694

Robust Building Damage Detection in Cross-Disaster Settings Using Domain Adaptation

基于领域适应的跨灾害环境下稳健的建筑损伤检测
Mouradi, Asmae, Kshirsagar, Shruti
Abstract
Rapid structural damage assessment from remote sensing imagery is essential for timely disaster response. Within human-machine systems (HMS) for disaster management, automated damage detection provides decision-makers with actionable situational awareness. However, models trained on multi-disaster benchmarks often underperform in unseen geographic regions due to domain shift - a distributional mismatch between training and deployment data that undermines human trust in automated assessments. We explore a two-stage ensemble approach using supervised domain adaptation (SDA) for building damage classification across four severity classes. The pipeline adapts the xView2 first-place method to the Ida-BD dataset using SDA and systematically investigates the effect of individual augmentation components on classification performance. Comprehensive ablation experiments on the unseen Ida-BD test split demonstrate that SDA is indispensable: removing it causes damage detection to fail entirely. Our pipeline achieves the most robust performance using SDA with unsharp-enhanced RGB input, attaining a Macro-F1 of 0.5552. These results underscore the critical role of domain adaptation in building trustworthy automated damage assessment modules for HMS-integrated disaster response.
Chinese Translation
从遥感图像中快速评估结构损伤对于及时的灾害响应至关重要。在灾害管理的人机系统(HMS)中,自动化损伤检测为决策者提供了可操作的情境意识。然而,针对多灾害基准训练的模型在未见过的地理区域往往表现不佳,这主要是由于领域迁移——训练数据与部署数据之间的分布不匹配,导致人们对自动评估的信任下降。我们探讨了一种基于监督领域适应(SDA)的两阶段集成方法,用于四个严重程度类别的建筑损伤分类。该流程将xView2第一名方法调整到Ida-BD数据集,并系统地研究各个增强组件对分类性能的影响。在未见过的Ida-BD测试集上的综合消融实验表明,SDA是不可或缺的:去除它会导致损伤检测的完全失败。我们的流程在使用SDA并以非锐化掩模(unsharp masking)增强的RGB图像作为输入时达到了最稳健的性能,获得了0.5552的Macro-F1值。这些结果强调了领域适应在构建可靠的自动化损伤评估模块中对集成HMS的灾害响应的重要作用。
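The unsharp enhancement named above is a thin wrapper around a Gaussian blur. A one-function sketch with illustrative parameters (not the paper's tuned values):

```python
import cv2

def unsharp(img_bgr, ksize=(9, 9), sigma=3.0, amount=1.0):
    """Classic unsharp masking: add back the high-frequency residual."""
    blurred = cv2.GaussianBlur(img_bgr, ksize, sigma)
    return cv2.addWeighted(img_bgr, 1.0 + amount, blurred, -amount, 0)
```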
cs.CV / 240 / 2603.14701

AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild

AURORA-KITTI:野外任何天气下的深度补全与去噪
Wang, Yiting, Brödermann, Tim, Haghighi, Hamed, Zhao, Haonan, Sakaridis, Christos, Debattista, Kurt, Donzella, Valentina
Abstract
Robust depth completion is fundamental to real-world 3D scene understanding, yet existing RGB-LiDAR fusion methods degrade significantly under adverse weather, where both camera images and LiDAR measurements suffer from weather-induced corruption. In this paper, we introduce AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion in the wild. We further formulate Depth Completion and Denoising (DCD) as a unified task that jointly reconstructs a dense depth map from corrupted sparse inputs while suppressing weather-induced noise. AURORA-KITTI contains over 82K weather-consistent RGBL pairs with metric depth ground truth, spanning diverse weather types, three severity levels, day and night scenes, paired clean references, lens occlusion conditions, and textual descriptions. Moreover, we introduce DDCD, an efficient distillation-based baseline that leverages depth foundation models to inject clean structural priors into in-the-wild DCD training. DDCD achieves state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while maintaining efficiency. Notably, our results further show that weather-aware, physically consistent data contributes more to robustness than architectural modifications alone. Data and code will be released upon publication.
Chinese Translation
稳健的深度补全对于现实世界的三维场景理解至关重要,但现有的RGB-LiDAR融合方法在恶劣天气条件下表现显著下降,摄影机图像和LiDAR测量均受到天气引起的干扰。在本文中,我们介绍了AURORA-KITTI,这是第一个针对野外稳健深度补全的大规模多模态、多天气基准。我们进一步将深度补全与去噪(Depth Completion and Denoising, DCD)形式化为一个统一的任务,该任务从受损的稀疏输入中联合重建稠密深度图,同时抑制天气引起的噪声。AURORA-KITTI包含超过82K个天气一致的RGB-L图像对,并提供了度量深度的真实值,涵盖多种天气类型、三个严重程度级别、白天和夜间场景、配对的干净参考、镜头遮挡条件,以及文本描述。此外,我们引入了DDCD,一种高效的基于蒸馏的方法,利用深度基础模型将干净的结构先验注入到野外DCD训练中。DDCD在AURORA-KITTI和现实世界DENSE数据集上达到了最先进的性能,同时保持高效率。值得注意的是,我们的结果进一步表明,具备天气感知和物理一致性的数据比单纯的架构修改更有助于稳健性。数据和代码将在发表时发布。
cs.CV / 241 / 2603.14702

Fractal Autoregressive Depth Estimation with Continuous Token Diffusion

基于连续标记扩散的分形自回归深度估计
Zhang, Jinchang, Kang, Xinrou, Lu, Guoyu
Abstract
Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.
Chinese Translation
单目深度估计可以从自回归(AR)生成中受益,但直接的AR建模受到RGB与深度之间的模态差距、低效的逐像素生成以及连续深度预测的不稳定性等因素的限制。我们提出了一种分形视觉自回归扩散框架,将深度估计重新表述为一种从粗到细的下一尺度自回归生成过程。VCFR模块将多尺度图像特征与当前深度预测融合,以改善跨模态条件,而条件去噪扩散损失则直接在连续空间中建模深度分布,并减轻由离散量化引起的误差。为了提高计算效率,我们将尺度生成器组织成分形递归架构,在自相似层次中重用基础视觉AR单元。我们进一步引入了一种不确定性感知的稳健共识聚合方案,用于多样本推理,以提高融合的稳定性并提供实用的逐像素可靠性估计。在标准基准上的实验表明了强大的性能,并验证了所提设计的有效性。
cs.CV / 242 / 2603.14706

AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

AdapterTune:用于冻结视觉变换器的零初始化低秩适配器
Khazem, Salim
Abstract
Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminating early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92% of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune
Chinese Translation
使用视觉变换器进行冻结骨干网络迁移面临两个未充分解决的问题:当适配器被简单地插入固定特征提取器时的优化不稳定性,以及缺乏设置适配器容量的原则性指导。我们提出了AdapterTune,它通过一个残差低秩瓶颈增强每个变换器块,其上投影为零初始化,确保适配后的网络准确地从预训练功能开始,并消除早期训练阶段的表示漂移。在分析方面,我们将适配器秩形式化为在特征空间中近似下游任务变化的容量预算。由此产生的过度风险分解预测随着秩的增加,准确性收益呈单调但递减的趋势,这种"肘部"行为我们通过控制实验得到了验证。我们在9个数据集和3个骨干网络规模上进行了评估,并全程采用多随机种子报告。在核心的5个数据集迁移套件中,AdapterTune在仅训练全微调所需参数的0.92%的情况下,平均提高了+14.9个百分点的top-1准确率,并在15个数据集-骨干网络对中有10个超越了全微调。在整个基准测试中,AdapterTune在每个测试的数据集-骨干网络对上均优于仅头部迁移。关于秩、位置和初始化的消融实验则分离了每个设计选择的贡献。代码可在以下链接获取:https://github.com/salimkhazem/adaptertune
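The zero-initialised residual bottleneck is simple to write down; where exactly the adapter sits inside the block is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class ZeroInitAdapter(nn.Module):
    """Residual low-rank bottleneck whose up-projection starts at zero."""
    def __init__(self, dim, rank):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)
        nn.init.zeros_(self.up.weight)   # adapter output is exactly 0 at init,
        nn.init.zeros_(self.up.bias)     # so h + adapter(h) == h: no early drift

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

# Illustrative usage around a frozen ViT block:
# out = adapter(frozen_block(x))
```

Zero-initialising the up-projection is what guarantees the adapted network starts exactly at the pretrained function.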
cs.CV / 243 / 2603.14707

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

视觉混淆代理:利用和防御计算机使用代理中的感知失败
Liu, Xunzhuo, He, Bowei, Liu, Xue, Luo, Andy, Zhang, Haichen, Chen, Huamin
Abstract
Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent's perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent's reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Materials (model, benchmark, and code) are provided at https://github.com/vllm-project/semantic-router.
Chinese Translation
计算机使用代理(CUAs)直接作用于图形用户界面,但它们对屏幕的感知往往不可靠。现有研究主要将这些失败视为性能限制,关注某个动作是否成功,而不是代理是否在正确的对象上行动。我们认为这根本上是一个安全问题。我们形式化了视觉混淆代理:一种失败模式,其中代理基于错误感知的屏幕状态授权某个动作,这种情况可能由于基础错误、对抗性屏幕截图操控或检查时机与使用时机(TOCTOU)竞争引起。这个差距在实践中是可以被利用的:即使是简单的屏幕级操控也可以将常规点击重定向为特权操作,同时与普通代理错误无异。为了缓解这一威胁,我们提出了第一个在代理感知循环之外操作的防护措施。我们的方法,双通道对比分类,独立评估(1)视觉点击目标和(2)代理对该动作的推理,基于特定部署的知识库进行对比,如果任一通道指示风险,则阻止执行。关键见解在于这两个通道捕捉到互补的失败模式:视觉证据检测目标级别的不匹配,而文本推理揭示了视觉上无害的控件背后的危险意图。在控制攻击、真实GUI截图和代理轨迹的测试中,组合防护措施始终优于单独的任一通道。我们的结果表明,CUA的安全性不仅需要更好的动作生成,还需要对代理所认为的点击对象及其原因进行独立验证。材料已提供(模型、基准和代码:https://github.com/vllm-project/semantic-router)。
cs.CV / 244 / 2603.14726

Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator

通过条件手部调节器提升三维全身姿态估计中的手部表现
Moon, Gyeongsik
Abstract
Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.
Chinese Translation
在身体上下文中准确恢复手部姿态仍然是三维全身姿态估计中的一大挑战。这一困难源于一个基本的监督差距:全身姿态估计器是在手部多样性有限的全身数据集上训练的,而仅针对手部的估计器虽然在精细的手指关节运动上表现出色,但缺乏全局身体感知。为了解决这个问题,我们提出了Hand4Whole++,一个模块化框架,利用了预训练全身和手部姿态估计器的优势。我们引入了CHAM(条件手部调节器),一个轻量级模块,通过从预训练的手部姿态估计器提取的手部特征来调节全身特征流。这种调节使得全身模型能够预测既准确又与上半身运动结构一致的手腕方向,而无需重新训练全身模型。同时,我们直接整合了手部姿态估计器预测的手指关节运动和手部形状,通过可微分的刚性对齐将其与全身网格对齐。这一设计使Hand4Whole++能够将全局一致的身体推理与细致的手部细节相结合。大量实验表明,Hand4Whole++显著提高了手部准确性,并提升了整体全身姿态的质量。
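CHAM's internals are not spelled out in the abstract beyond "modulates the whole-body feature stream using hand-specific features"; a generic FiLM-style conditional modulation, one common way to implement such a module, might look like this (all names and dimensions are hypothetical):

```python
import torch
import torch.nn as nn

class ConditionalModulator(nn.Module):
    """FiLM-style sketch: hand features predict a per-channel scale and
    shift that modulate the whole-body feature stream."""
    def __init__(self, body_dim: int, hand_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(hand_dim, 2 * body_dim)

    def forward(self, body_feat: torch.Tensor, hand_feat: torch.Tensor):
        scale, shift = self.to_scale_shift(hand_feat).chunk(2, dim=-1)
        return body_feat * (1 + scale) + shift  # identity when scale = shift = 0

mod = ConditionalModulator(body_dim=256, hand_dim=128)
out = mod(torch.randn(2, 256), torch.randn(2, 128))
```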
cs.CV / 245 / 2603.14727

Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach

通过前段眼部成像进行自动化糖尿病筛查:一种深度学习和可解释人工智能的方法
Maqsood, Hasaan, Khan, Saif Ur Rehman, Vollmer, Sebastian, Dengel, Andreas, Asim, Muhammad Nabeel
Abstract
Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource-limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging, a readily accessible alternative utilizing standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast-limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain-specific ocular images substantially improved model performance. EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, a substantial improvement over ImageNet-only initialization (94.63% F1). Notably, the model attained near-perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.
Chinese Translation
糖尿病视网膜病变筛查传统上依赖于眼底摄影,所需的专业设备和专业知识在初级医疗和资源有限的环境中往往难以获得。我们开发并验证了一种深度学习(DL)系统,利用前段眼部成像进行自动化糖尿病分类,这是一种仅需标准摄影设备、易于获取的替代方案。该系统利用虹膜、巩膜和结膜中的可见生物标志物,这些生物标志物与全身糖尿病状态相关。我们系统地评估了五种现代架构(EfficientNet-V2-S结合自监督学习(SSL)、视觉变换器(Vision Transformer)、Swin变换器、ConvNeXt-Base和ResNet-50)在2,640幅临床注释的前段图像上的表现,这些图像涵盖了正常、控制良好的糖尿病和控制不良的糖尿病类别。我们实施了一种定制的预处理流程,结合了镜面反射减弱和对比度受限自适应直方图均衡(CLAHE),以增强对分类至关重要的细微血管和纹理模式。使用SimCLR在特定领域的眼部图像上进行的SSL显著提高了模型性能。结合SSL的EfficientNet-V2-S达到了最佳性能,F1分数为98.21%,精确率为97.90%,召回率为98.55%,相较于仅使用ImageNet初始化(94.63% F1)有了显著提升。值得注意的是,该模型在正常类别上达到了近乎完美的精确率(100%),这对于减少不必要的临床转诊至关重要。
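As an illustration of the preprocessing the abstract names, a plausible OpenCV pipeline combining crude specular-reflection masking with CLAHE; the saturation threshold and CLAHE parameters below are guesses, not the paper's values:

```python
import cv2
import numpy as np

def preprocess_anterior_segment(img_bgr: np.ndarray) -> np.ndarray:
    """img_bgr: uint8 BGR photo of the anterior segment."""
    # Treat near-saturated pixels as specular highlights and inpaint them.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    spec_mask = (gray > 240).astype(np.uint8) * 255
    img = cv2.inpaint(img_bgr, spec_mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
    # CLAHE on the L channel enhances subtle vascular/textural patterns.
    l, a, b = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2LAB))
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```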
cs.CV / 246 / 2603.14733

A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

一种技能增强的自主框架及多视频理解的基准
Zhang, Yue, Jing, Liqiang, Li, Jia, Tian, Yapeng, Du, Xinya, Guo, Yunhui, Gogate, Vibhav
Abstract
Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
Chinese Translation
多模态大型语言模型在单视频理解方面表现出色,但其在多个视频之间进行推理的能力仍然有限。现有方法通常将多个视频连接成单一输入进行直接推理,这导致训练与推理不匹配、因帧压缩造成的信息损失以及缺乏明确的跨视频协调。同时,目前的多视频基准主要强调事件级别的比较,而身份级别匹配、细粒度识别和结构化多步骤推理则未得到充分探索。为了解决这些问题,我们提出了MVX-Bench,这是一种多视频跨维度基准,将11个经典计算机视觉任务重新构造成统一的多视频问答框架,涵盖了来自多样化真实世界数据集的4,255个视频中的1,442个问题。我们进一步提出SAMA,一种技能增强的自主框架用于多视频理解,该框架整合了视觉工具、任务特定技能以及冲突意识验证机制,以实现迭代和结构化的推理。实验结果表明,SAMA在MVX-Bench上的表现超过了强大的开源基线和GPT,消融实验验证了技能设计和冲突解决的有效性。
cs.CV / 247 / 2603.14738

Efficient Event Camera Volume System

高效事件相机体积系统
Soto, Juan Camilo, Noronha, Ian, Bharti, Saru, Kaur, Upinder
Abstract
Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce EECVS (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain's sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity, with DTFT delivering the lowest earth mover's distance. In downstream segmentation tasks, EECVS demonstrates robust generalization. Notably, our approach demonstrates exceptional cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves mean IoU 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC. Our ROS2 implementation provides real-time deployment, with DCT processing achieving 1.5 ms latency and 2.7X higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.
Chinese Translation
事件相机承诺低延迟和高动态范围,但其稀疏输出对标准机器人管道的集成构成挑战。我们提出了EECVS(高效事件相机体积系统),这是一个新颖的框架,将事件流建模为连续时间的狄拉克脉冲列,通过在事件时间戳处的直接变换评估实现无伪影压缩。我们的关键创新结合了基于密度的自适应选择,涵盖离散余弦变换(DCT)、离散时间傅里叶变换(DTFT)和离散小波变换(DWT),并针对每个领域的稀疏特性量身定制了特定于变换的系数剪枝策略。该框架消除了时间分箱伪影,同时根据实时事件密度分析自动调整压缩策略。在EHPT-XC和MVSEC数据集上,我们的框架实现了卓越的重建保真度,其中DTFT取得了最低的推土机距离(earth mover's distance)。在下游分割任务中,EECVS展示了强大的泛化能力。值得注意的是,我们的方法在跨数据集泛化方面表现出色:在使用EventSAM分割进行评估时,EECVS在MVSEC上实现了平均交并比(IoU)0.87,而体素网格在24个通道上的IoU为0.44,同时在EHPT-XC上仍具竞争力。我们的ROS2实现提供了实时部署,DCT处理实现了1.5毫秒的延迟和比其他变换高2.7倍的吞吐量,建立了第一个自适应事件压缩框架,保持了计算效率和在多样化机器人场景中的卓越泛化能力。
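The core trick, evaluating a transform directly at event timestamps rather than binning events into frames, fits in a few lines; the normalization, coefficient count, and top-k pruning below are illustrative assumptions:

```python
import numpy as np

def dirac_dct(timestamps: np.ndarray, polarities: np.ndarray,
              window: float, n_coeffs: int = 32) -> np.ndarray:
    """Treat the stream as a train of weighted Dirac impulses and evaluate
    DCT-style basis functions at the exact event times (no temporal bins)."""
    t = timestamps / window                  # normalize times to [0, 1]
    k = np.arange(n_coeffs)[:, None]         # (K, 1) frequency indices
    basis = np.cos(np.pi * k * t[None, :])   # cos(pi*k*t) at each event
    return basis @ polarities                # (K,) transform coefficients

ts = np.array([0.10, 0.25, 0.40, 0.70, 0.90]) * 1e-3   # 5 events in 1 ms
pol = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
coeffs = dirac_dct(ts, pol, window=1e-3)
kept = coeffs * (np.abs(coeffs) >= np.sort(np.abs(coeffs))[-8])  # crude top-8 pruning
```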
cs.CV / 248 / 2603.14739

TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective

TrajMamba:一种基于自我运动引导的Mamba模型,用于从自我中心视角预测行人轨迹
Peng, Yusheng, Zhang, Gaofeng, Zheng, Liping
Abstract
Future trajectory prediction of a tracked pedestrian from an egocentric perspective is a key task in areas such as autonomous driving and robot navigation. The challenge of this task lies in the complex dynamic relative motion between the ego-camera and the tracked pedestrian. To address this challenge, we propose an ego-motion-guided trajectory prediction network based on the Mamba model. Firstly, two Mamba models are used as encoders to extract pedestrian motion and ego-motion features from pedestrian movement and ego-vehicle movement, respectively. Then, an ego-motion-guided Mamba decoder explicitly models the relative motion between the pedestrian and the vehicle, integrating pedestrian motion features as historical context with ego-motion features as guiding cues to produce decoded features. Finally, the future trajectory is generated from the decoded features corresponding to the future timestamps. Extensive experiments demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the PIE and JAAD datasets.
Chinese Translation
从自我中心视角预测被跟踪行人的未来轨迹是自动驾驶和机器人导航等领域中的一项关键任务。该任务的挑战在于自我摄像头与被跟踪行人之间复杂的动态相对运动。为了解决这一挑战,我们提出了一种基于Mamba模型的自我运动引导轨迹预测网络。首先,使用两个Mamba模型作为编码器,分别从行人运动和自我车辆运动中提取行人运动和自我运动特征。然后,设计了一个自我运动引导的Mamba解码器,通过将行人运动特征作为历史上下文与自我运动特征作为引导线索相结合,明确建模行人与车辆之间的相对运动,以获得解码特征。最后,从对应于未来时间戳的解码特征中生成未来轨迹。大量实验表明了所提模型的有效性,该模型在PIE和JAAD数据集上达到了最先进的性能。
cs.CV / 249 / 2603.14741

PHAC: Promptable Human Amodal Completion

PHAC:可提示的人类无模态补全
Noh, Seung Young, Chang, Ju Yong
Abstract
Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.
Chinese Translation
条件图像生成方法在以人为中心的应用中越来越多地被使用,但现有的人类无模态补全(HAC)模型对完成内容的用户控制能力有限。给定一张被遮挡的人物图像,它们在保留可见区域的同时幻觉出不可见区域,但无法可靠地纳入用户指定的约束条件,如期望的姿态或空间范围。因此,用户往往不得不反复采样模型,直到获得满意的输出。基于姿态引导的人物图像合成(PGPIS)方法允许显式的姿态条件,但通常无法保持实例特定的可见外观,并且即使在强扩散模型先验的基础上构建,也往往偏向于训练分布。为了解决这些局限性,我们引入了可提示的人类无模态补全(PHAC),这是一个新的任务,旨在完成被遮挡的人类图像,同时满足可见外观约束和多个用户提示。用户提供简单的基于点的提示,例如目标姿态的附加关节或期望区域的边界框;这些提示通过专门针对每种提示类型的ControlNet模块进行编码。这些模块将提示信号注入到预训练的扩散模型中,我们仅微调交叉注意力块,以获得强提示对齐,而不降低基础生成先验的质量。为了进一步保留可见内容,我们提出了一种基于修复的精细化模块,该模块从略微噪声的粗略补全开始,忠实地保留可见区域,并确保在遮挡边界处无缝融合。在HAC和PGPIS基准上的广泛实验表明,我们的方法产生了更具物理合理性和更高质量的补全,同时显著改善了与现有无模态补全和姿态引导合成方法的提示对齐。
cs.CV / 250 / 2603.14750

Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization

基于人脸引导的情感边界增强用于弱监督时序情感定位
Han, Cailing, Li, Zhangbin, Zhou, Jinxing, Qian, Wei, Hu, Jingjing, Zhou, Yanghao, Duan, Zhangling, Guo, Dan
Abstract
Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To further tackle the challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues. We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings.
Chinese Translation
点级弱监督时序情感定位(P-WTSL)旨在利用时间戳情感注释检测未裁剪多模态视频中的情感相关片段,这大大减少了成本高昂的帧级标注。为进一步解决P-WTSL中不精确情感边界的挑战,我们提出了基于人脸引导的情感边界增强网络(FSENet),这是一个统一框架,利用细粒度的人脸特征来指导情感定位。具体而言,我们的方法首先引入了人脸引导情感发现(FSD)模块,通过双分支建模将人脸特征融入多模态交互,以有效提取情感刺激线索;然后我们提出了点感知情感语义对比(PSSC)策略,通过对比学习区分接近注释点的候选点(帧级)的情感语义,从而增强模型识别情感边界的能力;最后,我们设计了边界感知情感伪标签生成(BSPG)方法,将稀疏点注释转换为时间上平滑的监督伪标签。在基准测试中的广泛实验和可视化结果表明了我们框架的有效性,其在全监督、视频级和点级弱监督下均实现了最先进性能,从而展示了我们FSENet在不同注释设置下的强泛化能力。
cs.CV / 251 / 2603.14764

Topology-Preserving Data Augmentation for Ring-Type Polygon Annotations

保持拓扑结构的数据增强用于环形多边形标注
Laudari, Sudip, Baek, Sang Hun
Abstract
Geometric data augmentation is widely used in segmentation pipelines and typically assumes that polygon annotations represent simply connected regions. However, in structured domains such as architectural floorplan analysis, ring-type regions are often encoded as a single cyclic polygon chain connecting outer and inner boundaries. During augmentation, clipping operations may remove intermediate vertices and disrupt this cyclic connectivity, breaking the structural relationship between the boundaries. In this work, we introduce an order-preserving polygon augmentation strategy that performs transformations in mask space and then projects surviving vertices back into index-space to restore adjacency relations. This repair maintains the original traversal order of the polygon and preserves topological consistency with minimal computational overhead. Experiments demonstrate that the approach reliably restores connectivity, achieving near-perfect Cyclic Adjacency Preservation (CAP) across both single and compound augmentations.
Chinese Translation
几何数据增强在分割管道中被广泛应用,通常假设多边形标注表示简单连通区域。然而,在建筑平面图分析等结构化领域,环形区域通常被编码为连接外边界和内边界的单一循环多边形链。在增强过程中,裁剪操作可能会移除中间顶点,从而破坏这种循环连通性,打破边界之间的结构关系。在本研究中,我们提出了一种保持顺序的多边形增强策略,该策略在掩膜空间中执行变换,然后将存活的顶点投影回索引空间,以恢复邻接关系。这种修复保持了多边形的原始遍历顺序,并以最小的计算开销保持拓扑一致性。实验表明,该方法可靠地恢复了连通性,在单一和复合增强中均实现了近乎完美的循环邻接保持(Cyclic Adjacency Preservation, CAP)。
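A minimal sketch of the repair described above: clipping in mask space yields a survival mask over the original cyclic chain, and boolean indexing in index space keeps the original traversal order, so cyclic adjacency can be restored by reconnecting consecutive survivors (the ring-with-bridge encoding here is illustrative):

```python
import numpy as np

def order_preserving_repair(vertices: np.ndarray, survived: np.ndarray):
    """vertices: (N, 2) cyclic chain connecting outer and inner boundaries;
    survived: boolean mask produced by mask-space clipping."""
    kept = vertices[survived]                     # original traversal order
    n = len(kept)
    edges = [(i, (i + 1) % n) for i in range(n)]  # repaired cyclic adjacency
    return kept, edges

ring = np.array([[0, 0], [4, 0], [4, 4], [0, 4],   # outer boundary
                 [0, 0],                           # bridge back to start
                 [1, 1], [1, 3], [3, 3], [3, 1]])  # inner boundary (hole)
mask = np.ones(len(ring), dtype=bool)
mask[2] = False                                    # clipping removed a vertex
kept, edges = order_preserving_repair(ring, mask)
```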
cs.CV / 252 / 2603.14765

SSR: A Training-Free Approach for Streaming 3D Reconstruction

SSR:一种无训练的流式三维重建方法
Deng, Hui, Mao, Yuxin, He, Yuxin, Dai, Yuchao
Abstract
Streaming 3D reconstruction demands long-horizon state updates under strict latency constraints, yet stateful recurrent models often suffer from geometric drift as errors accumulate over time. We revisit this problem from a Grassmannian manifold perspective: the latent persistent state can be viewed as a subspace representation, i.e., a point evolving on a Grassmannian manifold, where temporal coherence implies the state trajectory should remain on (or near) this manifold. Based on this view, we propose Self-expressive Sequence Regularization (SSR), a plug-and-play, training-free operator that enforces Grassmannian sequence regularity during inference. Given a window of historical states, SSR computes an analytical affinity matrix via the self-expressive property and uses it to regularize the current update, effectively pulling noisy predictions back toward the manifold-consistent trajectory with minimal overhead. Experiments on long-sequence benchmarks demonstrate that SSR consistently reduces drift and improves reconstruction quality across multiple streaming 3D reconstruction tasks.
Chinese Translation
流式三维重建在严格的延迟约束下需要进行长时间的状态更新,然而有状态的递归模型往往因错误的累积而遭受几何漂移。我们从Grassmann流形的角度重新审视这一问题:潜在的持久状态可以被视为子空间表示,即在Grassmann流形上演化的点,其中时间一致性意味着状态轨迹应保持在(或接近)该流形上。基于这一观点,我们提出了自表达序列正则化(Self-expressive Sequence Regularization,SSR),这是一种即插即用的、无训练的算子,在推理过程中强制执行Grassmann序列的正则性。给定一段历史状态窗口,SSR通过自表达特性计算一个解析的亲和矩阵,并利用它来正则化当前更新,有效地将噪声预测拉回到与流形一致的轨迹上,且开销最小。在长序列基准测试中的实验表明,SSR始终减少漂移并提高多个流式三维重建任务的重建质量。
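A numpy sketch of the self-expressive idea under simplifying assumptions: solve a ridge-regularized self-expression of the current state in the span of the history window in closed form, then blend the noisy update toward that reconstruction (the paper's exact operator and weighting may differ):

```python
import numpy as np

def ssr_regularize(history: np.ndarray, state: np.ndarray,
                   lam: float = 1e-2, alpha: float = 0.5) -> np.ndarray:
    """history: (d, w) past latent states as columns; state: (d,) noisy update.
    c* = (H^T H + lam*I)^{-1} H^T s is the analytic self-expressive
    coefficient vector; H @ c* is the manifold-consistent reconstruction."""
    H = history
    gram = H.T @ H + lam * np.eye(H.shape[1])
    c = np.linalg.solve(gram, H.T @ state)
    return (1 - alpha) * state + alpha * (H @ c)

rng = np.random.default_rng(0)
s_reg = ssr_regularize(rng.standard_normal((64, 8)), rng.standard_normal(64))
```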
cs.CV / 253 / 2603.14770

AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas

AnyPhoto:基于位置画布的多人物身份保持图像生成与 ID 自适应调制
Yuan, Longhui
Abstract
Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement, showing great potential application value.
Chinese Translation
多人物身份保持生成需要将多个参考面孔绑定到指定位置,并在文本提示下进行操作。强烈的身份/布局条件往往会触发复制粘贴的捷径,从而削弱基于提示的可控性。我们提出了 AnyPhoto,这是一种扩散-变换器微调框架,具有以下特点:(i) 与 RoPE 对齐的位置画布以及位置对齐的标记剪枝以实现空间定位,(ii) 来自人脸识别嵌入的 AdaLN 风格身份自适应调制,以实现持久的身份注入,以及 (iii) 身份隔离注意力,以防止跨身份干扰。训练结合了条件流匹配与嵌入空间人脸相似度损失,同时进行参考面孔替换和位置画布降级,以抑制捷径。在 MultiID-Bench 上,AnyPhoto 提高了身份相似度,同时减少了复制粘贴的倾向,且随着身份数量的增加,提升效果愈加显著。AnyPhoto 还支持基于提示的风格化,并能准确定位,显示出巨大的潜在应用价值。
cs.CV / 254 / 2603.14772

Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image

基于单幅图像的可动画3D虚拟形象的零样本重建与布料动力学
Kwon, Joohyun, Sim, Geonhee, Moon, Gyeongsik
Abstract
Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods.
Chinese Translation
现有的单幅图像3D人类虚拟形象方法主要依赖于刚性关节变换,限制了其模拟真实布料动力学的能力。我们提出了DynaAvatar,这是一种零样本框架,可以从单幅图像重建具有运动依赖布料动力学的可动画3D人类虚拟形象。DynaAvatar在大规模多人的运动数据集上进行训练,采用基于Transformer的前馈架构,直接预测动态3D高斯变形,而无需特定于个体的优化。为了克服动态捕捉数据的稀缺性,我们引入了一种静态到动态的知识转移策略:在大规模静态捕捉上预训练的Transformer提供了强大的几何和外观先验,这些先验通过在动态捕捉上进行轻量级LoRA微调有效适应运动依赖的变形。我们进一步提出了DynaFlow损失,这是一种基于光流的目标函数,为渲染空间中的布料动力学提供可靠的运动方向几何线索。最后,我们重新标注了现有动态捕捉数据集中缺失或噪声的SMPL-X拟合,因为大多数公共动态捕捉数据集包含不完整或不可靠的拟合,不适合用于训练高质量的3D虚拟形象重建模型。实验表明,DynaAvatar生成了视觉丰富且具有良好泛化能力的动画,超越了之前的方法。
cs.CV / 255 / 2603.14781

High-Fidelity 3D Facial Avatar Synthesis with Controllable Fine-Grained Expressions

高保真3D人脸头像合成与可控细粒度表情
He, Yikang, Zhang, Jichao, Wang, Wei, Sebe, Nicu, Zhao, Yao
Abstract
Facial expression editing methods can be mainly categorized into two types based on their architectures: 2D-based and 3D-based methods. The former lacks 3D face modeling capabilities, making it difficult to edit 3D factors effectively. The latter has demonstrated superior performance in generating high-quality and view-consistent renderings using single-view 2D face images. Although these methods have successfully used animatable models to control facial expressions, they still have limitations in achieving precise control over fine-grained expressions. To address this issue, in this paper, we propose a novel approach by simultaneously refining both the latent code of a pretrained 3D-Aware GAN model for texture editing and the expression code of the driven 3DMM model for mesh editing. Specifically, we introduce a Dual Mappers module, comprising Texture Mapper and Emotion Mapper, to learn the transformations of the given latent code for textures and the expression code for meshes, respectively. To optimize the Dual Mappers, we propose a Text-Guided Optimization method, leveraging a CLIP-based objective function with expression text prompts as targets, while integrating a SubSpace Projection mechanism to project the text embedding to the expression subspace such that we can have more precise control over fine-grained expressions. Extensive experiments and comparative analyses demonstrate the effectiveness and superiority of our proposed method.
Chinese Translation
人脸表情编辑方法主要可以根据其架构分为两类:基于2D的方法和基于3D的方法。前者缺乏3D面部建模能力,使得有效编辑3D因素变得困难。后者在使用单视图2D人脸图像生成高质量和视图一致的渲染方面表现出色。尽管这些方法成功地使用可动画模型来控制面部表情,但在实现对细粒度表情的精确控制方面仍然存在局限性。为了解决这个问题,本文提出了一种新颖的方法,通过同时细化预训练的3D-Aware GAN模型的潜在编码(latent code)以进行纹理编辑,以及驱动的3DMM模型的表情编码(expression code)以进行网格编辑。具体而言,我们引入了一个双映射器模块(Dual Mappers),包括纹理映射器(Texture Mapper)和表情映射器(Emotion Mapper),分别学习给定潜在编码的纹理变换和网格的表情编码变换。为了优化双映射器,我们提出了一种文本引导优化(Text-Guided Optimization)方法,利用基于CLIP的目标函数,并将表情文本提示作为目标,同时集成子空间投影机制(SubSpace Projection),将文本嵌入投影到表情子空间,从而使我们能够对细粒度表情进行更精确的控制。大量实验和比较分析证明了我们提出方法的有效性和优越性。
cs.CV / 256 / 2603.14790

Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making

导演思维:通过协作决策的多模态代理驱动电影预视觉化
Nan, Shufeng, Li, Mengtian, Zheng, Sixiao, Lu, Yuwei, Zhang, Han, Fu, Yanwei
Abstract
We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within the game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.
Chinese Translation
我们提出了导演思维(Mind-of-Director),这是一个多模态代理驱动的电影预视觉化框架,旨在模拟电影制作团队的协作决策过程。在给定创意的基础上,导演思维协调多个专业代理,在游戏引擎中生成预视觉化序列。该框架由四个合作模块构成:剧本发展(Script Development),代理在此模块中迭代草拟和完善剧本;虚拟场景设计(Virtual Scene Design),将文本转换为语义对齐的三维环境;角色行为控制(Character Behaviour Control),决定角色的站位和动作;以及镜头规划(Camera Planning),优化取景、运动和构图以实现电影化的镜头效果。基于游戏引擎构建的实时视觉编辑系统进一步支持在场景、行为和镜头间进行交互检查和同步时间线调整。大量实验和人类评估结果表明,导演思维在大约25分钟内为每个创意生成高质量且语义上相符的预视觉化序列,展示了代理协作在自动化原型制作和人机协作电影制作中的有效性。
cs.CV / 257 / 2603.14794

Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

面对面:一个用于多人互动建模的视频数据集
Chu, Ernie, Patel, Vishal M.
Abstract
Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
Chinese Translation
建模人类对话的反应节奏仍然困难,因为大多数音视频数据集描绘的是孤立的发言者进行短暂独白。我们介绍了与吉米·法伦的面对面(Face-to-Face with Jimmy Fallon, F2F-JF),这是一个包含70小时、14000个片段的两人脱口秀交流数据集,保留了嘉宾发言与主持人回应之间的顺序依赖关系。一个半自动化的流程结合了多人跟踪、语音分离和轻量级人工验证,以提取时间对齐的主持人/嘉宾轨迹,并提供紧凑的剪辑和可直接用于下游建模的元数据。我们通过一个反应式、语音驱动的数字化身任务展示该数据集,其中时间段$[t_1,t_2]$内的主持人视频是根据其音频以及嘉宾在时间段$[t_0,t_1]$内的前序视频生成的。将MultiTalk风格的扩散模型以这种跨人物视觉上下文为条件,相较于仅音频基线,在情感FID和FVD上取得了小幅但一致的提升,同时保持了口型同步质量。该数据集、预处理方案和基线共同提供了一个端到端的蓝图,用于研究双人顺序行为,本文对此进行了扩展。数据集和代码将公开发布。
cs.CV / 258 / 2603.14796

Global Truncated Loss Minimization for Robust and Threshold-Resilient Geometric Estimation

针对鲁棒性和阈值抗扰性的几何估计的全局截断损失最小化
Huang, Tianyu, Peng, Liangzu, Zhang, Xinyue, Guan, Tongfan, Dong, Jinhu, Li, Haoang, Kneip, Laurent, Liu, Yun-Hui
Abstract
To achieve outlier-robust geometric estimation, robust objective functions are generally employed to mitigate the influence of outliers. The widely used consensus maximization (CM) is highly robust when paired with global branch-and-bound (BnB) search. However, CM relies solely on inlier counts and is sensitive to the inlier threshold. Besides, the discrete nature of CM leads to loose bounds, necessitating extensive BnB iterations and computational cost. Truncated losses (TL), another continuous alternative, leverage residual information more effectively and could potentially overcome these issues. But to our knowledge, no prior work has systematically explored globally minimizing TL with BnB and its potential for enhanced threshold resilience or search efficiency. In this work, we propose GTM, the first unified BnB-based framework for globally-optimal TL minimization across diverse geometric problems. GTM involves a hybrid solving design: given an n-dimensional problem, it performs BnB search over an (n-1)-dimensional subspace while the remaining 1D variable is solved by bounding the objective function. Our hybrid design not only reduces the search space, but also enables us to derive Lipschitz-continuous bounding functions that are general, tight, and can be efficiently solved by a classic global Lipschitz solver named DIRECT, which brings further acceleration. We conduct a systematic evaluation of various BnB-based methods for CM and TL on the robust linear regression problem, showing that GTM enjoys remarkable threshold resilience and the highest efficiency compared to baseline methods. Furthermore, we apply GTM to different geometric estimation problems with diverse residual forms. Extensive experiments demonstrate that GTM achieves state-of-the-art outlier-robustness and threshold-resilience while maintaining high efficiency across these estimation tasks.
Chinese Translation
为了实现对离群点鲁棒的几何估计,通常采用鲁棒目标函数来减轻离群点的影响。广泛使用的共识最大化(Consensus Maximization, CM)在与全局分支定界(Branch-and-Bound, BnB)搜索相结合时具有很高的鲁棒性。然而,CM仅依赖于内点计数,并对内点阈值敏感。此外,CM的离散特性导致界限松弛,需要大量的BnB迭代,计算开销高。截断损失(Truncated Loss, TL)作为另一种连续替代方案,能更有效地利用残差信息,有望克服这些问题。但据我们所知,之前的研究尚未系统地探索与BnB结合的全局最小化TL及其在增强阈值抗扰性或搜索效率方面的潜力。在本研究中,我们提出了GTM,这是第一个基于BnB的统一框架,用于在多种几何问题中全局最优地最小化TL损失。GTM采用混合求解设计:对于一个n维问题,它在(n-1)维子空间上执行BnB搜索,而剩余的1维变量通过界定目标函数来求解。我们的混合设计不仅减少了搜索空间,还使我们能够推导出通用、紧致且可以通过经典全局Lipschitz求解器DIRECT高效求解的Lipschitz连续界限函数,从而进一步加速。我们对多种基于BnB的CM和TL方法在鲁棒线性回归问题上进行了系统评估,结果表明GTM在阈值抗扰性和效率方面显著优于基线方法。此外,我们将GTM应用于具有多种残差形式的不同几何估计问题。大量实验表明,GTM在这些估计任务中实现了最先进的离群点鲁棒性和阈值抗扰性,同时保持了高效率。
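For intuition, the truncated loss is continuous in the residuals while consensus maximization only counts inliers, which is why TL admits tighter bounds for branch-and-bound; a minimal form (the paper's exact parameterization may differ):

```python
import numpy as np

def truncated_loss(residuals: np.ndarray, c: float) -> float:
    """Quadratic inside the threshold, constant c^2 outside, so each
    outlier contributes a bounded penalty that still varies with c."""
    return float(np.sum(np.minimum(residuals ** 2, c ** 2)))

def consensus(residuals: np.ndarray, c: float) -> int:
    """CM's discrete 0/1 inlier count, for contrast."""
    return int(np.sum(np.abs(residuals) <= c))

res = np.array([0.01, 0.05, 2.0, -3.5, 0.02])
print(truncated_loss(res, c=0.1), consensus(res, c=0.1))
```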
cs.CV / 259 / 2603.14807

HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System

HiMemVLN:通过层次记忆系统增强开源零样本视觉与语言导航的可靠性
Lyu, Kailin, Wu, Kangyi, Li, Pengna, Hu, Xiuyu, Si, Qingyi, Miao, Cui, Yang, Ning, Wang, Zihang, Xiao, Long, Hu, Lianyu, Sun, Jingyuan, Hao, Ce
Abstract
LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
Chinese Translation
基于大型语言模型(LLM)的智能体在视觉-语言导航(VLN)任务中展示了令人印象深刻的零样本表现。然而,大多数零样本方法主要依赖于闭源LLM作为导航器,这面临着高令牌成本和潜在数据泄露风险等挑战。近期的研究尝试通过结合开源LLM与时空思维链(CoT)框架来解决这一问题,但与闭源模型相比仍相差甚远。在本研究中,我们通过对导航过程的详细分析识别出一个关键问题,即导航遗忘(Navigation Amnesia)。该问题导致导航失败,并加大了开源与闭源方法之间的差距。为了解决这一问题,我们提出了HiMemVLN,它将层次记忆系统(Hierarchical Memory System)融入多模态大型模型中,以增强视觉感知回忆和长期定位,缓解遗忘问题并提高智能体的导航性能。在模拟和真实环境中的大量实验表明,HiMemVLN的性能几乎是当前开源最先进方法的两倍。代码可在 https://github.com/lvkailin0118/HiMemVLN 获取。
cs.CV / 260 / 2603.14816

M2IR: Proactive All-in-One Image Restoration via Mamba-style Modulation and Mixture-of-Experts

M2IR:通过曼巴风格调制和专家混合的主动一体化图像恢复
Wang, Shiwei, Wang, Yongzhen, Hu, Bingwen, Zhang, Liyan, Zhang, Xiao-Ping, Wei, Mingqiang
Abstract
While Transformer-based architectures have dominated recent advances in all-in-one image restoration, they remain fundamentally reactive: propagating degradations rather than proactively suppressing them. In the absence of explicit suppression mechanisms, degraded signals interfere with feature learning, compelling the decoder to balance artifact removal and detail preservation, thereby increasing model complexity and limiting adaptability. To address these challenges, we propose M2IR, a novel restoration framework that proactively regulates degradation propagation during the encoding stage and efficiently eliminates residual degradations during decoding. Specifically, the Mamba-Style Transformer (MST) block performs pixel-wise selective state modulation to mitigate degradations while preserving structural integrity. In parallel, the Adaptive Degradation Expert Collaboration (ADEC) module utilizes degradation-specific experts guided by a DA-CLIP-driven router and complemented by a shared expert to eliminate residual degradations through targeted and cooperative restoration. By integrating the MST block and ADEC module, M2IR transitions from passive reaction to active degradation control, effectively harnessing learned representations to achieve superior generalization, enhanced adaptability, and refined recovery of fine-grained details across diverse all-in-one image restoration benchmarks. Our source codes are available at https://github.com/Im34v/M2IR.
Chinese Translation
尽管基于Transformer的架构在一体化图像恢复的最新进展中占据主导地位,但它们在根本上仍然是被动的:传播退化而不是主动抑制它们。在缺乏明确抑制机制的情况下,退化信号干扰特征学习,迫使解码器在去除伪影和保持细节之间进行平衡,从而增加模型复杂性并限制适应性。为了解决这些挑战,我们提出了M2IR,这是一种新颖的恢复框架,主动调节编码阶段的退化传播,并在解码过程中高效消除残余退化。具体而言,曼巴风格Transformer(MST)模块执行逐像素选择性状态调制,以减轻退化,同时保持结构完整性。与此同时,自适应退化专家协作(ADEC)模块利用由DA-CLIP驱动的路由器指导的退化特定专家,并通过共享专家补充,针对性地和协同地消除残余退化。通过整合MST模块和ADEC模块,M2IR从被动反应转变为主动退化控制,有效利用学习到的表示,实现卓越的泛化能力、增强的适应性以及在多样化的一体化图像恢复基准上精细恢复细节。我们的源代码可在 https://github.com/Im34v/M2IR 获取。
cs.CV / 261 / 2603.14819

RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

RAZOR:面向视觉变换器和扩散模型的比例感知层编辑以实现目标性遗忘
Ranjan, Ravi, Grover, Utkarsh, Lin, Xiaomin, Polyzou, Agoritsa
Abstract
Transformer-based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance. We introduce Ratio-Aware Zero/One-step Optimized Retentive unlearning (RAZOR), a lightweight, model-agnostic unlearning framework that generalizes forgetting updates to coordinated multi-layer and multi-head edits within transformer backbones. RAZOR identifies the most important layers and attention heads by measuring how much they contribute to forgetting the target data while preserving useful knowledge. Then, it updates these parts of the model using a carefully regularized rule to avoid harming overall performance. The set of edited components grows gradually, ensuring precise unlearning without over-editing or damaging unrelated capabilities. We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks. Our results show that RAZOR achieves highly accurate and stable forgetting, even under quantization. This approach offers stronger retention and better efficiency than prior methods. Notably, it also operates significantly faster than conventional techniques. These results demonstrate that RAZOR is a practical and scalable solution for safe, adaptive unlearning in transformer-based vision models.
Chinese Translation
基于变换器的扩散和视觉-语言模型取得了显著成功;然而,在不重新训练的情况下高效地去除不必要或敏感信息仍然是模型安全性和合规性面临的主要挑战。我们提出了比例感知零/一步优化保留遗忘(RAZOR),这是一种轻量级、模型无关的遗忘框架,能够将遗忘更新推广到变换器主干中的协调多层和多头编辑。RAZOR通过测量各层和注意力头对遗忘目标数据的贡献,同时保留有用知识,来识别最重要的层和注意力头。然后,它使用经过精心正则化的规则更新模型的这些部分,以避免损害整体性能。被编辑组件的集合逐渐增加,确保精确的遗忘而不至于过度编辑或损害无关能力。我们在CLIP、Stable Diffusion和视觉-语言模型(VLMs)上评估RAZOR,使用广泛采用的遗忘基准,涵盖身份、风格和物体擦除任务。我们的结果表明,RAZOR即使在量化情况下也能实现高度准确和稳定的遗忘。这种方法比以往的方法提供了更强的保留能力和更高的效率。值得注意的是,它的运行速度也显著快于传统技术。这些结果表明,RAZOR是基于变换器的视觉模型中实现安全、自适应遗忘的实用且可扩展的解决方案。
cs.CV / 262 / 2603.14822

RadarXFormer: Robust Object Detection via Cross-Dimension Fusion of 4D Radar Spectra and Images for Autonomous Driving

RadarXFormer:通过4D雷达频谱与图像的跨维融合实现鲁棒的目标检测以支持自动驾驶
Sun, Yue, Qian, Yeqiang, Wang, Zhe, Li, Tianhui, Wang, Chunxiang, Yang, Ming
Abstract
Reliable perception is essential for autonomous driving systems to operate safely under diverse real-world traffic conditions. However, camera- and LiDAR-based perception systems suffer from performance degradation under adverse weather and lighting conditions, limiting their robustness and large-scale deployment in intelligent transportation systems. Radar-vision fusion provides a promising alternative by combining the environmental robustness and cost efficiency of millimeter-wave (mmWave) radar with the rich semantic information captured by cameras. Nevertheless, conventional 3D radar measurements lack height resolution and remain highly sparse, while emerging 4D mmWave radar introduces elevation information but also brings challenges such as signal noise and large data volume. To address these issues, this paper proposes RadarXFormer, a 3D object detection framework that enables efficient cross-modal fusion between 4D radar spectra and RGB images. Instead of relying on sparse radar point clouds, RadarXFormer directly leverages raw radar spectra and constructs an efficient 3D representation that reduces data volume while preserving complete 3D spatial information. The "X" highlights the proposed cross-dimension (3D-2D) fusion mechanism, in which multi-scale 3D spherical radar feature cubes are fused with complementary 2D image feature maps. Experiments on the K-Radar dataset demonstrate improved detection accuracy and robustness under challenging conditions while maintaining real-time inference capability.
Chinese Translation
可靠的感知对于自动驾驶系统在多样化的现实交通条件下安全运行至关重要。然而,基于摄像头和激光雷达的感知系统在恶劣天气和光照条件下性能下降,限制了其鲁棒性和在智能交通系统中的大规模应用。雷达与视觉的融合将毫米波(mmWave)雷达的环境鲁棒性和成本效益与摄像头捕获的丰富语义信息相结合,提供了一种有前景的替代方案。然而,传统的3D雷达测量缺乏高度(俯仰)分辨率且非常稀疏,而新兴的4D毫米波雷达引入了高度信息,但也带来了信号噪声和数据量大的挑战。为了解决这些问题,本文提出了RadarXFormer,一个3D目标检测框架,能够高效地实现4D雷达频谱与RGB图像之间的跨模态融合。RadarXFormer不依赖于稀疏的雷达点云,而是直接利用原始雷达频谱,构建一个高效的3D表示,在减少数据量的同时保留完整的3D空间信息。“X”突出了所提出的跨维(3D-2D)融合机制,其中多尺度3D球形雷达特征立方体与互补的2D图像特征图进行融合。在K-Radar数据集上的实验表明,在具有挑战性的条件下检测精度和鲁棒性均得到改善,同时保持了实时推理能力。
cs.CV / 263 / 2603.14825

Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

一箭双雕:通过推理时特征投影协调大型视觉语言模型的安全性与实用性
Han, Yewon, Seol, Yumin, Kong, EunGyung, Jo, Minsoo, Kim, Taesup
Abstract
Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety-utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks. In this work, we investigate whether safety and utility are inherently antagonistic objectives. We focus on a modality-induced bias direction consistently observed across datasets, which arises from suboptimal coupling between the Large Language Model backbone and visual encoders. We further demonstrate that this direction undermines performance on both tasks. Leveraging this insight, we propose Two Birds, One Projection, an efficient inference-time jailbreak defence that projects cross-modal features onto the null space of the identified bias direction to remove the corresponding components. Requiring only a single forward pass, our method effectively breaks the conventional tradeoff, simultaneously improving both safety and utility across diverse benchmarks.
Chinese Translation
现有的大型视觉语言模型(LVLM)的防越狱框架往往面临安全性与实用性之间的权衡,即增强安全性无意中降低了在一般视觉基础推理任务上的表现。在本研究中,我们探讨了安全性与实用性是否本质上是对立的目标。我们关注在不同数据集中一致观察到的一种由模态诱导的偏向方向,这种方向源于大型语言模型主干与视觉编码器之间的亚最优耦合。我们进一步证明,这一方向损害了两者的任务性能。基于这一洞察,我们提出了一种名为“一箭双雕”(Two Birds, One Projection)的高效推理时防越狱方法,该方法将跨模态特征投影到识别偏向方向的零空间,以去除相应的成分。该方法只需一次前向传播,即有效打破了传统的权衡,同时在不同基准测试中提高了安全性和实用性。
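The defence reduces to an inference-time projection of cross-modal features onto the orthogonal complement (null space) of the identified bias direction; a sketch with illustrative shapes:

```python
import torch

def remove_bias_direction(features: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """x <- x - (x . d_hat) d_hat, removing the component along the bias."""
    d_hat = d / d.norm()
    coeff = features @ d_hat                  # component along the direction
    return features - coeff.unsqueeze(-1) * d_hat

feats = torch.randn(4, 77, 4096)              # cross-modal tokens (illustrative)
bias = torch.randn(4096)                      # identified bias direction
clean = remove_bias_direction(feats, bias)
# The projected features are (numerically) orthogonal to the bias direction.
assert torch.allclose(clean @ (bias / bias.norm()),
                      torch.zeros(4, 77), atol=1e-3)
```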
cs.CV / 264 / 2603.14827

SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space

SemanticFace:通过可解释空间中的语义蒸馏进行语义面部动作估计
Kang, Zejian, Zheng, Kai, Fei, Yuanchen, Yang, Wentao, Zou, Hongyuan, Huang, Xiangru
Abstract
Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
Chinese Translation
从单幅图像中进行面部动作估计通常被表述为在紧凑的表情空间中预测或拟合参数,这些参数缺乏明确的语义可解释性。然而,许多实际应用,如虚拟形象控制和人机交互,要求可解释的面部动作与有意义的肌肉运动相对应。在本研究中,我们提出了SemanticFace,一个在可解释的ARKit混合形状空间中进行面部动作估计的框架,该框架将系数预测重新表述为结构化的语义推理。SemanticFace采用了两阶段的语义蒸馏范式:首先从真实的ARKit系数中推导出结构化的语义监督,然后将这一知识蒸馏到多模态大型语言模型中,以从图像中预测可解释的面部动作系数。大量实验表明,语言对齐的语义监督提高了系数的准确性和感知一致性,同时增强了跨身份的泛化能力和对大领域转移的鲁棒性,包括卡通面孔。
cs.CV / 265 / 2603.14832

Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis

迈向三维:集成2.5D和3D模型以实现稳健的COVID-19 CT诊断
Yang, Tuan-Anh, Bui, Bao V. Q., Vo-Van, Chanh-Quang, Hy, Truong-Son
Abstract
We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS-AIF-MIH
Chinese Translation
我们提出了一种深度学习框架,用于从胸部CT扫描中检测COVID-19和进行疾病分类,该框架整合了2.5D和3D表示,以捕捉互补的切片级和体积信息。2.5D分支使用DINOv3视觉变换器处理多视角CT切片(轴向、冠状、矢状),以提取稳健的视觉特征,而3D分支采用ResNet-18架构来建模体积上下文,并通过方差风险外推(Variance Risk Extrapolation, VREx)进行预训练,随后进行监督对比学习,以提高跨源稳健性。两个分支的预测通过logit级别的集成推理进行结合。在PHAROS-AIF-MIH基准测试中的实验表明了所提方法的有效性:在二元COVID-19检测中,集成模型达到了94.48%的准确率和0.9426的宏观F1分数,超越了各个单独模型;而在多类疾病分类中,2.5D DINOv3模型以79.35%的准确率和0.7497的宏观F1分数取得了最佳表现。这些结果突显了将预训练的基于切片的表示与体积建模相结合在稳健的多源医学影像分析中的优势。代码可在 https://github.com/HySonLab/PHAROS-AIF-MIH 获取。
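The fusion step itself is a plain logit-level combination of the two branches; the equal weighting here is an assumption, not necessarily the paper's choice:

```python
import numpy as np

def ensemble_predict(logits_25d: np.ndarray, logits_3d: np.ndarray,
                     w: float = 0.5) -> np.ndarray:
    """Fuse per-class logits from the 2.5D (DINOv3) and 3D (ResNet-18)
    branches, then take the argmax class."""
    return (w * logits_25d + (1 - w) * logits_3d).argmax(axis=-1)

preds = ensemble_predict(np.array([[2.1, -0.3]]), np.array([[1.2, 0.8]]))
```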
cs.CV / 266 / 2603.14837

DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery

DamageArbiter:一种基于CLIP增强的多模态仲裁框架,用于街景图像的飓风损害评估
Yang, Yifan, Zou, Lei, Gong, Wenjing, Fu, Kani, Li, Zongrong, Wang, Siqin, Zhou, Bing, Cai, Heng, Tian, Hao
Abstract
Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes a multimodal disagreement-driven arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, DamageArbiter, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46% compared to the strongest baseline model. Beyond improvements in overall accuracy, DamageArbiter, by arbitrating discrepancies between unimodal and multimodal predictions, mitigates common overconfidence errors in visual models relying solely on images, especially where disaster visual cues are ambiguous or subject to interference, reducing confident-but-incorrect predictions. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.
Chinese Translation
利用计算机视觉模型分析街景图像以进行快速、超本地的损害评估在应急响应和恢复中变得越来越流行且具有价值,但传统模型往往像黑箱,缺乏可解释性和可靠性。本研究提出了一种基于对比语言-图像预训练(CLIP)模型的多模态分歧驱动仲裁框架DamageArbiter,以提高从街景图像中进行损害估计的准确性、可解释性和鲁棒性。DamageArbiter利用单模态和多模态模型的互补优势,采用轻量级逻辑回归元分类器来仲裁分歧案例。使用2,556张灾后街景图像,并配以手动生成和大型语言模型(LLM)生成的文本描述,我们系统地比较了单模态模型(包括仅图像和仅文本模型)、基于CLIP的多模态模型和DamageArbiter的性能。值得注意的是,DamageArbiter将准确性从74.33%(ViT-B/32,仅图像)提高到82.79%,超过了80%的准确性阈值,并与最强基线模型相比实现了8.46%的绝对提升。除了整体准确性的提升外,与仅依赖图像的视觉模型相比,DamageArbiter通过仲裁单模态和多模态预测之间的差异,减轻了视觉模型中常见的过度自信错误,特别是在灾害视觉线索模糊或受到干扰的情况下,减少了过度自信但错误的预测。我们进一步绘制并分析了地理参考的预测和误分类,以比较不同地点的模型性能。总体而言,这项工作推动了基于街景的灾害评估,从粗略的严重性分类向更可靠和可解释的框架发展。
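A toy version of the disagreement-driven arbitration: when the unimodal and multimodal models disagree, a lightweight logistic-regression meta-classifier picks which branch to trust. The meta-features and labels below are fabricated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical meta-features on disagreement cases:
# [image-model confidence, CLIP-model confidence, confidence gap].
X_disagree = np.array([[0.9, 0.2, 0.7],
                       [0.4, 0.8, 0.4],
                       [0.7, 0.3, 0.4],
                       [0.2, 0.9, 0.7]])
y = np.array([0, 1, 0, 1])   # 0 = image branch was right, 1 = CLIP branch

arbiter = LogisticRegression().fit(X_disagree, y)
trust_multimodal = arbiter.predict([[0.85, 0.25, 0.6]])[0] == 1
```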
cs.CV / 267 / 2603.14848

Personalized Federated Learning with Residual Fisher Information for Medical Image Segmentation

基于残差费舍尔信息的个性化联邦学习在医学图像分割中的应用
Zhu, Meilu, Li, Yuxing, Wang, Zhiwei, Lam, Edmund Y.
Abstract
Federated learning enables multiple clients (institutions) to collaboratively train machine learning models without sharing their private data. To address the challenge of data heterogeneity across clients, personalized federated learning (pFL) aims to learn customized models for each client. In this work, we propose pFL-ResFIM, a novel pFL framework that achieves client-adaptive personalization at the parameter level. Specifically, we introduce a new metric, Residual Fisher Information Matrix (ResFIM), to quantify the sensitivity of model parameters to domain discrepancies. To estimate ResFIM for each client model under privacy constraints, we employ a spectral transfer strategy that generates simulated data reflecting the domain styles of different clients. Based on the estimated ResFIM, we partition model parameters into domain-sensitive and domain-invariant components. A personalized model for each client is then constructed by aggregating only the domain-invariant parameters on the server. Extensive experiments on public datasets demonstrate that pFL-ResFIM consistently outperforms state-of-the-art methods, validating its effectiveness.
Chinese Translation
联邦学习使多个客户端(机构)能够在不共享私有数据的情况下协作训练机器学习模型。为了应对客户端之间数据异质性的问题,个性化联邦学习(pFL)旨在为每个客户端学习定制化模型。在本研究中,我们提出了pFL-ResFIM,一种新颖的pFL框架,能够在参数层面实现客户端自适应个性化。具体而言,我们引入了一种新的度量标准——残差费舍尔信息矩阵(Residual Fisher Information Matrix, ResFIM),用于量化模型参数对领域差异的敏感性。为了在隐私约束下为每个客户端模型估计ResFIM,我们采用了一种谱转移策略,生成反映不同客户端领域风格的模拟数据。基于估计的ResFIM,我们将模型参数划分为领域敏感和领域不变的组件。然后,通过在服务器上仅聚合领域不变参数,为每个客户端构建个性化模型。在公共数据集上的大量实验表明,pFL-ResFIM始终优于最先进的方法,验证了其有效性。
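A rough sketch of the two ingredients: a diagonal Fisher-style sensitivity per parameter, and a server step that averages only low-sensitivity (domain-invariant) parameters. ResFIM itself, the spectral-transfer simulation, and the exact partition rule are more involved; everything here is simplified and assumed:

```python
import torch

def diagonal_fisher(model: torch.nn.Module, loss: torch.Tensor) -> dict:
    """Diagonal Fisher approximation: squared gradients of one batch's loss."""
    model.zero_grad()
    loss.backward()
    return {n: p.grad.detach() ** 2
            for n, p in model.named_parameters() if p.grad is not None}

def aggregate_domain_invariant(client_states: list, sensitivity: dict,
                               tau: float) -> dict:
    """Server step: average parameters whose mean sensitivity is below tau;
    domain-sensitive parameters stay personalized on each client."""
    shared = {}
    for name, s in sensitivity.items():
        if s.mean().item() < tau:
            shared[name] = torch.stack(
                [cs[name] for cs in client_states]).mean(dim=0)
    return shared
```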
cs.CV / 268 / 2603.14850

From Artefact to Insight: Efficient Low-Rank Adaptation of BrushNet for Scanning Probe Microscopy Image Restoration

从伪影到洞察:高效低秩适应BrushNet用于扫描探针显微镜图像恢复
Wei, Ziwei, Shen, Yao, Lu, Wanheng, Ho, Ghim Wei, Zeng, Kaiyang
Abstract
Scanning Probe Microscopy (SPM) offers nanoscale resolution but is frequently marred by structured artefacts such as line-scan dropout, gain-induced noise, tip convolution, and phase hops. While most available methods treat SPM artefact removal as isolated denoising or interpolation tasks, the generative inpainting perspective remains largely unexplored. In this work, we introduce a diffusion-based inpainting framework tailored to scientific grayscale imagery. By fine-tuning less than 0.2 percent of BrushNet weights with rank-constrained low-rank adaptation (LoRA), we adapt a pretrained diffusion model using only 7,390 artefact-clean pairs distilled from 739 experimental scans. On our forthcoming public SPM InpBench benchmark, the LoRA-enhanced model lifts the Peak Signal-to-Noise Ratio (PSNR) by 6.61 dB and halves the Learned Perceptual Image Patch Similarity (LPIPS) relative to zero-shot inference, while matching or slightly surpassing the accuracy of full retraining, trainable on a single GPU instead of four high-memory cards. The approach generalizes across various SPM image channels including height, amplitude, and phase, faithfully restores subtle structural details, and suppresses hallucination artefacts inherited from natural image priors. This lightweight framework enables efficient, scalable recovery of irreplaceable SPM images and paves the way for broader diffusion model adoption in nanoscopic imaging analysis.
Chinese Translation
扫描探针显微镜(Scanning Probe Microscopy,SPM)提供纳米级分辨率,但常常受到结构化伪影的影响,如线扫描丢失、增益引起的噪声、探针卷积和相位跳跃。虽然大多数现有方法将SPM伪影去除视为孤立的去噪或插值任务,但生成式修复的视角仍然未被充分探索。在本研究中,我们引入了一种基于扩散的修复框架,专门针对科学灰度图像。通过采用秩约束的低秩适应(Low-Rank Adaptation,LoRA)微调不到0.2%的BrushNet权重,我们仅使用从739个实验扫描中提取的7,390对伪影-干净图像对,适配了一个预训练的扩散模型。在我们即将发布的公共SPM InpBench基准测试中,LoRA增强模型将峰值信噪比(Peak Signal-to-Noise Ratio,PSNR)提高了6.61 dB,并将学习感知图像块相似性(Learned Perceptual Image Patch Similarity,LPIPS)相对于零样本推理减半,同时在准确性上与全量重训练相匹配或略有超出,且可在单个GPU上训练,而无需四张高显存显卡。该方法在各种SPM图像通道(包括高度、幅度和相位)中具有良好的泛化能力,忠实恢复微妙的结构细节,并抑制从自然图像先验中继承的幻觉伪影。该轻量级框架实现了不可替代的SPM图像的高效、可扩展恢复,并为扩散模型在纳米级成像分析中的更广泛应用铺平了道路。
cs.CV / 269 / 2603.14851

AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

AutoMoT:一种统一的视觉-语言-动作模型,采用异步混合变换器用于端到端自主驾驶
Huang, Wenhui, Zhang, Songyan, Huang, Qihang, Wang, Zhidong, Mao, Zhiqi, Chua, Collister, Chen, Zhan, Chen, Long, Lv, Chen
Abstract
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the project page: https://automot-website.github.io/
Chinese Translation
将视觉-语言模型(VLMs)集成到端到端(E2E)自主驾驶(AD)系统中,已显示出在改善场景理解方面的潜力。然而,现有的集成策略存在若干局限性:它们要么难以解决推理和动作空间之间的分布不一致,要么未能充分利用预训练VLMs的通用推理能力,或者在动作策略生成过程中引入显著的推理延迟,从而降低驾驶性能。为了解决这些挑战,我们在本工作中提出了AutoMoT,这是一个将推理和动作生成统一于单一视觉-语言-动作(VLA)模型的端到端AD框架。我们的方法利用了一种混合变换器(MoT)架构,采用联合注意力共享,既保留了预训练VLMs的通用推理能力,又通过在不同任务频率下的异步执行实现了高效的快慢推理。在多个基准测试上的广泛实验,无论是在开环还是闭环设置下,均表明AutoMoT在性能上与最先进的方法具有竞争力。我们进一步探讨了预训练VLMs在AD中的功能边界,考察了何时需要针对AD进行定制的微调。我们的结果表明,预训练VLMs仅通过语义提示就能实现具有竞争力的多任务场景理解性能,而微调对于决策和轨迹规划等动作级任务仍然至关重要。演示视频和定性结果见项目页面:https://automot-website.github.io/。
cs.CV / 270 / 2603.14856

From Horizontal to Rotated: Cross-View Object Geo-Localization with Orientation Awareness

从水平到旋转:具有方向感知的跨视角物体地理定位
Fu, Chenlin, Gong, Ao, Zhu, Yingying
Abstract
Cross-View object geo-localization (CVOGL) aims to precisely determine the geographic coordinates of a query object from a ground or drone perspective by referencing a satellite map. Segmentation-based approaches offer high precision but require prohibitively expensive pixel-level annotations, whereas more economical detection-based methods suffer from lower accuracy. This performance disparity in detection is primarily caused by two factors: the poor geometric fit of Horizontal Bounding Boxes (HBoxes) for oriented objects and the degradation in precision due to feature map scaling. Motivated by these, we propose leveraging Rotated Bounding Boxes (RBoxes) as a natural extension of the detection-based paradigm. RBoxes provide a much tighter geometric fit to oriented objects. Building on this, we introduce OSGeo, a novel geo-localization framework, meticulously designed with a multi-scale perception module and an orientation-sensitive head to accurately regress RBoxes. To support this scheme, we also construct and release CVOGL-R, the first dataset with precise RBox annotations for CVOGL. Extensive experiments demonstrate that our OSGeo achieves state-of-the-art performance, consistently matching or even surpassing the accuracy of leading segmentation-based methods but with an annotation cost that is over an order of magnitude lower.
Chinese Translation
跨视角物体地理定位(CVOGL)旨在从地面或无人机视角出发,通过参考卫星地图精确确定查询物体的地理坐标。基于分割的方法提供了高精度,但需要代价高昂的像素级标注,而更经济的基于检测的方法则面临较低的准确性。这种检测性能差异主要由两个因素造成:水平边界框(Horizontal Bounding Boxes, HBoxes)对定向物体的几何拟合较差,以及由于特征图缩放导致的精度下降。基于此,我们提出利用旋转边界框(Rotated Bounding Boxes, RBoxes)作为基于检测范式的自然扩展。RBoxes 为定向物体提供了更紧密的几何拟合。在此基础上,我们引入了 OSGeo,一个新颖的地理定位框架,精心设计了多尺度感知模块和方向敏感头,以准确回归 RBoxes。为了支持这一方案,我们还构建并发布了 CVOGL-R,这是第一个具有精确 RBox 标注的 CVOGL 数据集。大量实验表明,我们的 OSGeo 达到了最先进的性能,始终与领先的基于分割的方法的准确性相匹配,甚至超越,但其标注成本却低了一个数量级以上。
cs.CV / 271 / 2603.14861

Video Detector: A Dual-Phase Vision-Based System for Real-Time Traffic Intersection Control and Intelligent Transportation Analysis

视频检测器:一种基于视觉的双阶段实时交通交叉口控制与智能交通分析系统
Şen, Mustafa Fatih, Gümüşkaya, Halûk, Pazar, Şenol
Abstract
Urban traffic management increasingly requires intelligent sensing systems capable of adapting to dynamic traffic conditions without costly infrastructure modifications. Vision-based vehicle detection has therefore become a key technology for modern intelligent transportation systems. This study presents Video Detector (VD), a dual-phase vision-based traffic intersection management system designed as a flexible and cost-effective alternative to traditional inductive loop detectors. The framework integrates a real-time module (VD-RT) for intersection control with an offline analytical module (VD-Offline) for detailed traffic behavior analysis. Three system configurations were implemented using SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN, trained on datasets totaling 108,000 annotated images across 6-10 vehicle classes. Experimental results show detection performance of up to 90% test accuracy and 29.5 mAP@0.5, while maintaining real-time throughput of 37 FPS on HD video streams. Field deployments conducted in collaboration with Istanbul IT and Smart City Technologies Inc. (ISBAK) demonstrate stable operation under diverse environmental conditions. The system supports virtual loop detection, vehicle counting, multi-object tracking, queue estimation, speed analysis, and multiclass vehicle classification, enabling comprehensive intersection monitoring without the need for embedded road sensors. The annotated dataset and training pipeline are publicly released to support reproducibility. These results indicate that the proposed framework provides a scalable and deployable vision-based solution for intelligent transportation systems and smart-city traffic management.
Chinese Translation
城市交通管理日益需要能够适应动态交通条件的智能感知系统,而无需进行昂贵的基础设施改造。因此,基于视觉的车辆检测已成为现代智能交通系统的关键技术。本研究提出了视频检测器(Video Detector, VD),这是一种双阶段的基于视觉的交通交叉口管理系统,旨在作为传统感应环检测器的灵活且具有成本效益的替代方案。该框架集成了一个实时模块(VD-RT)用于交叉口控制,以及一个离线分析模块(VD-Offline)用于详细的交通行为分析。我们实施了三种系统配置,分别使用SSD Inception v2、Faster R-CNN Inception v2和CenterNet ResNet-50 V1 FPN,训练数据集总计108,000张标注图像,涵盖6-10个车辆类别。实验结果显示,检测性能达到90%的测试准确率和29.5 mAP@0.5,同时在高清(HD)视频流上保持37 FPS的实时吞吐量。与伊斯坦布尔信息技术与智能城市技术公司(ISBAK)合作进行的现场部署展示了在多种环境条件下的稳定运行。该系统支持虚拟环检测、车辆计数、多目标跟踪、排队估计、速度分析和多类别车辆分类,实现了全面的交叉口监控,而无需嵌入式道路传感器。标注数据集和训练流程已公开发布,以支持可重复性。这些结果表明,所提出的框架为智能交通系统和智慧城市交通管理提供了一种可扩展且可部署的基于视觉的解决方案。
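At its simplest, the virtual-loop counting mentioned above reduces to detecting when a tracked vehicle's center crosses a counting line between consecutive frames. The sketch below is our own toy illustration of that idea (the identifiers and the downward-crossing convention are assumptions, not the VD system's API):

```python
# Toy virtual-loop counter (illustrative only, not the VD system's code):
# count a tracked vehicle when its center crosses a horizontal counting line.
def count_line_crossings(tracks, line_y):
    """tracks: {track_id: [(x, y), ...]} per-frame center positions."""
    count = 0
    for centers in tracks.values():
        for (_, y0), (_, y1) in zip(centers, centers[1:]):
            if y0 < line_y <= y1:   # crossed the counting line downward
                count += 1
                break               # count each vehicle at most once
    return count

tracks = {
    1: [(100, 180), (102, 195), (105, 212)],  # crosses y=200
    2: [(300, 150), (301, 160), (303, 170)],  # never crosses
}
print(count_line_crossings(tracks, line_y=200))  # -> 1
```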
cs.CV / 272 / 2603.14880

RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation

RealVLG-R1:一个大规模真实世界视觉-语言基础基准,用于机器人感知与操作
Li, Linfei, Zhang, Lin, Shen, Ying
Abstract
Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in a unified manner given natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code are publicly available at https://github.com/lif314/RealVLG-R1.
Chinese Translation
视觉-语言基础旨在建立自然语言与视觉实体之间的语义对应关系,使模型能够根据文本指令准确识别和定位目标物体。现有的视觉-语言基础(VLG)方法主要集中在粗粒度的物体级定位,而传统的机器人抓取方法主要依赖几何线索,缺乏语言指导,这限制了它们在语言驱动的操作场景中的适用性。为了解决这些局限性,我们提出了RealVLG框架,该框架整合了RealVLG-11B数据集和RealVLG-R1模型,以统一真实世界的视觉-语言基础和抓取任务。RealVLG-11B数据集提供了多粒度的注释,包括边界框、分割掩码、抓取位姿、接触点以及经过人工验证的细粒度语言描述,涵盖约165,000张图像、800多个物体实例、130万个分割、检测和语言注释,以及大约110亿个抓取示例。在此数据集的基础上,RealVLG-R1在预训练的大规模视觉-语言模型上采用强化微调(Reinforcement Fine-tuning),在给定自然语言指令的情况下,以统一的方式预测边界框、分割掩码、抓取位姿和接触点。实验结果表明,RealVLG支持在真实世界未见环境中的零样本感知和操作,建立了一个统一的语义-视觉多模态基准,为语言驱动的机器人感知和抓取策略学习提供了全面的数据和评估平台。所有数据和代码均可在 https://github.com/lif314/RealVLG-R1 获取。
cs.CV / 273 / 2603.14882

LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

LLMind:一种生物启发的无训练自适应视觉表征用于视觉-语言模型
Debnath, Soumyaratna, Manh, Bui Duc, Liu, Zinan, Wang, Lin
Abstract
Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
Chinese Translation
视觉-语言模型(VLMs)通常假设视觉输入整个视野的空间保真度是均匀的,对信息量有限的区域给予相同的精度。相比之下,人类视觉既不是均匀的,也不是静态的;它是自适应的、选择性的,并且资源高效。鉴于此,我们首次系统分析了生物启发的视觉表征方法,为更高效且自适应的VLMs提供了见解。我们提出了LLMind(Looking Like the Mind),一种新颖的无训练框架,模仿人类视觉中的中央凹编码和皮层放大,以在严格的像素预算下实现VLMs的自适应、高效表征。我们的关键思想是探索一种生物启发的自适应采样策略(BASS),它使得一个莫比乌斯参数化的模块能够进行非均匀采样,同时保持全局场景结构。在BASS的基础上,我们通过测试时适应引入闭环语义反馈(CSF),以将感知显著性与来自冻结VLM的文本信息对齐。我们在多样的场景级和区域引导的视觉问答基准测试中,将LLMind与均匀采样及其他采样基线进行了比较评估。结果显示出显著的提升:在严格的像素预算下,相较于均匀采样,LLMind在VQAv2上平均提高了20%,在Seed-Bench上提高了38%,在A-OKVQA上提高了37%。更令人惊讶的是,LLMind在仅使用1%、3%和5%的像素时,分别保留了高达82%、92%和97%的全分辨率性能。此外,LLMind轻巧、即插即用,兼容现有的VLMs,无需进行结构性改变。
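As a rough intuition for foveated, budget-constrained sampling, here is our own single-fixation toy (far simpler than LLMind's Möbius-parameterized BASS module; the exponential falloff and all names are assumptions):

```python
# Toy foveated sampler (illustrative only): keep a pixel budget by sampling
# densely near a fixation point and sparsely in the periphery.
import numpy as np

def foveated_sample(shape, fix_xy, budget, sharpness=0.05, seed=0):
    """Sample `budget` pixel coordinates, densest near the fixation point."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - fix_xy[0], ys - fix_xy[1])
    weights = np.exp(-sharpness * dist).ravel()
    weights /= weights.sum()
    idx = np.random.default_rng(seed).choice(h * w, size=budget,
                                             replace=False, p=weights)
    return np.unravel_index(idx, (h, w))

rows, cols = foveated_sample((64, 64), fix_xy=(32, 32), budget=200)
# Roughly: many samples near (32, 32), few in the corners.
print(len(rows), "of", 64 * 64, "pixels kept")
```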
cs.CV / 274 / 2603.14885

SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras

SpiralDiff:基于LoRA的螺旋扩散用于跨相机的RGB到RAW转换
Yue, Huanjing, Xie, Shangbin, Cao, Cong, Wu, Qian, Zhang, Lei, Zhao, Lei, Yang, Jingyu
Abstract
RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to adapt to different camera-specific ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection. Our code and model are available at https://github.com/Chuancy-TJU/SpiralDiff.
Chinese Translation
RAW图像相比于RGB图像能够保留更高的保真度和丰富的场景信息,使其在复杂成像条件下的任务中至关重要。为了降低数据收集的高成本,最近的RGB到RAW转换方法旨在从RGB合成RAW图像。然而,这些方法忽视了两个关键挑战:(i)重建难度随着像素强度的变化而变化,以及(ii)多相机转换需要特定于相机的适应性。为了解决这些问题,我们提出了SpiralDiff,这是一种基于扩散的框架,专为RGB到RAW转换而设计,采用信号依赖的噪声加权策略,以适应不同强度水平下的重建保真度。此外,我们引入了CamLoRA,这是一个相机感知的轻量级适应模块,使统一模型能够适应不同相机特定的图像信号处理(ISP)特性。在四个基准数据集上的大量实验表明,SpiralDiff在RGB到RAW转换质量方面具有优越性,并在基于RAW的目标检测中展现了其下游优势。我们的代码和模型可在https://github.com/Chuancy-TJU/SpiralDiff获取。
cs.CV / 275 / 2603.14886

PASTE: Physics-Aware Scattering Topology Embedding Framework for SAR Object Detection

PASTE:基于物理的散射拓扑嵌入框架用于合成孔径雷达目标检测
Chen, Jiacheng, Xiong, Yuxuan, Wang, Haipeng
Abstract
Current deep learning-based object detection for Synthetic Aperture Radar (SAR) imagery mainly adopts optical image methods, treating targets as texture patches while ignoring inherent electromagnetic scattering mechanisms. Though scattering points have been studied to boost detection performance, most methods still rely on amplitude-based statistical models. Some approaches introduce frequency-domain information for scattering center extraction, but they suffer from high computation cost and poor compatibility with diverse datasets. Thus, effectively embedding scattering topological information into modern detection frameworks remains challenging. To solve these problems, this paper proposes the Physics-Aware Scattering Topology Embedding Framework (PASTE), a novel closed-loop architecture for comprehensive scattering prior integration. By building the full pipeline from topology generation, injection to joint supervision, PASTE elegantly integrates scattering physics into modern SAR detectors. Specifically, it designs a scattering keypoint generation and automatic annotation scheme based on the Attributed Scattering Center (ASC) model to produce scalable and physically consistent priors. A scattering topology injection module guides multi-scale feature learning, and a scattering prior supervision strategy constrains network optimization by aligning predictions with scattering center distributions. Experiments on real datasets show that PASTE is compatible with various detectors and brings relative mAP gains of 2.9% to 11.3% over baselines with acceptable computation overhead. Visualization of scattering maps verifies that PASTE successfully embeds scattering topological priors into feature space, clearly distinguishing target and background scattering regions, thus providing strong interpretability for results.
Chinese Translation
当前基于深度学习的合成孔径雷达(SAR)图像目标检测主要采用光学图像方法,将目标视为纹理块,同时忽视了固有的电磁散射机制。尽管散射点的研究旨在提升检测性能,但大多数方法仍依赖于基于幅度的统计模型。一些方法引入频域信息用于散射中心提取,但它们面临高计算成本和与多样化数据集兼容性差的问题。因此,将散射拓扑信息有效嵌入现代检测框架仍然具有挑战性。为了解决这些问题,本文提出了基于物理的散射拓扑嵌入框架(PASTE),这是一种新颖的闭环架构,用于全面整合散射先验。通过构建从拓扑生成、注入到联合监督的完整流程,PASTE 优雅地将散射物理学整合到现代 SAR 检测器中。具体而言,它设计了一种基于属性散射中心(ASC)模型的散射关键点生成和自动标注方案,以生成可扩展且物理一致的先验。散射拓扑注入模块引导多尺度特征学习,而散射先验监督策略通过将预测与散射中心分布对齐来约束网络优化。在真实数据集上的实验表明,PASTE 与各种检测器兼容,并在可接受的计算开销下带来了相对 mAP 增益 2.9% 至 11.3%。散射图的可视化验证了 PASTE 成功地将散射拓扑先验嵌入特征空间,清晰地区分了目标和背景散射区域,从而为结果提供了强大的可解释性。
cs.CV / 276 / 2603.14892

Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs

平衡显著性与覆盖率:面向语义显著性的视觉标记压缩预算管理
Lee, Jaehoon, Jung, Mingi, Jang, Soohyuk, Yoo, Seungryong, Jung, Dahuin, Yoon, Sungroh
Abstract
Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.
Chinese Translation
大型视觉语言模型(VLMs)通过利用高分辨率视觉输入,实现了强大的多模态理解能力,但由此产生的大量视觉标记造成了主要的计算瓶颈。近期的研究通过视觉标记压缩来缓解这一问题,通常基于显著性、多样性或两者的固定组合来压缩标记。我们观察到,语义显著性的分布在样本之间存在显著差异,导致局部显著性保持与全局覆盖之间的最佳权衡不同。这一观察表明,在所有样本上应用静态压缩策略可能并非最佳选择。基于这一见解,我们提出了PromPrune,一种样本自适应的视觉标记选择框架,由语义显著性感知的预算分配和两阶段选择管道组成。我们的方法根据每个样本的语义显著性分布,自适应地平衡局部显著性保持与全局覆盖。通过在局部显著区域和全局多样区域之间分配标记预算,我们的方法在高压缩比下仍能保持强劲的性能。在LLaVA-NeXT-7B上,我们的方法将FLOPs减少了88%,预填充延迟减少了22%,同时保持了97.5%的原始准确率。
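The saliency/coverage budget split described above can be illustrated with a toy selector (our own sketch, not PromPrune's algorithm; using saliency entropy as the "prominence" signal and greedy farthest-point filling for coverage are assumptions):

```python
# Toy saliency/coverage token selector (illustrative; not PromPrune's code).
import numpy as np

def select_tokens(features, saliency, budget):
    """features: (N, D) token features; saliency: (N,) scores; returns indices."""
    p = saliency / saliency.sum()
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # normalized, in [0, 1]
    k_salient = max(1, int(budget * (1.0 - entropy)))  # peaked saliency -> more salient picks
    order = np.argsort(-saliency)
    chosen = list(order[:k_salient])
    remaining = list(order[k_salient:])
    # Spend the rest of the budget on coverage: greedily pick tokens far from
    # everything already chosen (farthest-point selection).
    while len(chosen) < budget and remaining:
        dists = [min(np.linalg.norm(features[i] - features[j]) for j in chosen)
                 for i in remaining]
        chosen.append(remaining.pop(int(np.argmax(dists))))
    return sorted(int(i) for i in chosen)

rng = np.random.default_rng(0)
print(select_tokens(rng.normal(size=(50, 8)), rng.random(50), budget=10))
```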
cs.CV / 277 / 2603.14909

TopoVST: Toward Topology-fidelitous Vessel Skeleton Tracking

TopoVST:朝向拓扑忠实的血管骨架追踪
Liu, Yaoyu, Zhang, Minghui, He, Junjun, Gu, Yun
Abstract
Automatic extraction of vessel skeletons is crucial for many clinical applications. However, achieving topologically faithful delineation of thin vessel skeletons remains highly challenging, primarily due to frequent discontinuities and the presence of spurious skeleton segments. To address these difficulties, we propose TopoVST, a topology-fidelitious vessel skeleton tracker. TopoVST constructs multi-scale sphere graphs to sample the input image and employs graph neural networks to jointly estimate tracking directions and vessel radii. The utilization of multi-scale representations is enhanced through a gating-based feature fusion mechanism, while the issue of class imbalance during training is mitigated by embedding a geometry-aware weighting scheme into the directional loss. In addition, we design a wave-propagation-based skeleton tracking algorithm that explicitly mitigates the generation of spurious skeletons through space-occupancy filtering. We evaluate TopoVST on two vessel datasets with different geometries. Extensive comparisons with state-of-the-art baselines demonstrate that TopoVST achieves competitive performance in both overlapping and topological metrics. Our source code is available at: https://github.com/EndoluminalSurgicalVision-IMR/TopoVST.
Chinese Translation
自动提取血管骨架对许多临床应用至关重要。然而,对细血管骨架进行拓扑忠实的描绘仍然极具挑战性,主要原因在于经常出现的不连续性和伪骨架段的存在。为了解决这些问题,我们提出了TopoVST,一个拓扑忠实的血管骨架追踪器。TopoVST构建多尺度球体图来对输入图像进行采样,并采用图神经网络联合估计追踪方向和血管半径。通过基于门控的特征融合机制,增强了对多尺度表示的利用;并通过在方向损失中嵌入几何感知加权方案,缓解了训练过程中的类别不平衡问题。此外,我们设计了一种基于波传播的骨架追踪算法,通过空间占用过滤明确抑制伪骨架的生成。我们在两个具有不同几何形状的血管数据集上评估了TopoVST。与最先进的基线进行的广泛比较表明,TopoVST在重叠和拓扑指标上都达到了具有竞争力的性能。我们的源代码可在以下网址获取:https://github.com/EndoluminalSurgicalVision-IMR/TopoVST。
cs.CV / 278 / 2603.14915

ILV: Iterative Latent Volumes for Fast and Accurate Sparse-View CT Reconstruction

ILV:用于快速准确稀疏视图CT重建的迭代潜在体积
Lee, Seungryong, Baek, Woojeong, Lee, Joosang, Park, Eunbyung
Abstract
A long-term goal in CT imaging is to achieve fast and accurate 3D reconstruction from sparse-view projections, thereby reducing radiation exposure, lowering system cost, and enabling timely imaging in clinical workflows. Recent feed-forward approaches have shown strong potential toward this overarching goal, yet their results still suffer from artifacts and loss of fine details. In this work, we introduce Iterative Latent Volumes (ILV), a feed-forward framework that integrates data-driven priors with classical iterative reconstruction principles to overcome key limitations of prior feed-forward models in sparse-view CBCT reconstruction. At its core, ILV constructs an explicit 3D latent volume that is repeatedly updated by conditioning on multi-view X-ray features and the learned anatomical prior, enabling the recovery of fine structural details beyond the reach of prior feed-forward models. In addition, we develop and incorporate several key architectural components, including an X-ray feature volume, group cross-attention, efficient self-attention, and view-wise feature aggregation, that efficiently realize its core latent volume refinement concept. Extensive experiments on a large-scale dataset of approximately 14,000 CT volumes demonstrate that ILV significantly outperforms existing feed-forward and optimization-based methods in both reconstruction quality and speed. These results show that ILV enables fast and accurate sparse-view CBCT reconstruction suitable for clinical use. The project page is available at: https://sngryonglee.github.io/ILV/.
Chinese Translation
CT成像的长期目标是从稀疏视图投影中实现快速且准确的3D重建,从而减少辐射暴露、降低系统成本,并在临床工作流程中实现及时成像。最近的前馈方法在实现这一总体目标方面显示出强大的潜力,但其结果仍然受到伪影和细节丢失的影响。在本研究中,我们引入了迭代潜在体积(ILV),这是一种前馈框架,将数据驱动的先验与经典的迭代重建原理相结合,以克服先前前馈模型在稀疏视图CBCT重建中的关键限制。ILV的核心是构建一个明确的3D潜在体积,该体积通过条件化多视图X射线特征和学习到的解剖先验进行反复更新,从而能够恢复超出先前前馈模型范围的细微结构细节。此外,我们开发并整合了几个关键的架构组件,包括X射线特征体积、组交叉注意力、高效自注意力和视图级特征聚合,这些组件有效地实现了其核心潜在体积精炼概念。在一个约14,000个CT体积的大规模数据集上进行的广泛实验表明,ILV在重建质量和速度上显著优于现有的前馈和基于优化的方法。这些结果表明,ILV能够实现适合临床使用的快速准确的稀疏视图CBCT重建。项目页面可访问:https://sngryonglee.github.io/ILV/
cs.CV / 279 / 2603.14916

EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing

EditHF-1M:百万规模丰富的人类偏好反馈用于图像编辑
Xu, Zitong, Duan, Huiyu, Ji, Zhongpeng, Zhang, Xinyun, Liu, Yutao, Min, Xiongkuo, Gu, Ke, Zhang, Jian, Xu, Shusong, Chen, Jinwei, Li, Bo, Zhai, Guangtao
Abstract
Recent text-guided image editing (TIE) models have achieved remarkable progress, while many edited images still suffer from issues such as artifacts, unexpected edits, and unaesthetic content. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address these challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback on image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale up image editing. Both the dataset and code will be released in our GitHub repository: https://github.com/IntMeGroup/EditHF.
Chinese Translation
近期,基于文本引导的图像编辑(TIE)模型取得了显著进展,但许多编辑后的图像仍然存在伪影、意外编辑和不美观内容等问题。尽管已有一些基准和方法被提出用于评估编辑图像,但缺乏可扩展的评估模型,这限制了图像编辑的人类反馈奖励模型的发展。为了解决这些挑战,我们首先介绍了EditHF-1M,这是一个百万规模的图像编辑数据集,包含超过2900万个人类偏好对和148,000个人类平均意见评分,这些评分从视觉质量、指令一致性和属性保留三个维度进行评估。基于EditHF-1M,我们提出了EditHF,一个基于多模态大型语言模型(MLLM)的评估模型,以提供与人类对齐的图像编辑反馈。最后,我们介绍了EditHF-Reward,它利用EditHF作为奖励信号,通过强化学习优化基于文本引导的图像编辑模型。大量实验表明,EditHF在与人类偏好的对齐方面表现优越,并在其他数据集上展示了强大的泛化能力。此外,我们使用EditHF-Reward对Qwen-Image-Edit进行了微调,取得了显著的性能提升,这证明了EditHF作为奖励模型扩展图像编辑的能力。数据集和代码将发布在我们的GitHub仓库:https://github.com/IntMeGroup/EditHF。
cs.CV / 280 / 2603.14920

$\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling

$\text{F}^2\text{HDR}$:通过流适配器和物理运动建模的两阶段HDR视频重建
Yue, Huanjing, Li, Dawei, Tu, Shaoxiong, Yang, Jingyu
Abstract
Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\text{F}^2\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a physical motion modeling to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that $\text{F}^2\text{HDR}$ achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations.
Chinese Translation
从交替曝光的低动态范围(LDR)帧序列重建高动态范围(HDR)视频仍然具有很大的挑战性,特别是在动态场景下,跨曝光不一致和复杂运动使得帧间对齐变得困难,导致鬼影和细节丢失。现有方法往往面临不准确的对齐、次优的特征聚合以及在运动主导区域重建质量下降的问题。为了解决这些挑战,我们提出了$\text{F}^2\text{HDR}$,一个两阶段的HDR视频重建框架,能够稳健地感知帧间运动并在复杂动态场景中恢复细节。该框架集成了一个将通用光流适配于稳健跨曝光对齐的流适配器、用于识别显著运动区域的物理运动建模,以及一个在去除鬼影和噪声的同时聚合互补信息的运动感知细化网络。大量实验表明,$\text{F}^2\text{HDR}$在真实世界HDR视频基准测试中实现了最先进的性能,在大运动和曝光变化下生成无鬼影且高保真的结果。
cs.CV / 281 / 2603.14925

Workflow-Aware Structured Layer Decomposition for Illustration Production

面向工作流的结构化层分解用于插图制作
Zhang, Tianyu, Li, Dongchi, Sawada, Keiichi, Xie, Haoran
Abstract
Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulates the standard anime production workflow. Experiments demonstrate that our method achieves accurate and visually coherent layer decompositions. We believe that the resulting layered representation further enables downstream tasks such as recoloring and texture embedding, supporting content creation and illustration editing. Code is available at: https://github.com/zty0304/Anime-layer-decomposition
Chinese Translation
近期的生成图像编辑方法采用分层表示,以减轻光栅图像的纠缠特性并提高可控性,通常依赖于基于对象的分割。然而,这些策略可能无法捕捉人类创作图像的结构性和风格化特征,例如动漫插图。为了解决这一问题,我们提出了一种面向工作流的结构化层分解框架,专门针对动漫艺术作品的插图制作。受到动漫制作创作流程的启发,我们的方法将插图分解为语义上有意义的制作层,包括线条艺术、平面颜色、阴影和高光。为了解耦所有这些层,我们引入了轻量级的层语义嵌入,为每个层提供特定任务的指导。此外,我们还引入了一组层级损失,以监督各个层的训练过程。为了克服缺乏真实分层数据的问题,我们构建了一个高质量的插图数据集,模拟了标准动漫制作工作流程。实验表明,使用我们的方法可以实现准确且视觉一致的层分解。我们相信,所得到的分层表示进一步支持了下游任务,如重新上色和嵌入纹理,促进内容创作和插图编辑。代码可在以下链接获取:https://github.com/zty0304/Anime-layer-decomposition
cs.CV / 282 / 2603.14935

Video-CoE: Reinforcing Video Event Prediction via Chain of Events

视频事件预测的强化:事件链方法(Video-CoE)
Su, Qile, Tang, Jing, Chen, Rui, Sun, Lei, Chu, Xiangxiang
Abstract
Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future events prediction and insufficient utilization of visual information. To address these challenges, we propose \textbf{C}hain \textbf{o}f \textbf{E}vents (\textbf{CoE}) paradigm, which constructs temporal event chains to implicitly enforce MLLM focusing on the visual content and the logical connections between videos and future events, incentivizing model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Code and models will be released soon.
Chinese Translation
尽管在多模态大语言模型(MLLMs)应用于各种视频任务方面取得了进展,视频事件预测(VEP)仍然相对未被充分探索。VEP要求模型对视频进行细粒度的时间建模,并建立视频与未来事件之间的逻辑关系,而当前的MLLMs在这方面仍然存在困难。在本研究中,我们首先对当前领先的MLLMs在VEP任务上的表现进行了全面评估,揭示了其预测不准确的原因,包括缺乏对未来事件预测的逻辑推理能力以及对视觉信息的利用不足。为了解决这些挑战,我们提出了事件链(Chain of Events,CoE)范式,该范式构建时间事件链,以隐式地促使MLLM关注视觉内容及视频与未来事件之间的逻辑连接,并通过多种训练协议激励模型的推理能力。公共基准上的实验结果表明,我们的方法在VEP任务上超越了领先的开源和商业MLLMs,建立了该任务的新最先进水平。代码和模型将很快发布。
cs.CV / 283 / 2603.14936

Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework

文本到图像扩散中的相关反馈:一种无训练且模型无关的交互框架
Wang, Wenxi, Liu, Hongbin, Li, Mingqian, Yuan, Junyan, Zhang, Junqi
Abstract
Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load and interpretable preference inference while remaining training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user's true visual intent, significantly outperforming baselines in preference alignment.
Chinese Translation
使用扩散模型进行文本到图像生成已取得显著成功。然而,用户通常具有明确的视觉意图,但难以用语言准确表达,从而导致模糊的提示和不匹配的图像。现有方法在弥合这一差距方面面临挑战,通常依赖于高负荷的文本对话、不透明的黑箱推理或昂贵的微调。它们未能同时实现低认知负荷、可解释的偏好推断,并保持无训练和模型无关。为此,我们提出了RFD,一种将信息检索中的相关反馈机制适配到扩散模型的交互框架。在RFD中,用户用隐式的多选视觉反馈替代显式的文本对话,以最小化认知负荷,轻松表达复杂的多维偏好。为了将反馈转化为精确的生成指导,我们构建了一个专家策划的特征库,并引入信息论加权累积偏好分析。这种白箱方法从当前轮次的反馈中计算偏好,并逐步累积,避免了历史交互的串联,并防止了因上下文过长而导致的推理退化。此外,RFD采用概率采样机制进行提示重构,以平衡利用与探索,防止输出同质化。重要的是,RFD完全在外部文本空间内操作,使其作为一种通用的即插即用解决方案,严格无训练且模型无关。大量实验表明,RFD有效捕捉用户的真实视觉意图,在偏好对齐方面显著优于基线。
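As a rough sketch of round-wise preference accumulation, consider the Rocchio-style toy below (our own simplification under an assumed decay weighting; RFD's actual analysis is information-theoretic and operates over an expert-curated feature repository):

```python
# Toy relevance-feedback accumulator (illustrative, not RFD's code):
# maintain a running preference vector over feature tags, updated from each
# round's multi-select feedback instead of concatenating the full history.
def update_preferences(prefs, selected_tags, round_weight=1.0, decay=0.8):
    """prefs: {tag: weight}; selected_tags: tags of images picked this round."""
    prefs = {tag: w * decay for tag, w in prefs.items()}  # fade older rounds
    for tag in selected_tags:
        prefs[tag] = prefs.get(tag, 0.0) + round_weight
    return prefs

prefs = {}
prefs = update_preferences(prefs, ["warm lighting", "portrait"])
prefs = update_preferences(prefs, ["warm lighting", "film grain"])
# "warm lighting" accumulates across rounds and would dominate the next prompt.
print(sorted(prefs.items(), key=lambda kv: -kv[1]))
```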
cs.CV / 284 / 2603.14938

FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving

FAR-Drive:闭环自主驾驶中的帧自回归视频生成
Li, Yaoru, Landi, Federico, Godi, Marco, Jin, Xin, Fu, Ruiju, Ma, Yufei, Sun, Muyang, Si, Heyu, Guo, Qi
Abstract
Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
Chinese Translation
尽管自主驾驶领域取得了快速进展,但由于缺乏可扩展和交互式的模拟环境,对驾驶系统的可靠训练和评估仍然受到根本性的限制。近期的生成视频模型在视觉保真度方面表现出色,但大多数模型运行于开环设置,无法支持智能体动作与环境演化之间的细粒度帧级交互。构建一个基于学习的闭环自主驾驶模拟器面临三个主要挑战:保持长时域的时间和跨视角一致性,减轻迭代自条件下的自回归退化,以及满足低延迟推理的约束。在本研究中,我们提出FAR-Drive,这是一种用于自主驾驶的帧级自回归视频生成框架。我们引入了一种多视角扩散变换器,配备细粒度的结构化控制,从而实现几何一致的多摄像头生成。为了解决长时域一致性和迭代退化问题,我们设计了一种包含自适应参考时域条件化和混合强制自回归训练的两阶段训练策略,逐步改善自条件下的一致性和鲁棒性。为了满足低延迟交互需求,我们进一步集成了系统级效率优化以加速推理。我们在nuScenes数据集上的实验表明,我们的方法在现有闭环自主驾驶模拟方法中取得了最先进的性能,同时在单个GPU上保持亚秒级延迟。
cs.CV / 285 / 2603.14948

Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

桥接场景生成与规划:通过统一视觉与运动表征实现世界模型驱动
Gui, Xingtai, Zhang, Meijie, Yan, Tianyi, Han, Wencheng, Gong, Jiahao, Tan, Feiyang, Xu, Cheng-zhong, Shen, Jianbing
Abstract
End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
Chinese Translation
端到端的自动驾驶旨在从原始传感器输入生成安全且合理的规划策略。驾驶世界模型在通过预测驾驶场景的未来演变来学习丰富的表征方面展现了巨大潜力。然而,现有的驾驶世界模型主要集中于视觉场景表征,运动表征并未明确设计为可共享和可继承的,导致视觉场景生成的优化与精确运动规划的需求之间存在鸿沟。我们提出了WorldDrive,这是一个整体框架,通过统一视觉与运动表征将场景生成与实时规划结合起来。我们首先介绍了一种轨迹感知的驾驶世界模型,该模型基于轨迹词汇进行条件化,以确保视觉动态与运动意图之间的一致性,从而使得在特定轨迹条件下生成多样且合理的未来场景成为可能。我们将视觉和运动编码器转移到下游的多模态规划器,确保驾驶策略在经过场景生成预优化的成熟表征上运行。运动表征、视觉表征与自我状态之间的简单交互可以生成高质量的多模态轨迹。此外,为了利用世界模型的前瞻性,我们提出了一种未来感知奖励机制,该机制从冻结的世界模型中提取未来潜在表征,以实时评估和选择最佳轨迹。在NAVSIM、NAVSIM-v2和nuScenes基准上的广泛实验表明,WorldDrive在视觉单一方法中实现了领先的规划性能,同时保持高保真度的动作控制视频生成能力,为统一视觉与运动表征在稳健自动驾驶中的有效性提供了有力证据。
cs.CV / 286 / 2603.14951

GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM

GT-PCQA:基于几何-纹理解耦的点云质量评估方法与多模态大语言模型
Zhang, Guohua, Jin, Jian, Liu, Meiqin, Yao, Chao, Lin, Weisi, Zhao, Yao
Abstract
With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to PCQA remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.
Chinese Translation
随着多模态大语言模型(MLLM)的快速发展,基于MLLM的图像质量评估(IQA)方法展现出良好的泛化能力。然而,直接将这些基于MLLM的IQA方法扩展到点云质量评估(PCQA)仍然面临挑战。一方面,现有的PCQA数据集规模有限,阻碍了MLLM的稳定和有效的指令调优。另一方面,由于大规模的图像-文本预训练,MLLM往往依赖于以纹理为主导的推理,对于PCQA中至关重要的几何结构退化的敏感性不足。为了解决这些问题,我们提出了一种新颖的基于MLLM的无参考PCQA框架,称为GT-PCQA,该框架基于两个关键策略。首先,为了在稀缺的PCQA监督下实现稳定和有效的指令调优,提出了一种2D-3D联合训练策略。该策略将PCQA形式化为相对质量比较问题,以统一大规模IQA数据集与有限的PCQA数据集。它结合了一种参数高效的低秩适应(LoRA)方案以支持指令调优。其次,提出了一种几何-纹理解耦策略,该策略将双提示机制与交替优化方案相结合,以减轻预训练MLLM固有的纹理主导偏差,同时增强对几何结构退化的敏感性。大量实验表明,GT-PCQA实现了竞争力的性能,并展现出强大的泛化能力。
cs.CV / 287 / 2603.14952

Pansharpening for Thin-Cloud Contaminated Remote Sensing Images: A Unified Framework and Benchmark Dataset

薄云污染遥感图像的全色融合:统一框架与基准数据集
Du, Songcheng, Zou, Yang, Li, Jiaxin, Liu, Mingxuan, Li, Ying, Shang, Changjing, Shen, Qiang
Abstract
Pansharpening under thin cloudy conditions is a practically significant yet rarely addressed task, challenged by simultaneous spatial resolution degradation and cloud-induced spectral distortions. Existing methods often address cloud removal and pansharpening sequentially, leading to cumulative errors and suboptimal performance due to the lack of joint degradation modeling. To address these challenges, we propose a Unified Pansharpening Model with Thin Cloud Removal (Pan-TCR), an end-to-end framework that integrates physical priors. Motivated by theoretical analysis in the frequency domain, we design a frequency-decoupled restoration (FDR) block that disentangles the restoration of multispectral image (MSI) features into amplitude and phase components, each guided by complementary degradation-robust prompts: the near-infrared (NIR) band amplitude for cloud-resilient restoration, and the panchromatic (PAN) phase for high-resolution structural enhancement. To ensure coherence between the two components, we further introduce an interactive inter-frequency consistency (IFC) module, enabling cross-modal refinement that enforces consistency and robustness across frequency cues. Furthermore, we introduce the first real-world thin-cloud contaminated pansharpening dataset (PanTCR-GF2), comprising paired clean and cloudy PAN-MSI images, to enable robust benchmarking under realistic conditions. Extensive experiments on real-world and synthetic datasets demonstrate the superiority and robustness of Pan-TCR, establishing a new benchmark for pansharpening under realistic atmospheric degradations.
Chinese Translation
在薄云条件下进行全色融合是一项具有实际意义但鲜有研究的任务,面临着空间分辨率下降和云引起的光谱失真等挑战。现有方法通常顺序地处理云去除和全色融合,导致累积误差,并由于缺乏联合退化建模而造成次优性能。为了解决这些挑战,我们提出了一种薄云去除的统一全色融合模型(Pan-TCR),这是一个集成物理先验的端到端框架。基于频域的理论分析,我们设计了一个频率解耦恢复(FDR)模块,将多光谱图像(MSI)特征的恢复分解为幅度和相位分量,每个分量都由互补的抗退化提示引导:近红外(NIR)波段幅度用于抗云恢复,而全色(PAN)相位用于高分辨率结构增强。为了确保这两个分量之间的一致性,我们进一步引入了一个交互式频率间一致性(IFC)模块,实现跨模态细化,从而在频率线索之间强制一致性和鲁棒性。此外,我们还引入了第一个真实世界的薄云污染全色融合数据集(PanTCR-GF2),该数据集包括成对的干净和有云的PAN-MSI图像,以便在真实条件下进行稳健的基准测试。在真实世界和合成数据集上的广泛实验证明了Pan-TCR的优越性和鲁棒性,为真实大气退化条件下的全色融合建立了新的基准。
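The frequency-domain intuition above can be checked with a few lines of toy code (our own illustration, not the FDR block): splitting an image into Fourier amplitude and phase and recombining them is lossless, which is what makes restoring the two components separately a well-posed decomposition.

```python
# Toy amplitude/phase decoupling (illustrative only): split an image into
# Fourier amplitude and phase, then recombine; the round trip is exact.
import numpy as np

def split_amplitude_phase(img):
    spec = np.fft.fft2(img)
    return np.abs(spec), np.angle(spec)

def recombine(amplitude, phase):
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

img = np.random.default_rng(0).random((32, 32))
amp, pha = split_amplitude_phase(img)
recon = recombine(amp, pha)
print("max reconstruction error:", float(np.abs(img - recon).max()))  # ~1e-15
```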
cs.CV / 288 / 2603.14953

Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

基于合成监督的学习问题感知关键帧选择用于视频问答
Kwon, Minchan, Shon, Hyounguk, Kim, Junmo
Abstract
Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.
Chinese Translation
大型多模态模型(LMMs)最近在视频问答(VideoQA)中表现出色,但由于推理成本高和信息稀释,视频推理仍然具有挑战性。关键帧选择提供了效率和更清晰的推理,但仅依赖图像-文本相似性时,面临稀疏监督和冗余帧选择的问题。我们提出了一种问题感知的关键帧选择框架,包含两个组成部分:来自LMMs的伪关键帧标签,提供了有信息量的监督,以及促进时间上多样化、互补证据的覆盖正则化。在NExT-QA上的实验表明,我们的方法显著提高了准确性,尤其是在时间和因果问题类型上,确立了关键帧选择作为VideoQA中有效且可学习的模块。
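A toy version of question-aware selection with a coverage term might look like the sketch below (our own simplification; the score form, the temporal-gap coverage measure, and the greedy loop are assumptions, not the paper's method):

```python
# Toy keyframe selector (illustrative only): pick frames by question-frame
# similarity plus a coverage term that pushes picks apart in time.
import numpy as np

def select_keyframes(sim, k, coverage_weight=0.5):
    """sim: (T,) question-frame similarity scores; returns k frame indices."""
    T = len(sim)
    chosen = [int(np.argmax(sim))]
    while len(chosen) < k:
        # Temporal gap to the nearest already-chosen frame, normalized by T.
        gap = np.array([min(abs(t - c) for c in chosen) for t in range(T)]) / T
        score = sim + coverage_weight * gap
        score[chosen] = -np.inf            # never re-pick a chosen frame
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)

sim = np.array([0.1, 0.9, 0.85, 0.2, 0.3, 0.8, 0.1, 0.7])
# Spreads picks across time instead of clustering around frames 1 and 2.
print(select_keyframes(sim, k=3))  # -> [1, 5, 7]
```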
cs.CV / 289 / 2603.14957

CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

CyCLeGen:视觉基础模型中的循环一致布局预测与图像生成
Shan, Xiaojun, Shen, Haoyu, Mao, Yucheng, Zhang, Xiang, Anand, Abhay, Li, Bingnan, Xu, Haiyang, Tu, Zhuowen
Abstract
We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
Chinese Translation
我们提出了CyCLeGen,这是一种统一的视觉-语言基础模型,能够在单一自回归框架内同时进行图像理解和图像生成。与依赖于独立模块进行感知和合成的现有视觉模型不同,CyCLeGen采用了完全集成的架构,通过图像->布局->图像和布局->图像->布局的生成循环强制执行循环一致性学习。这种统一的表述带来了两个关键优势:内省,使模型能够推理自身的生成结果,以及数据效率,允许通过在循环一致性指导下的强化学习目标下进行合成监督实现自我改进。大量实验表明,CyCLeGen在多种图像理解和生成基准测试中取得了显著的提升,突显了统一视觉-语言基础模型的潜力。
cs.CV / 290 / 2603.14965

GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

GeoNVS:基于几何的视角合成视频扩散
Kang, Minjun, Shin, Inkyu, Lee, Taeyeop, Kim, Myungchul, Kweon, In So, Yoon, Kuk-Jin
Abstract
Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.
Chinese Translation
新视角合成需要强大的三维几何一致性以及在不同视点下生成视觉上连贯图像的能力。尽管近期的相机控制视频扩散模型展现出良好的效果,但它们往往存在几何失真和相机可控性有限等问题。为了解决这些挑战,我们提出了GeoNVS,一种基于几何的新视角合成器,通过显式的三维几何指导来增强几何保真度和相机可控性。我们的关键创新是高斯点特征适配器(Gaussian Splat Feature Adapter, GS-Adapter),该适配器将输入视图的扩散特征提升为三维高斯表示,渲染受几何约束的新视角特征,并自适应地将其与扩散特征融合,以修正几何不一致的表示。与之前在输入层注入几何信息的方法不同,GS-Adapter在特征空间中操作,避免了会降低结构一致性的视角相关颜色噪声。其即插即用的设计使其无需额外训练即可与各种前馈几何模型实现零样本兼容,并且可以适配其他视频扩散骨干网络。在9个场景和18个设置下的实验表明,GeoNVS达到了最先进的性能,相较于SEVA和CameraCtrl分别提高了11.3%和14.9%,平移误差最多降低2倍,Chamfer距离最多降低7倍。
cs.CV / 291 / 2603.14974

Voronoi-based Second-order Descriptor with Whitened Metric in LiDAR Place Recognition

基于Voronoi的二阶描述符及其在激光雷达场所识别中的白化度量
Kim, Jaein, Yoo, Hee Bin, Han, Dong-Sig, Zhang, Byoung-Tak
Abstract
The pooling layer plays a vital role in aggregating local descriptors into the metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing second-order methods in LPR adhere to conventional implementations with post-normalization, rendering the descriptor unsuitable for Euclidean distance measurement. Based on the recent interpretation that associates NetVLAD with second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning through a variety of techniques. We demonstrate its performance gains through experiments conducted on the Oxford RobotCar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.
Chinese Translation
在激光雷达场所识别(LiDAR Place Recognition, LPR)中,池化层在将局部描述符聚合为可度量的全局描述符方面起着至关重要的作用。特别是,二阶池化能够捕捉局部描述符之间的高阶交互。然而,LPR中现有的二阶方法依赖于传统实现和后归一化,使描述符不适合欧几里得距离度量。基于最近将NetVLAD与二阶统计量关联的解释,我们提出将二阶池化与Voronoi单元的归纳偏置相结合。我们新颖的池化方法聚合局部描述符以形成二阶矩阵,并对全局描述符进行白化,以隐式度量马氏距离,同时保持来自Voronoi单元的聚类特性,并通过多种技术解决其在学习过程中的数值不稳定性。我们通过在Oxford RobotCar和Wild-Places基准上进行的实验展示了其性能提升,并分析了所提白化算法的数值效果。
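To ground the second-order idea, here is a minimal sketch (our own toy, using matrix square-root normalization as a common whitening-style step for second-order descriptors; the paper's whitened metric and Voronoi-conserving aggregation are more involved):

```python
# Toy second-order descriptor (illustrative, not the paper's method):
# aggregate local descriptors into a covariance matrix, then apply matrix
# square-root normalization before flattening and L2-normalizing.
import numpy as np

def second_order_descriptor(local_desc, eps=1e-5):
    """local_desc: (N, D) local descriptors -> normalized global descriptor."""
    x = local_desc - local_desc.mean(axis=0)
    cov = x.T @ x / len(x)                       # second-order statistics
    vals, vecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    sqrt_cov = vecs @ np.diag(np.sqrt(vals)) @ vecs.T  # matrix square root
    g = sqrt_cov.ravel()
    return g / np.linalg.norm(g)

rng = np.random.default_rng(0)
a = second_order_descriptor(rng.normal(size=(100, 16)))
b = second_order_descriptor(rng.normal(size=(100, 16)))
print("dim:", a.shape[0], "| distance:", round(float(np.linalg.norm(a - b)), 3))
```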
cs.CV / 292 / 2603.14989

MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

MMSpec:视觉-语言模型的推测解码基准测试
Shen, Hui, Wang, Xin, Zhang, Ping, Hsieh, Yunta, Han, Qi, Wan, Zhongwei, Zhang, Ziheng, Zhang, Jingxuan, Xiong, Jing, Liu, Ziyuan, Zhang, Yifan, Cao, Hangrui, Zhao, Chenyang, Zhang, Mi
Abstract
Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.
Chinese Translation
视觉-语言模型(VLMs)在多模态任务中表现出色,但由于模型规模大和多模态上下文长,推理延迟较高。推测解码最近作为一种有效的加速技术出现,但其在VLMs中的行为仍未得到充分理解。我们引入了MMSpec,这是第一个用于评估视觉-语言模型中推测解码的基准。MMSpec包含600个跨六个任务类别的多模态样本,并在统一的评估框架下集成了十种代表性的推测解码算法。我们的研究揭示了三个关键发现:(1)为纯文本LLMs设计的方法在多模态场景中性能下降,(2)在较大的批量大小下,视觉感知能力变得越来越重要,以及(3)仅凭吞吐量加速比并不能可靠地反映延迟性能。基于这些发现,我们提出了ViSkip,这是一种即插即用的推测解码方法,能够动态调整对视觉标记的推测,并实现了最先进的性能。
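For readers unfamiliar with the technique being benchmarked, the shared draft-then-verify skeleton of speculative decoding looks roughly like this (a generic greedy sketch with toy stand-in models; MMSpec's ten algorithms differ in how drafting and verification are done, and none of the names below come from the paper):

```python
# Generic greedy speculative decoding skeleton (illustrative only).
def speculative_decode(draft_step, verify, prompt, n_tokens, k=4):
    """draft_step: tokens -> next token (cheap); verify: tokens -> next token (target)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_tokens:
        # 1) Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_step(tokens + draft))
        # 2) Verify against the target model; keep the longest agreeing prefix.
        accepted = 0
        for i in range(k):
            if verify(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3) Always emit one target-model token so progress is guaranteed.
        tokens.append(verify(tokens))
    return tokens[len(prompt):]

# Toy models over integer "tokens": the target computes x+1 except at every
# 5th position, so most draft tokens are accepted.
target = lambda ts: ts[-1] + 1 if len(ts) % 5 else ts[-1] + 2
draft = lambda ts: ts[-1] + 1
print(speculative_decode(draft, target, prompt=[0], n_tokens=10))
```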
cs.CV / 293 / 2603.14998

Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3

基于递归网络的单目ORB-SLAM3热图像精细化与深度估计
Şahin, Hürkan, Pham, Huy Xuan, Dang, Van Huyen, Yegenoglu, Alper, Kayacan, Erdal
Abstract
Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
Chinese Translation
在GPS信号缺失和视觉环境恶化的情况下,无人机(UAV)的自主导航仍然面临挑战。为此,我们研究了将单目热成像相机作为UAV平台上的独立传感器,用于实时深度估计和同时定位与地图构建(SLAM)。为了从热图像中提取深度信息,我们提出了一种新颖的流程,采用轻量级的监督网络,并集成了递归块(RBs)以捕捉时间依赖性,从而实现更强的预测能力。该网络结合了轻量级卷积骨干网络与热图像精细化网络(T-RefNet),以精细化原始热输入并增强特征可见性。精细化后的热图像和预测的深度图被整合到ORB-SLAM3中,实现仅基于热成像的定位。与之前的方法不同,该网络在自定义的非辐射度数据集上进行训练,避免了对高成本辐射度热成像相机的需求。在数据集和无人机飞行实验中的结果表明,在低光照条件下,我们的方法在深度精度和SLAM性能上具有竞争力。在辐射度VIVID++(室内暗光)数据集上,我们的方法实现了约0.06的绝对相对误差,而基线方法超过0.11。在我们的非辐射度室内数据集中,基线误差保持在0.24以上,而我们的方法则保持在0.10以下。仅基于热成像的ORB-SLAM3的平均轨迹误差低于0.4米。
cs.CV / 294 / 2603.15003

Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning

Edit2Interp:通过少量学习将图像基础模型从空间编辑适应到视频帧插值
Rahimi, Nasrin, Yavuz, Mısra, Biner, Burak Can, Kurt, Yunus Bilge, Emirdağı, Ahmet Rasim, Aslan, Süleyman, Aydemir, Görkay, Yılmaz, M. Akın, Tekalp, A. Murat
Abstract
Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation - without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model's inherent understanding of "how objects transform" in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized.
Chinese Translation
预训练的图像编辑模型展现出强大的空间推理和对象感知转换能力,这些能力来源于数十亿的图像-文本对,但它们并不具备明确的时间建模能力。本文证明,这些空间先验可以通过最小的适应性重新利用,以解锁时间合成能力——无需引入任何特定于视频的架构或运动估计模块。我们展示了一个大型图像编辑模型(Qwen-Image-Edit),最初仅设计用于静态基于指令的编辑,可以仅通过64-256个训练样本利用低秩适应(Low-Rank Adaptation, LoRA)进行视频帧插值(Video Frame Interpolation, VFI)的适应。我们的核心贡献在于揭示该模型对“对象在静态场景中如何变换”的内在理解包含潜在的时间推理,这可以通过少量微调激活。尽管基线模型在生成连贯的中间帧时完全失败,但我们的参数高效适应成功解锁了其插值能力。我们的工作并不是与从头开始在大规模数据集上训练的任务特定VFI方法竞争,而是建立了基础图像编辑模型在时间任务中未被开发的潜力,为资源受限场景下的视频合成提供了一条数据高效的途径。这弥合了图像处理与视频理解之间的差距,暗示基础模型中的空间和时间推理可能比之前认识的更加交织。
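The adaptation mechanism named above, LoRA, can be summarized in a few lines (a generic sketch, not the paper's training code; the rank, scaling, and initialization here are common defaults, not reported values). Because the frozen weight is augmented with a low-rank product, fine-tuning on 64-256 samples only has to move the two small matrices:

```python
# Minimal LoRA sketch (generic, illustrative only): a frozen linear weight
# plus a trainable low-rank update scale * A @ B.
import numpy as np

class LoRALinear:
    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = weight                                         # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (weight.shape[0], rank))   # trainable
        self.B = np.zeros((rank, weight.shape[1]))              # trainable, init 0
        self.scale = alpha / rank

    def __call__(self, x):
        # Output is unchanged at init (B = 0); fine-tuning only moves A and B.
        return x @ (self.W + self.scale * self.A @ self.B).T

layer = LoRALinear(weight=np.eye(16))
x = np.ones((2, 16))
print(np.allclose(layer(x), x @ np.eye(16).T))  # True: identity at init
```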
cs.CV / 295 / 2603.15008

Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

线索至关重要:利用潜在视觉线索增强视频推理
Zhang, Kaixin, Li, Xiaohe, Li, Jiahao, Wu, Haohua, Zhao, Xinyu, Fan, Zide, Wang, Lei
Abstract
Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
Chinese Translation
多模态大型语言模型(MLLMs)在视频推理方面取得了显著进展,但由于对时间因果推理和基于证据的答案生成的需求,视频问答(VideoQA)仍然面临挑战。现有的端到端 MLLM 框架缺乏视觉感知与答案推导之间的明确结构化推理,导致严重的幻觉现象和较差的可解释性。现有方法也未能解决三个核心问题:忠实的视觉线索提取、关注实用性的线索过滤和端到端的线索-答案对齐。受到人类视觉认知层次结构的启发,我们提出了 ClueNet,一个线索感知的视频推理框架,采用两阶段的监督微调范式,而无需对基础模型进行广泛修改。解耦的监督将线索提取与基于链的推理对齐,而带有自适应线索过滤器的推理监督则优化高阶推理,同时配备轻量级模块以提高推理效率。在 NExT-QA、STAR 和 MVBench 上的实验表明,ClueNet 的表现优于最先进的方法,提升幅度不低于1.1%,在泛化能力、幻觉缓解、推理效率和跨骨干兼容性方面均表现出色。本研究弥合了 MLLM 视频理解中感知与生成之间的差距,为高风险 VideoQA 应用提供了一种可解释且忠实的推理范式。
cs.CV / 296 / 2603.15011

Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

分子标识符视觉提示与可验证强化学习在化学反应图解析中的应用
Song, Jiahe, Wang, Chuang, Wang, Yinfan, Zheng, Hao, Nie, Rui, Jiang, Bowen, Wei, Xingjian, Gao, Junyuan, Wang, Yubin, Wang, Bin, Wu, Lijun, Wu, Jiang, Yu, Qian, He, Conghui
Abstract
Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.
Chinese Translation
反应图解析(RxnDP)对于从文献中提取化学合成信息至关重要。尽管近期出现的视觉-语言模型(VLMs)作为自动化这一复杂视觉推理任务的有希望的范式,但其应用在根本上受到两个因素的制约:一是无法将视觉化学实体与预训练知识对齐,二是令牌级训练与反应级评估之间的固有差异。为了解决这两个挑战,本研究从提示表示和学习范式两个互补的角度增强基于VLM的RxnDP。首先,我们提出了标识符作为视觉提示(Identifier as Visual Prompting, IdtVP),它利用自然出现的分子标识符(例如,粗体数字如1a)来激活在VLM预训练期间获得的化学知识。IdtVP使得强大的零样本和分布外能力成为可能,超越了现有的提示策略。其次,为了进一步优化微调范式下的性能,我们引入了Re3-DAPO,这是一种利用可验证奖励直接优化反应级指标的强化学习算法,从而在标准监督微调上实现了一致的提升。此外,我们发布了ScannedRxn基准数据集,包含带有真实世界伪影的扫描历史反应图,以严格评估模型的鲁棒性和分布外能力。我们的贡献推动了基于VLM的反应图解析的准确性和泛化能力。我们将在GitHub上发布数据、模型和代码。
cs.CV / 297 / 2603.15016

Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching

黎曼运动生成:通过黎曼流匹配实现人类运动表示与生成的统一框架
Miao, Fangran, Huang, Jian, Li, Ting
Abstract
Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
Chinese Translation
人类运动生成通常是在欧几里得空间中学习的,尽管有效的运动遵循结构化的非欧几里得几何。我们提出了黎曼运动生成(Riemannian Motion Generation, RMG),这是一个在乘积流形上表示运动并通过黎曼流匹配学习动力学的统一框架。RMG将运动分解为多个流形因子,产生无尺度的表示,并具有内在归一化,同时使用测地线插值、切空间监督和保持流形的常微分方程(ODE)积分进行训练和采样。在HumanML3D数据集上,RMG在HumanML3D格式下达到了最先进的FID(0.043),并在MotionStreamer格式下的所有报告指标中排名第一。在MotionMillion数据集上,它也超越了强基线(FID 5.6,R@1 0.86)。消融实验表明,紧凑的$\mathscr{T}+\mathscr{R}$(平移 + 旋转)表示是最稳定和有效的,突出了几何感知建模作为实现高保真运动生成的实用且可扩展的途径。
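Geodesic interpolation, one of the ingredients listed above, is easy to show on the rotation factor alone (our own toy using unit quaternions; RMG's product manifold and flow-matching objective go well beyond this). Spherical linear interpolation stays on the manifold, unlike straight-line interpolation in Euclidean coordinates:

```python
# Toy geodesic interpolation on unit quaternions (slerp), illustrative only.
import numpy as np

def slerp(q0, q1, t):
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.clip(np.dot(q0, q1), -1.0, 1.0)
    if dot < 0:               # take the shorter arc
        q1, dot = -q1, -dot
    theta = np.arccos(dot)
    if theta < 1e-8:
        return q0
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

q_a = np.array([1.0, 0.0, 0.0, 0.0])                              # identity
q_b = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0])  # 90 deg about x
mid = slerp(q_a, q_b, 0.5)
print("unit norm preserved:", np.isclose(np.linalg.norm(mid), 1.0))  # True
```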
cs.CV / 298 / 2603.15019

Reference-Free Omnidirectional Stereo Matching via Multi-View Consistency Maximization

基于多视图一致性最大化的无参考全向立体匹配
Xu, Lehuai, Zhang, Weiming, Li, Yang, Du, Sidan, Wang, Lin
Abstract
Reliable omnidirectional depth estimation from multi-fisheye stereo matching is pivotal to many applications, such as embodied robotics. Existing approaches either rely on spherical sweeping with heuristic fusion strategies to build the cost columns or perform reference-centric stereo matching based on rectified views. However, these methods fail to explicitly exploit geometric relationships between multiple views, rendering them less capable of capturing the global dependencies, visibility, or scale changes. In this paper, we shift to a new perspective and propose a novel reference-free framework, dubbed FreeOmniMVS, via multi-view consistency maximization. The highlight of FreeOmniMVS is that it can aggregate pair-wise correlations into a robust, visibility-aware, and global consensus. As such, it is tolerant to occlusions, partial overlaps, and varying baselines. Specifically, to achieve global coherence, we introduce a novel View-pair Correlation Transformer (VCT) that explicitly models pairwise correlation volumes across all camera view pairs, allowing us to drop unreliable pairs caused by occlusion or out-of-focus observations. To realize scalable and visibility-aware consensus, we propose a lightweight attention mechanism that adaptively fuses the correlation vectors, eliminating the need for a designated reference view and allowing all cameras to contribute equally to the stereo matching process. Extensive experiments on diverse benchmark datasets demonstrate the superiority of our method for globally consistent, visibility-aware, and scale-aware omnidirectional depth estimation.
Chinese Translation
从多鱼眼立体匹配中可靠地估计全向深度对于许多应用(如具身机器人)至关重要。现有方法要么依赖于使用启发式融合策略进行球面扫描以构建成本列,要么基于校正视图执行以参考为中心的立体匹配。然而,这些方法未能明确利用多个视图之间的几何关系,使其在捕捉全局依赖性、可见性或尺度变化方面能力不足。本文转变了视角,提出了一种新颖的无参考框架,称为FreeOmniMVS,通过多视图一致性最大化实现。FreeOmniMVS的亮点在于它能够将成对的相关性聚合为一种鲁棒的、具有可见性意识的全局共识。因此,它对遮挡、部分重叠和变化的基线具有容忍性。具体而言,为了实现全局一致性,我们引入了一种新颖的视图对相关性Transformer(View-pair Correlation Transformer, VCT),该模块明确建模所有相机视图对之间的成对相关性体积,从而使我们能够剔除因遮挡或失焦观察而导致的不可靠视图对。为了实现可扩展的、具有可见性意识的共识,我们提出了一种轻量级注意力机制,能够自适应地融合相关向量,消除对指定参考视图的需求,使所有相机在立体匹配过程中均能平等贡献。对多种基准数据集的广泛实验表明,我们的方法在全局一致性、可见性意识和尺度意识的全向深度估计方面具有优越性。
cs.CV / 299 / 2603.15020

MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal

MER-Bench:多模态表情包再评估的综合基准
Nie, Yiqi, Wang, Fei, Chen, Junjie, Li, Kun, Cai, Yudi, Guo, Dan, Li, Chenglong, Wang, Meng
Abstract
Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. Our code is available at: https://github.com/one-seven17/MER-Bench.
Chinese Translation
表情包是一种紧密结合的多模态社会表达形式,其中视觉上下文和叠加文本共同传达细腻的情感和评论。受到心理学中认知再评估的启发,我们提出了表情再评估(Meme Reappraisal),这是一项新颖的多模态生成任务,旨在将负面框架的表情包转变为建设性的表情包,同时保留其基本场景、实体和结构布局。与之前关于表情包理解或生成的研究不同,表情再评估要求在多重语义和风格约束下进行可控情感、结构保留的多模态转化。为支持这一任务,我们构建了MER-Bench,这是一个具有细粒度多模态注释的真实世界表情包基准,包括源情感和目标情感、积极重写的表情包文本、视觉编辑规范,以及涵盖视觉类型、情感极性和布局结构的分类标签。我们进一步提出了基于多模态大型语言模型(MLLM)作为评判者的结构化评估框架,将性能分解为模态级生成质量、情感可控性、结构保真度和全局情感一致性。针对代表性的图像编辑和多模态生成系统的广泛实验揭示了在满足结构保留、语义一致性和情感转化约束方面存在显著差距。我们相信MER-Bench为可控表情包编辑和情感感知多模态生成的研究奠定了基础。我们的代码可在以下链接获取:https://github.com/one-seven17/MER-Bench。
cs.CV / 300 / 2603.15025

One CT Unified Model Training Framework to Rule All Scanning Protocols

一个统一的CT模型训练框架以适应所有扫描协议
Xu, Fengzhi, Yang, Ziyuan, Lu, Zexin, Chen, Yingyu, Fan, Fenglei, Shan, Hongming, Zhang, Yi
Abstract
Non-ideal measurement computed tomography (NICT), which lowers radiation at the cost of image quality, is expanding the clinical use of CT. Although unified models have shown promise in NICT enhancement, most methods require paired data, which is an impractical demand due to inevitable organ motion. Unsupervised approaches attempt to overcome this limitation, but their assumption of homogeneous noise neglects the variability of scanning protocols, leading to poor generalization and potential model collapse. We further observe that distinct scanning protocols, which correspond to different physical imaging processes, produce discrete sub-manifolds in the feature space, contradicting these assumptions and limiting their effectiveness. To address this, we propose an Uncertainty-Guided Manifold Smoothing (UMS) framework to bridge the gaps between sub-manifolds. A classifier in UMS identifies sub-manifolds and predicts uncertainty scores, which guide the generation of diverse samples across the entire manifold. By leveraging the classifier's capability, UMS effectively fills the gaps between discrete sub-manifolds, and promotes a continuous and dense feature space. Due to the complexity of the global manifold, it's hard to directly model it. Therefore, we propose to dynamically incorporate the global- and sub-manifold-specific features. Specifically, we design a global- and sub-manifold-driven architecture guided by the classifier, which enables dynamic adaptation to subdomain variations. This dynamic mechanism improves the network's capacity to capture both shared and domain-specific features, thereby improving reconstruction performance. Extensive experiments on public datasets are conducted to validate the effectiveness of our method across different generation paradigms.
Chinese Translation
非理想测量计算机断层扫描(NICT)在降低辐射的同时牺牲了图像质量,正在扩大CT的临床应用。尽管统一模型在NICT增强方面显示出前景,但大多数方法需要配对数据,这在不可避免的器官运动情况下是一个不切实际的要求。无监督方法试图克服这一限制,但它们对均匀噪声的假设忽视了扫描协议的变化,导致泛化能力差和潜在的模型崩溃。我们进一步观察到,不同的扫描协议对应于不同的物理成像过程,在特征空间中产生离散的子流形,这与这些假设相矛盾,限制了它们的有效性。为了解决这个问题,我们提出了一种不确定性引导的流形平滑(UMS)框架,以弥合子流形之间的差距。UMS中的分类器识别子流形并预测不确定性分数,这些分数指导在整个流形上生成多样化的样本。通过利用分类器的能力,UMS有效填补了离散子流形之间的空白,并促进了连续和密集的特征空间。由于全局流形的复杂性,直接建模是困难的。因此,我们提议动态地结合全局和子流形特征。具体而言,我们设计了一种由分类器引导的全局和子流形驱动的架构,使其能够动态适应子域变化。这种动态机制提高了网络捕捉共享特征和领域特定特征的能力,从而改善了重建性能。在公共数据集上进行的广泛实验验证了我们方法在不同生成范式中的有效性。
人工智能 (Artificial Intelligence)
113
cs.AI / 1 / 2603.13236

Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment

人类在代理性、误用与失调情境下对人工智能的因果归因
Carro, Maria Victoria, Lagnado, David
Abstract
AI-related incidents are becoming increasingly frequent and severe, ranging from safety failures to misuse by malicious actors. In such complex situations, identifying which elements caused an adverse outcome, the problem of cause selection, is a critical first step for establishing liability. This paper investigates folk perceptions of causal responsibility in causal chain structures when AI systems are involved in harmful outcomes. We conduct human experiments to examine judgments of causality, blame, foreseeability, and counterfactual reasoning. Our findings show that: (1) When AI agency was moderate (human sets the goal, AI determines the means) or high (AI sets the goal and the means), participants attributed greater causal responsibility to the AI. However, under low AI agency (where a human sets both the goal and means), participants assigned greater causal responsibility to the human despite their temporal distance from the outcome and despite both agents intending it, suggesting an effect of autonomy; (2) When we reversed roles between human and AI, participants consistently judged the human as more causal, even when both agents perform the same action; (3) The developer, despite being distant in the chain, was judged highly causal, reducing causal attributions to the human user but not to the AI; (4) Decomposing the AI into a large language model and an agentic component showed that the agentic part was judged as more causal in the chain. Overall, our research provides evidence on how people perceive the causal contribution of AI in both misuse and misalignment scenarios, and how these judgments interact with the roles of users and developers, key actors in assigning responsibility. These findings can inform the design of liability frameworks for AI-caused harms and shed light on how intuitive judgments shape social and policy debates surrounding real-world AI-related incidents.
Chinese Translation
与人工智能相关的事件日益频繁且严重,从安全失败到恶意行为者的误用。在这样的复杂情况下,识别哪些要素导致了不良结果,即因果选择问题,是建立责任的关键第一步。本文研究了当人工智能系统涉及有害结果时,公众对因果链结构中的因果责任的感知。我们进行人类实验,以检查因果性、指责、可预见性和反事实推理的判断。我们的研究发现:(1)当人工智能的代理性处于中等水平(由人类设定目标,人工智能决定手段)或高水平(人工智能设定目标和手段)时,参与者将更大的因果责任归因于人工智能。然而,在低代理性下(由人类同时设定目标和手段),尽管人类在因果链中距离结果较远,且两个代理均有意促成该结果,参与者仍将更大的因果责任归于人类,这表明了自主性的影响;(2)当我们反转人类与人工智能的角色时,参与者一致认为人类更具因果性,即使两个代理执行相同的行动;(3)尽管开发者在链中相对遥远,但仍被认为具有高度的因果性,从而减少了对人类用户的因果归因,而对人工智能的归因则未受影响;(4)将人工智能拆解为大型语言模型和具有代理性的组件显示,具有代理性的部分在因果链中被认为更具因果性。总体而言,我们的研究提供了证据,说明人们如何感知人工智能在误用和失调场景中的因果贡献,以及这些判断如何与用户和开发者的角色相互作用,这些都是在界定责任时的重要参与者。这些发现可以为设计人工智能引发的伤害的责任框架提供指导,并阐明直觉判断如何塑造围绕现实世界人工智能相关事件的社会和政策辩论。
cs.AI / 2 / 2603.13237

A Dual-Path Generative Framework for Zero-Day Fraud Detection in Banking Systems

用于银行系统零日欺诈检测的双路径生成框架
Ismail, Nasim Abdirahman, Karaarslan, Enis
Abstract
High-frequency banking environments face a critical trade-off between low-latency fraud detection and the regulatory explainability demanded by GDPR. Traditional rule-based and discriminative models struggle with "zero-day" attacks due to extreme class imbalance and the lack of historical precedents. This paper proposes a Dual-Path Generative Framework that decouples real-time anomaly detection from offline adversarial training. The architecture employs a Variational Autoencoder (VAE) to establish a legitimate transaction manifold based on reconstruction error, ensuring <50ms inference latency. In parallel, an asynchronous Wasserstein GAN with Gradient Penalty (WGAN-GP) synthesizes high-entropy fraudulent scenarios to stress-test the detection boundaries. Crucially, to address the non-differentiability of discrete banking data (e.g., Merchant Category Codes), we integrate a Gumbel-Softmax estimator. Furthermore, we introduce a trigger-based explainability mechanism where SHAP (Shapley Additive Explanations) is activated only for high-uncertainty transactions, reconciling the computational cost of XAI with real-time throughput requirements.
Chinese Translation
高频银行环境面临低延迟欺诈检测与GDPR要求的监管可解释性之间的关键权衡。传统的基于规则和判别模型在面对“零日”攻击时,由于极端类别不平衡和缺乏历史先例而难以应对。本文提出了一种双路径生成框架,解耦实时异常检测与离线对抗训练。该架构采用变分自编码器(Variational Autoencoder, VAE)根据重建误差建立合法交易流形,确保推理延迟小于50毫秒。同时,采用异步Wasserstein GAN与梯度惩罚(WGAN-GP)合成高熵欺诈场景,以压力测试检测边界。关键是,为了解决离散银行数据(例如商户类别代码)的非可微性问题,我们集成了一种Gumbel-Softmax估计器。此外,我们引入了一种基于触发的可解释性机制,仅对高不确定性交易激活SHAP(Shapley加性解释),以平衡计算开销与实时吞吐量要求。
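The Gumbel-Softmax estimator named in the abstract is a standard technique for back-propagating through discrete fields such as Merchant Category Codes. A minimal NumPy sketch of the forward pass only (the paper's temperature schedule and how it plugs into the WGAN-GP generator are not specified in the abstract):

import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Sample a relaxed one-hot vector from categorical logits."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))        # Gumbel(0, 1) noise
    y = (logits + g) / tau         # as tau -> 0, the sample approaches one-hot
    y = np.exp(y - y.max())
    return y / y.sum()

As temperature tau anneals toward zero during training, samples harden toward discrete codes while gradients remain defined through the softmax relaxation.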
cs.AI / 3 / 2603.13239

Benchmarking Zero-Shot Reasoning Approaches for Error Detection in Solidity Smart Contracts

针对Solidity智能合约错误检测的零样本推理方法基准评估
Sardenberg, Eduardo, Busson, Antonio José Grandson, Moraes, Daniel de Sousa, Colcher, Sérgio
Abstract
Smart contracts play a central role in blockchain systems by encoding financial and operational logic. Still, their susceptibility to subtle security flaws poses significant risks of financial loss and erosion of trust. LLMs create new opportunities for automating vulnerability detection, yet the effectiveness of different prompting strategies and model choices in real-world contexts remains uncertain. This paper evaluates state-of-the-art LLMs on Solidity smart contract analysis using a balanced dataset of 400 contracts under two tasks: (i) Error Detection, where the model performs binary classification to decide whether a contract is vulnerable, and (ii) Error Classification, where the model must assign the predicted issue to a specific vulnerability category. Models are evaluated using zero-shot prompting strategies, including zero-shot, zero-shot Chain-of-Thought (CoT), and zero-shot Tree-of-Thought (ToT). In the Error Detection task, CoT and ToT substantially increase recall (often approaching $\approx 95$--$99\%$), but typically reduce precision, indicating a more sensitive decision regime with more false positives. In the Error Classification task, Claude 3 Opus attains the best Weighted F1-score (90.8) under the ToT prompt, followed closely by its CoT.
Chinese Translation
智能合约通过编码金融与操作逻辑,在区块链系统中扮演着核心角色。然而,它们易受微妙安全缺陷的影响,带来了显著的财务损失和信任侵蚀风险。大型语言模型(LLMs)为自动化漏洞检测创造了新的机会,但在现实世界背景下,不同提示策略和模型选择的有效性仍然不确定。本文评估了最先进的LLMs在Solidity智能合约分析中的表现,使用了一个包含400个合约的平衡数据集,涉及两个任务:(i)错误检测,其中模型执行二分类以决定合约是否存在漏洞,以及(ii)错误分类,其中模型必须将预测的问题分配到特定的漏洞类别。模型使用零样本提示策略进行评估,包括零样本、零样本思维链(Chain-of-Thought, CoT)和零样本思维树(Tree-of-Thought, ToT)。在错误检测任务中,CoT和ToT显著提高了召回率(通常接近95%--99%),但通常降低了精确率,表明决策机制更加敏感,假阳性增多。在错误分类任务中,Claude 3 Opus在ToT提示下获得了最佳加权F1分数(90.8),紧随其后的是其CoT表现。
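The three prompting regimes compared are generic zero-shot templates. The sketch below gives hypothetical Python template strings for the detection task, purely illustrative: the paper's exact wording is not given in the abstract.

DETECT = ("You are a Solidity security auditor. Decide whether the contract "
          "below is vulnerable. Answer exactly VULNERABLE or SAFE.\n\n{contract}")
# zero-shot Chain-of-Thought: append a step-by-step cue
DETECT_COT = DETECT + "\n\nLet's think step by step before giving the final answer."
# zero-shot Tree-of-Thought: ask for multiple independent analysis branches
DETECT_TOT = DETECT + ("\n\nPropose three independent lines of analysis, "
                       "evaluate each, then give the final answer.")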
cs.AI / 4 / 2603.13243

Think First, Diffuse Fast: Improving Diffusion Language Model Reasoning via Autoregressive Plan Conditioning

先思考,再快速扩散:通过自回归计划条件改善扩散语言模型推理
Sauver, Earl J St
Abstract
Diffusion large language models (dLLMs) generate text via iterative denoising but consistently underperform on multi-step reasoning. We hypothesize this gap stems from a coordination problem: AR models build coherence token-by-token, while diffusion models must coordinate all positions simultaneously. We propose plan conditioning, a training-free method that prepends a short (~100-token) natural-language plan from an AR model to the diffusion model's prompt. The plan serves as a frozen scaffold -- globally visible context that every token position can attend to from the first denoising step. On GSM8K, plan conditioning improves LLaDA-8B-Instruct from 75.6% to 87.2% (+11.6 percentage points), matching a same-size AR model (LLaMA 3.1 8B, 87.7%) despite a 6.4pp weaker baseline. On HumanEval, the gain is +12.8pp (37.2% to 50.0%), showing plans generalize to code. The same plans improve LLaMA by only +5.7pp on GSM8K and +1.3pp on HumanEval -- diffusion models benefit 2-10x more, supporting the coordination-problem hypothesis. Across 5 random seeds, plan-conditioned GSM8K accuracy has zero standard deviation, making diffusion inference highly stable. Ablations reveal the model follows plan strategy (wrong-strategy plans cause -16.3pp) but is robust to plan values (perturbed numbers: -1.1pp), and that planner quality has a sharp threshold: smaller Llama-class plans hurt (-1.6 to -6.8pp) while frontier plans provide the full lift. Attention analysis confirms the mechanism: plan tokens receive 1.8x excess attention during early denoising, declining to uniform as completion tokens solidify. Plan conditioning costs ~$0.002 per problem and adds ~2s of latency.
Chinese Translation
扩散大型语言模型(dLLMs)通过迭代去噪生成文本,但在多步骤推理上表现不佳。我们假设这一差距源于协调问题:自回归(AR)模型逐个标记地构建一致性,而扩散模型必须同时协调所有位置。我们提出了计划条件化(plan conditioning),这是一种无需训练的方法,它将来自自回归模型的简短(约100个标记)自然语言计划预先添加到扩散模型的提示中。该计划作为一个冻结的支架——一个全局可见的上下文,所有标记位置从第一个去噪步骤开始即可关注它。在GSM8K数据集上,计划条件化使LLaDA-8B-Instruct的准确率从75.6%提高到87.2%(增加11.6个百分点),尽管基线弱6.4个百分点,仍与同规模的自回归模型(LLaMA 3.1 8B,87.7%)相匹配。在HumanEval上,增益为12.8个百分点(从37.2%提高到50.0%),显示计划可以泛化到代码生成。相同的计划在GSM8K上仅使LLaMA提高了5.7个百分点,在HumanEval上提高了1.3个百分点——扩散模型的收益是自回归模型的2-10倍,支持了协调问题假设。在5个随机种子下,计划条件化的GSM8K准确率标准差为零,使扩散推理高度稳定。消融实验表明,模型遵循计划策略(错误策略的计划导致-16.3个百分点),但对计划中的数值具有鲁棒性(扰动数字:-1.1个百分点),并且计划者质量具有明显的阈值:较小的Llama类计划会造成负面影响(-1.6到-6.8个百分点),而前沿模型的计划则提供了全部提升。注意力分析证实了这一机制:在早期去噪过程中,计划标记获得了1.8倍的过量注意力,随着完成标记的固化,注意力逐渐趋于均匀。计划条件化的成本约为每个问题$0.002,并增加约2秒的延迟。
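Because plan conditioning is training-free, the whole method reduces to prompt assembly. A schematic sketch under assumed interfaces: ar_generate and diffusion_generate are placeholders for the two models' generation calls, and the prompt wording is an assumption.

def plan_conditioned_answer(problem, ar_generate, diffusion_generate):
    # step 1: a short (~100-token) natural-language plan from an AR model
    plan = ar_generate(
        f"Write a brief step-by-step plan to solve:\n{problem}", max_tokens=100)
    # step 2: prepend the plan as a frozen scaffold; every token position of the
    # diffusion model can attend to it from the first denoising step
    prompt = f"Plan:\n{plan}\n\nProblem: {problem}\nAnswer:"
    return diffusion_generate(prompt)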
cs.AI / 5 / 2603.13245

Automating Document Intelligence in Statutory City Planning

在法定城市规划中自动化文档智能
Malmqvist, Lars, Barber, Robin
Abstract
UK planning authorities face a legislative conflict between the Planning Act, which mandates public access to application documents, and the Data Protection Act, which requires protection of personal information. This situation creates a manually intensive workload for processing large document volumes, diverting planning officers to administrative tasks and creating legal compliance risks. This paper presents an integrated AI system designed to address these challenges. The system automates the identification and redaction of personal information, extracts key metadata from planning documents, and analyzes architectural drawings for specified features. It operates with an AI-in-the-Loop (AI2L) design, presenting all suggestions for review and confirmation by planning officers directly within their existing software; no action is committed without explicit human approval. The system is designed to improve its performance over time by learning from this human oversight through active learning prioritization rather than autoapproval. The system is currently being piloted at four diverse UK local authorities. The paper details the system design, the AI2L workflow, and the evaluation framework used in the pilot. Additionally, it describes a preliminary Return on Investment (ROI) model developed to quantify potential savings and secure partner participation. This work provides a case study on deploying AI to reduce administrative burden and manage compliance risk in a public sector environment.
Chinese Translation
英国规划机构面临《规划法》和《数据保护法》之间的立法冲突,《规划法》要求公众获取申请文件,而《数据保护法》则要求保护个人信息。这种情况导致在处理大量文档时需要投入大量人工,将规划官员分流到行政任务上,并带来法律合规风险。本文提出了一种集成的人工智能系统,旨在解决这些挑战。该系统自动识别并遮蔽(redact)个人信息,从规划文件中提取关键元数据,并分析建筑图纸中的特定特征。它采用了AI-in-the-Loop (AI2L)设计,所有建议都在规划官员现有软件中直接呈现,供其审核和确认;在没有明确人类批准的情况下,不会执行任何操作。该系统旨在通过主动学习优先级而非自动批准,随着时间的推移从这种人类监督中学习,从而提高性能。该系统目前正在四个不同的英国地方政府进行试点。本文详细介绍了系统设计、AI2L工作流程以及试点中使用的评估框架。此外,还描述了一个初步的投资回报率(ROI)模型,该模型旨在量化潜在节省并确保合作伙伴参与。这项工作提供了一个案例研究,展示了如何在公共部门环境中部署人工智能以减少行政负担和管理合规风险。
cs.AI / 6 / 2603.13246

Multi-Axis Trust Modeling for Interpretable Account Hijacking Detection

用于可解释账户劫持检测的多轴信任建模
AL-Smadi, Mohammad
Abstract
This paper proposes a Hadith-inspired multi-axis trust modeling framework, motivated by a structurally analogous problem in classical Hadith scholarship: assessing the trustworthiness of information sources using interpretable, multidimensional criteria rather than a single anomaly score. We translate five trust axes - long-term integrity (adalah), behavioral precision (dabt), contextual continuity (isnad), cumulative reputation, and anomaly evidence - into a compact set of 26 semantically meaningful behavioral features for user accounts. In addition, we introduce lightweight temporal features that capture short-horizon changes in these trust signals across consecutive activity windows. We evaluate the framework on the CLUE-LDS cloud activity dataset with injected account hijacking scenarios. On 23,094 sliding windows, a Random Forest trained on the trust features achieves near-perfect detection performance, substantially outperforming models based on raw event counts, minimal statistical baselines, and unsupervised anomaly detection. Temporal features provide modest but consistent gains on CLUE-LDS, confirming their compatibility with the static trust representation. To assess robustness under more challenging conditions, we further evaluate the approach on the CERT Insider Threat Test Dataset r6.2, which exhibits extreme class imbalance and sparse malicious behavior. On a 500-user CERT subset, temporal features improve ROC-AUC from 0.776 to 0.844. On a leakage-controlled 4,000-user configuration, temporal modeling yields a substantial and consistent improvement over static trust features alone (ROC-AUC 0.627 to 0.715; PR-AUC 0.072 to 0.264).
Chinese Translation
本文提出了一种受圣训学(Hadith scholarship)启发的多轴信任建模框架,其动机源于经典圣训学研究中的一个结构相似问题:使用可解释的多维标准评估信息来源的可信度,而不是仅仅依赖单一的异常分数。我们将五个信任轴——长期诚信(adalah)、行为精确性(dabt)、上下文连续性(isnad)、累积声誉和异常证据——转化为一组包含26个语义明确的用户账户行为特征。此外,我们还引入了轻量级的时间特征,捕捉这些信任信号在连续活动窗口中的短期变化。我们在注入账户劫持场景的CLUE-LDS云活动数据集上评估了该框架。在23,094个滑动窗口上,基于信任特征训练的随机森林模型实现了近乎完美的检测性能,显著优于基于原始事件计数的模型、最小统计基线和无监督异常检测。时间特征在CLUE-LDS数据集上提供了适度但持续的增益,确认了其与静态信任表示之间的兼容性。为了在更具挑战性的条件下评估鲁棒性,我们进一步在CERT内部威胁测试数据集r6.2上评估该方法,该数据集表现出极端类别不平衡和稀疏的恶意行为。在500用户的CERT子集上,时间特征将ROC-AUC从0.776提高到0.844。在一个控制泄漏的4000用户配置中,时间建模相比单独的静态信任特征显著且持续地改善了性能(ROC-AUC从0.627提高到0.715;PR-AUC从0.072提高到0.264)。
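The lightweight temporal features are described only as short-horizon changes of the static trust signals across consecutive windows. One plausible construction, an assumption rather than the paper's definition, is to append window-to-window first differences:

import numpy as np

def with_temporal_deltas(F):
    """F: (n_windows, n_features) static trust features per activity window.
    Returns windows 1..n-1 augmented with their change from the previous window."""
    deltas = np.diff(F, axis=0)                 # short-horizon change signals
    return np.concatenate([F[1:], deltas], axis=1)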
cs.AI / 7 / 2603.13247

ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems

ILION:面向自主人工智能系统的确定性预执行安全门
Chitan, Florin Adrian
Abstract
The proliferation of autonomous AI agents capable of executing real-world actions - filesystem operations, API calls, database modifications, financial transactions - introduces a class of safety risk not addressed by existing content-moderation infrastructure. Current text-safety systems evaluate linguistic content for harm categories such as violence, hate speech, and sexual content; they are architecturally unsuitable for evaluating whether a proposed action falls within an agent's authorized operational scope. We present ILION (Intelligent Logic Identity Operations Network), a deterministic execution gate for agentic AI systems. ILION employs a five-component cascade architecture - Transient Identity Imprint (TII), Semantic Vector Reference Frame (SVRF), Identity Drift Control (IDC), Identity Resonance Score (IRS) and Consensus Veto Layer (CVL) - to classify proposed agent actions as BLOCK or ALLOW without statistical training or API dependencies. The system requires zero labeled data, operates in sub-millisecond latency, and produces fully interpretable verdicts. We evaluate ILION on ILION-Bench v2, a purpose-built benchmark of 380 test scenarios across eight attack categories with 39% hard-difficulty adversarial cases and a held-out development split. ILION achieves F1 = 0.8515, precision = 91.0%, and a false positive rate of 7.9% at a mean latency of 143 microseconds. Comparative evaluation against three baselines - Lakera Guard (F1 = 0.8087), OpenAI Moderation API (F1 = 0.1188), and Llama Guard 3 (F1 = 0.0105) - demonstrates that existing text-safety infrastructure systematically fails on agent execution safety tasks due to a fundamental task mismatch. ILION outperforms the best commercial baseline by 4.3 F1 points while operating 2,000 times faster with a false positive rate four times lower.
Chinese Translation
自主人工智能代理的快速普及,使其能够执行现实世界的操作——文件系统操作、API调用、数据库修改、金融交易——引入了一类现有内容审核基础设施未能解决的安全风险。当前的文本安全系统评估语言内容的危害类别,如暴力、仇恨言论和色情内容;它们在架构上不适合评估提议的行动是否在代理的授权操作范围内。我们提出了ILION(智能逻辑身份操作网络),这是一个面向自主人工智能系统的确定性执行门。ILION采用五个组件的级联架构——瞬态身份印记(Transient Identity Imprint, TII)、语义向量参考框架(Semantic Vector Reference Frame, SVRF)、身份漂移控制(Identity Drift Control, IDC)、身份共鸣评分(Identity Resonance Score, IRS)和共识否决层(Consensus Veto Layer, CVL)——对提议的代理操作进行分类,标记为阻止(BLOCK)或允许(ALLOW),而无需统计训练或API依赖。该系统不需要标记数据,以亚毫秒的延迟运行,并生成完全可解释的裁决。我们在ILION-Bench v2上评估ILION,这是一个专门构建的基准,包含380个测试场景,涵盖八个攻击类别,其中39%的对抗案例难度较高,并有一个保留的开发集。ILION的F1得分为0.8515,精确率为91.0%,假阳性率为7.9%,平均延迟为143微秒。与三个基线进行的比较评估——Lakera Guard(F1 = 0.8087)、OpenAI Moderation API(F1 = 0.1188)和Llama Guard 3(F1 = 0.0105)——表明现有的文本安全基础设施在代理执行安全任务上系统性失败,原因在于任务的根本不匹配。ILION在性能上超越了最佳商业基线4.3个F1点,同时运行速度快2000倍,假阳性率低四倍。
cs.AI / 8 / 2603.13251

ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation

ManiBench:用于测试 Manim 代码生成中的视觉逻辑漂移和语法幻觉的基准测试
Oli, Nabin
Abstract
Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version-aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150-200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four-tier framework measuring Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score. An open-source framework automates evaluation across multiple models and prompting strategies. Code, data and benchmark suite are available at https://github.com/nabin2004/ManiBench. and the dataset is hosted on https://huggingface.co/datasets/nabin2004/ManiBench.
Chinese Translation
传统基准测试如 HumanEval 和 MBPP 在逻辑和语法测试方面表现良好,但在代码需要生成动态教学视觉效果时则显得不足。我们提出了 ManiBench,这是一个专门的基准测试,用于评估大型语言模型(LLM)在生成 Manim CE 代码时的表现,其中时间保真度和版本感知的 API 正确性至关重要。ManiBench 主要针对两种关键失败模式:语法幻觉(有效的 Python 代码引用不存在或已弃用的 Manim API)和视觉逻辑漂移(生成的视觉效果因时间错误或缺失因果关系而偏离预期的数学逻辑)。该基准测试包含 150-200 道题目,分为五个难度级别,主题涵盖微积分、线性代数、概率论、拓扑学和人工智能,并基于对 3Blue1Brown 的 ManimGL 源代码(53,000 行,143 个场景类)的分析。评估采用四级框架,测量可执行性、版本冲突错误率、对齐得分和覆盖得分。一个开源框架自动化了对多个模型和提示策略的评估。代码、数据和基准测试套件可在 https://github.com/nabin2004/ManiBench 获取,数据集托管在 https://huggingface.co/datasets/nabin2004/ManiBench。
cs.AI / 9 / 2603.13252

When Alpha Breaks: Two-Level Uncertainty for Safe Deployment of Cross-Sectional Stock Rankers

当阿尔法失效时:横截面股票排序模型安全部署的双层不确定性
Sanderink, Ursina
Abstract
Cross-sectional ranking models are often deployed as if point predictions were sufficient: the model outputs scores and the portfolio follows the induced ordering. Under non-stationarity, rankers can fail during regime shifts. In the AI Stock Forecaster, a LightGBM ranker performs well overall at a 20-day horizon, yet the 2024 holdout coincides with an AI thematic rally and sector rotation that breaks the signal at longer horizons and weakens 20d. This motivates treating deployment as two decisions: (i) whether the strategy should trade at all, and (ii) how to control risk within active trades. We adapt Direct Epistemic Uncertainty Prediction (DEUP) to ranking by predicting rank displacement and defining an epistemic uncertainty signal ehat relative to a point-in-time (PIT-safe) baseline. Empirically, ehat is structurally coupled with signal strength (median correlation between ehat and absolute score is about 0.6 across 1,865 dates), so inverse-uncertainty sizing de-levers the strongest signals and degrades performance. To address this, we propose a two-level deployment policy: a strategy-level regime-trust gate G(t) that decides whether to trade (AUROC around 0.72 overall and 0.75 in FINAL) and a position-level epistemic tail-risk cap that reduces exposure only for the most uncertain predictions. The operational policy, trade only when G(t) is at least 0.2, apply volatility sizing on active dates, and cap the top epistemic tail, improves risk-adjusted performance in the 20d policy comparison and indicates DEUP adds value mainly as a tail-risk guard rather than a continuous sizing denominator.
Chinese Translation
横截面排名模型在部署时通常默认点预测已经足够:模型输出分数,投资组合遵循由此诱导的排序。在非平稳性下,排名模型可能在市场状态(regime)切换期间失效。在AI股票预测器中,LightGBM排名模型在20天的预测期内整体表现良好,但2024年的保留样本恰逢AI主题行情与板块轮动,这使较长预测期的信号失效并削弱了20天信号。这促使我们将部署视为两个决策:(i)策略是否应该进行交易,以及(ii)如何在主动交易中控制风险。我们将直接认知不确定性预测(Direct Epistemic Uncertainty Prediction, DEUP)适配到排序任务,通过预测排名位移,并定义相对于时间点安全(PIT-safe)基线的认知不确定性信号ehat。实证结果表明,ehat与信号强度在结构上耦合(在1,865个日期中,ehat与分数绝对值的中位相关性约为0.6),因此与不确定性成反比的仓位调整会降低最强信号的杠杆并损害表现。为了解决这个问题,我们提出了一种双层部署策略:一个策略层面的状态信任门G(t),决定是否交易(整体AUROC约为0.72,FINAL阶段为0.75),以及一个头寸层面的认知尾部风险上限,仅对最不确定的预测降低敞口。操作策略为:仅在G(t)不低于0.2时交易,在活跃日期应用波动率仓位调整,并对认知不确定性最高的尾部设置上限;该策略在20天的策略比较中改善了风险调整后的表现,并表明DEUP主要作为尾部风险防护、而非连续仓位分母来增加价值。
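The operational policy composes a strategy-level gate with position-level controls. A schematic NumPy sketch; the 50% tail haircut and the input shapes are assumptions, since the abstract only states that tail exposure is reduced:

import numpy as np

def position_weights(scores, ehat, vol, G_t, gate=0.2, tail_q=0.9):
    """scores, ehat, vol: per-stock arrays; G_t: scalar regime-trust score."""
    if G_t < gate:                    # strategy-level gate: stand aside entirely
        return np.zeros_like(scores)
    w = scores / vol                  # volatility sizing on active dates
    cap = np.quantile(ehat, tail_q)   # epistemic tail threshold
    w[ehat > cap] *= 0.5              # position-level tail-risk cap (assumed haircut)
    return w

The key design point carried over from the paper: ehat gates only the uncertain tail instead of serving as a continuous sizing denominator, which would de-lever exactly the strongest signals.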
cs.AI / 10 / 2603.13257

Distilling Deep Reinforcement Learning into Interpretable Fuzzy Rules: An Explainable AI Framework

将深度强化学习提炼为可解释的模糊规则:一个可解释的人工智能框架
Araballi, Sanup S., Khan, Simon, Mohan, Chilukuri K.
Abstract
Deep Reinforcement Learning (DRL) agents achieve remarkable performance in continuous control but remain opaque, hindering deployment in safety-critical domains. Existing explainability methods either provide only local insights (SHAP, LIME) or employ over-simplified surrogates failing to capture continuous dynamics (decision trees). This work proposes a Hierarchical Takagi-Sugeno-Kang (TSK) Fuzzy Classifier System (FCS) distilling neural policies into human-readable IF-THEN rules through K-Means clustering for state partitioning and Ridge Regression for local action inference. Three quantifiable metrics are introduced: Fuzzy Rule Activation Density (FRAD) measuring explanation focus, Fuzzy Set Coverage (FSC) validating vocabulary completeness, and Action Space Granularity (ASG) assessing control mode diversity. Dynamic Time Warping (DTW) validates temporal behavioral fidelity. Empirical evaluation on \textit{Lunar Lander(Continuous)} shows the Triangular membership function variant achieves 81.48\% $\pm$ 0.43\% fidelity, outperforming Decision Trees by 21 percentage points. The framework exhibits statistically superior interpretability (FRAD = 0.814 vs. 0.723 for Gaussian, $p < 0.001$) with low MSE (0.0053) and DTW distance (1.05). Extracted rules such as ``IF lander drifting left at high altitude THEN apply upward thrust with rightward correction'' enable human verification, establishing a pathway toward trustworthy autonomous systems.
Chinese Translation
深度强化学习(DRL)代理在连续控制中表现出色,但仍然不透明,阻碍了在安全关键领域的应用。现有的可解释性方法要么仅提供局部洞察(如SHAP、LIME),要么采用过于简化的替代模型,无法捕捉连续动态(如决策树)。本研究提出了一种分层Takagi-Sugeno-Kang(TSK)模糊分类系统(FCS),通过K-Means聚类进行状态划分,并通过岭回归进行局部动作推理,将神经策略提炼为人类可读的IF-THEN规则。引入了三种可量化指标:模糊规则激活密度(FRAD)用于测量解释集中度,模糊集合覆盖度(FSC)用于验证词汇的完整性,以及动作空间粒度(ASG)用于评估控制模式的多样性。动态时间规整(DTW)用于验证时间行为的保真度。对Lunar Lander(Continuous)的实证评估表明,三角形隶属函数变体的保真度达到81.48% ± 0.43%,比决策树高出21个百分点。该框架展示了统计上优越的可解释性(FRAD = 0.814,对比高斯的0.723,$p < 0.001$),且均方误差(MSE)低(0.0053)、DTW距离(1.05)小。提取的规则,如"如果着陆器在高空向左漂移,则施加向上的推力并进行向右修正",使人类能够进行验证,为可信的自主系统铺平了道路。
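The distillation recipe is explicit: K-Means partitions the state space and Ridge regression fits a local linear action model per partition. A compact scikit-learn sketch; note that the crisp nearest-prototype firing below simplifies the fuzzy membership blending a full TSK system would use:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def distill_policy(states, actions, n_rules=8):
    """states: (n, d) visited states; actions: (n, a) teacher-policy actions."""
    km = KMeans(n_clusters=n_rules, n_init=10, random_state=0).fit(states)
    rules = [Ridge(alpha=1.0).fit(states[km.labels_ == k], actions[km.labels_ == k])
             for k in range(n_rules)]
    return km, rules

def act(km, rules, s):
    k = int(km.predict(s.reshape(1, -1))[0])        # IF state near prototype k ...
    return rules[k].predict(s.reshape(1, -1))[0]    # ... THEN local linear action

Each (cluster, Ridge) pair reads directly as an IF-THEN rule: the cluster prototype is the antecedent, the linear consequent gives the action.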
cs.AI / 11 / 2603.13261

Deep Convolutional Architectures for EEG Classification: A Comparative Study with Temporal Augmentation and Confidence-Based Voting

用于脑电图分类的深度卷积架构:带有时间增强和基于置信度投票的比较研究
Patodiya, Aryan, Cecotti, Hubert
Abstract
Electroencephalography (EEG) classification plays a key role in brain-computer interface (BCI) systems, yet it remains challenging due to the low signal-to-noise ratio, temporal variability of neural responses, and limited data availability. In this paper, we present a comparative study of deep learning architectures for classifying event-related potentials (ERPs) in EEG signals. The preprocessing pipeline includes bandpass filtering, spatial filtering, and normalization. We design and compare three main pipelines: a 2D convolutional neural network (CNN) using Common Spatial Pattern (CSP), a second 2D CNN trained directly on raw data for a fair comparison, and a 3D CNN that jointly models spatiotemporal representations. To address ERP latency variations, we introduce a temporal shift augmentation strategy during training. At inference time, we employ a confidence-based test-time voting mechanism to improve prediction stability across shifted trials. An experimental evaluation on a stratified five-fold cross-validation protocol demonstrates that while CSP provides a benefit to the 2D architecture, the proposed 3D CNN significantly outperforms both 2D variants in terms of AUC and balanced accuracy. These findings highlight the effectiveness of temporal-aware architectures and augmentation strategies for robust EEG signal classification.
Chinese Translation
脑电图(EEG)分类在脑-机接口(BCI)系统中发挥着关键作用,但由于信噪比低、神经反应的时间变异性以及数据可用性有限,仍然具有挑战性。本文呈现了对用于分类脑电图信号中事件相关电位(ERP)的深度学习架构的比较研究。预处理流程包括带通滤波、空间滤波和归一化。我们设计并比较了三条主要流程:使用常见空间模式(CSP)的二维卷积神经网络(CNN)、直接在原始数据上训练的第二个二维CNN以进行公平比较,以及联合建模时空表示的三维CNN。为了解决ERP延迟变化的问题,我们在训练期间引入了一种时间位移增强策略。在推理阶段,我们采用基于置信度的测试时间投票机制,以提高在位移试验中的预测稳定性。在分层五折交叉验证协议上的实验评估表明,尽管CSP对二维架构有益,但所提出的三维CNN在AUC和均衡准确率方面显著优于两个二维变体。这些发现突显了时间感知架构和增强策略在稳健的脑电图信号分类中的有效性。
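Temporal shift augmentation and the confidence-based test-time vote can be sketched directly; the shift grid and the confidence weighting below are assumptions, as the abstract does not fix them:

import numpy as np

SHIFTS = (-4, -2, 0, 2, 4)   # assumed sample offsets along the time axis

def shifted_views(x):
    """x: (channels, time) single EEG trial; roll along the time axis."""
    return [np.roll(x, s, axis=-1) for s in SHIFTS]

def confidence_vote(predict_proba, x):
    """predict_proba: callable mapping one trial to a (n_classes,) probability vector."""
    probs = np.stack([predict_proba(v) for v in shifted_views(x)])
    conf = probs.max(axis=1, keepdims=True)       # per-view confidence weight
    return int(np.argmax((probs * conf).sum(axis=0)))

The same SHIFTS grid doubles as the training-time augmentation: each trial is rolled by a random offset so the network sees the ERP at varying latencies.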
cs.AI / 12 / 2603.13266

Multi-hop Reasoning and Retrieval in Embedding Space: Leveraging Large Language Models with Knowledge

嵌入空间中的多跳推理与检索:利用大型语言模型与知识
Liu, Lihui
Abstract
As large language models (LLMs) continue to grow in size, their abilities to tackle complex tasks have significantly improved. However, issues such as hallucination and the lack of up-to-date knowledge largely remain unresolved. Knowledge graphs (KGs), which serve as symbolic representations of real-world knowledge, offer a reliable source for enhancing reasoning. Integrating KG retrieval into LLMs can therefore strengthen their reasoning by providing dependable knowledge. Nevertheless, due to limited understanding of the underlying knowledge graph, LLMs may struggle with queries that have multiple interpretations. Additionally, the incompleteness and noise within knowledge graphs may result in retrieval failures. To address these challenges, we propose an embedding-based retrieval reasoning framework EMBRAG. In this approach, the model first generates multiple logical rules grounded in knowledge graphs based on the input query. These rules are then applied to reasoning in the embedding space, guided by the knowledge graph, ensuring more robust and accurate reasoning. A reranker model further interprets these rules and refines the results. Extensive experiments on two benchmark KGQA datasets demonstrate that our approach achieves the new state-of-the-art performance in KG reasoning tasks.
Chinese Translation
随着大型语言模型(LLMs)规模的不断扩大,它们处理复杂任务的能力显著提高。然而,幻觉现象和缺乏最新知识等问题仍然在很大程度上未得到解决。知识图谱(KGs)作为现实世界知识的符号表示,提供了一个可靠的增强推理的来源。因此,将知识图谱检索集成到LLMs中可以通过提供可靠的知识来增强其推理能力。然而,由于对基础知识图谱的理解有限,LLMs可能在处理具有多重解释的查询时遇到困难。此外,知识图谱中的不完整性和噪声可能导致检索失败。为了解决这些挑战,我们提出了一种基于嵌入的检索推理框架EMBRAG。在此方法中,模型首先根据输入查询生成多个基于知识图谱的逻辑规则。然后,这些规则被应用于嵌入空间中的推理,受到知识图谱的指导,从而确保更稳健和准确的推理。一个重排序模型进一步解释这些规则并优化结果。在两个基准KGQA数据集上的广泛实验表明,我们的方法在KG推理任务中达到了新的最先进性能。
cs.AI / 13 / 2603.13288

Agent-Based User-Adaptive Filtering for Categorized Harassing Communication

面向分类骚扰通信的基于代理的用户自适应过滤
Rahaman, Zenefa, Sen, Sandip
Abstract
We propose an agent-based framework for personalized filtering of categorized harassing communication in online social networks. Unlike global moderation systems that apply uniform filtering rules, our approach models user-specific tolerance levels and preferences through adaptive filtering agents. These agents learn from user feedback and dynamically adjust filtering thresholds across multiple harassment categories, including offensive, abusive, and hateful content. We implement and evaluate the framework using supervised classification techniques and simulated user interaction data. Experimental results demonstrate that adaptive agents improve filtering precision and user satisfaction compared to static models. The proposed system illustrates how agent-based personalization can enhance content moderation while preserving user autonomy in digital social environments.
Chinese Translation
我们提出了一种基于代理的框架,用于在线社交网络中对分类骚扰通信进行个性化过滤。与应用统一过滤规则的全局审核系统不同,我们的方法通过自适应过滤代理来建模用户特定的容忍水平和偏好。这些代理通过用户反馈进行学习,并在多个骚扰类别(包括攻击性、辱骂性和仇恨内容)中动态调整过滤阈值。我们使用监督分类技术和模拟用户交互数据来实现和评估该框架。实验结果表明,与静态模型相比,自适应代理提高了过滤精度和用户满意度。所提出的系统展示了基于代理的个性化如何在增强内容审核的同时保持用户在数字社交环境中的自主性。
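The per-user, per-category threshold adaptation can be sketched in a few lines; the update rule, learning rate, and bounds are illustrative assumptions rather than the paper's specification:

import numpy as np

def update_threshold(thresholds, category, feedback, lr=0.05):
    """thresholds: dict mapping category -> filter threshold on classifier probability.
    feedback: +1 if the user wanted the message filtered (missed harassment),
              -1 if the user restored an over-filtered message."""
    t = thresholds[category] - lr * feedback      # missed content lowers the bar
    thresholds[category] = float(np.clip(t, 0.05, 0.95))
    return thresholds

A message in a category is then filtered when its classifier probability exceeds that user's current threshold, so each agent drifts toward its user's tolerance.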
cs.AI / 14 / 2603.13327

DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation

DOVA:自主研究自动化中的先行思考多智能体编排
Shen, Aaron, Shen, Alfred
Abstract
Large language model (LLM) agents have demonstrated remarkable capabilities in tool use, reasoning, and code generation, yet single-agent systems exhibit fundamental limitations when confronted with complex research tasks demanding multi-source synthesis, adversarial verification, and personalized delivery. We present DOVA (Deep Orchestrated Versatile Agent), a multi-agent platform introducing three key innovations: (1) deliberation-first orchestration, where explicit meta-reasoning precedes tool invocation, informed by a persistent user model and entity-aware conversation context; (2) hybrid collaborative reasoning, a composable three-phase pipeline unifying ensemble diversity, blackboard transparency, and iterative refinement; and (3) adaptive multi-tiered thinking, a six-level token-budget allocation scheme that reduces inference cost by 40-60% on simple tasks while preserving deep reasoning capacity. We formalize the core algorithms, present an architectural ablation study across seven system configurations, and analyze the contribution of each component to answer confidence, source coverage, and token efficiency.
Chinese Translation
大型语言模型(LLM)代理在工具使用、推理和代码生成方面表现出了卓越的能力,但当面对需要多源合成、对抗验证和个性化交付的复杂研究任务时,单一代理系统表现出根本性的局限性。我们提出了DOVA(深度编排多功能代理),这是一个多代理平台,介绍了三项关键创新:(1) 先行思考编排,其中明确的元推理在工具调用之前进行,且受持久用户模型和实体感知对话上下文的指导;(2) 混合协作推理,这是一种可组合的三阶段流程,统一了集成多样性、黑板透明度和迭代优化;(3) 自适应多层思维,这是一种六级令牌预算分配方案,在简单任务上将推理成本降低了40-60%,同时保持深度推理能力。我们对核心算法进行了形式化,呈现了七种系统配置下的架构消融研究,分析了每个组件对答案置信度、源覆盖和令牌效率的贡献。
cs.AI / 15 / 2603.13331

Why Grokking Takes So Long: A First-Principles Theory of Representational Phase Transitions

为何 Grokking 需要如此长的时间:一种基于第一原理的表征相变理论
Khanh, Truong Xuan, Hoa, Truong Quynh, Trung, Luu Duc, Duc, Phan Thanh
Abstract
Grokking is the sudden generalization that appears long after a model has perfectly memorized its training data. Although this phenomenon has been widely observed, there is still no quantitative theory explaining the length of the delay between memorization and generalization. Prior work has noted that weight decay plays an important role, but no result derives tight bounds for the delay or explains its scaling behavior. We present a first-principles theory showing that grokking arises from a norm-driven representational phase transition in regularized training dynamics. Training first converges to a high-norm memorization solution and only later contracts toward a lower-norm structured representation that generalizes. Our main result establishes a scaling law for the delay: T_grok - T_mem = Theta((1 / gamma_eff) * log(||theta_mem||^2 / ||theta_post||^2)), where gamma_eff is the effective contraction rate of the optimizer (gamma_eff = eta * lambda for SGD and gamma_eff >= eta * lambda for AdamW). The upper bound follows from a discrete Lyapunov contraction argument, and the matching lower bound arises from dynamical constraints of regularized first-order optimization. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity tasks, we confirm three predictions: inverse scaling with weight decay, inverse scaling with learning rate, and logarithmic dependence on the norm ratio (R^2 > 0.97). We further find that grokking requires an optimizer that can decouple memorization from contraction: SGD fails under hyperparameters where AdamW reliably groks. These results show that grokking is a predictable consequence of norm separation between competing interpolating representations and provide the first quantitative scaling law for the delay of grokking.
Chinese Translation
Grokking 是一种突然的泛化现象,它出现在模型完美记忆其训练数据后很久。尽管这一现象已被广泛观察,但仍缺乏定量理论来解释记忆与泛化之间的延迟时间。先前的研究指出,权重衰减在其中起着重要作用,但没有结果能推导出延迟的紧致界限或解释其缩放行为。我们提出了一种基于第一原理的理论,表明 grokking 源于正则化训练动态中的一种由范数驱动的表征相变。训练首先收敛到一个高范数的记忆解,随后才收缩到一个低范数的结构化表征,从而实现泛化。我们的主要结果建立了延迟的缩放定律:T_grok - T_mem = Θ((1 / γ_eff) * log(||θ_mem||² / ||θ_post||²)),其中 γ_eff 是优化器的有效收缩率(对于 SGD,γ_eff = η * λ;对于 AdamW,γ_eff ≥ η * λ)。上界源于离散的 Lyapunov 收缩论证,而匹配的下界则来自正则化一阶优化的动态约束。在涵盖模加法、模乘法和稀疏奇偶校验任务的 293 次训练中,我们确认了三个预测:与权重衰减的反向缩放、与学习率的反向缩放,以及对范数比的对数依赖(R² > 0.97)。我们进一步发现,grokking 需要一种能够将记忆与收缩解耦的优化器:在 AdamW 能可靠实现 grokking 的超参数下,SGD 会失败。这些结果表明,grokking 是竞争插值表征之间范数分离的可预测结果,并提供了 grokking 延迟的第一个定量缩放定律。
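The scaling law makes the delay directly computable from the hyperparameters and the norm ratio. Plugging in illustrative numbers (not taken from the paper): SGD with eta = 1e-3, lambda = 1, and a 4x contraction of the squared norm gives a delay of roughly 1,386 steps.

import numpy as np

def grokking_delay(eta, lam, norm_mem_sq, norm_post_sq):
    """T_grok - T_mem per the paper's law; SGD case where gamma_eff = eta * lambda."""
    gamma_eff = eta * lam
    return np.log(norm_mem_sq / norm_post_sq) / gamma_eff

print(grokking_delay(1e-3, 1.0, 400.0, 100.0))   # log(4)/1e-3 ~= 1386.29 steps

The inverse dependence on eta and lambda matches the paper's two empirical inverse-scaling predictions; the logarithm gives the third.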
cs.AI / 16 / 2603.13344

DyACE: Dynamic Algorithm Co-evolution for Online Automated Heuristic Design with Large Language Model

DyACE:基于大型语言模型的在线自动启发式设计的动态算法共演化
Lu, Guidong, Liu, Yiping, Zeng, Xiangxiang
Abstract
The prevailing paradigm in Automated Heuristic Design (AHD) typically relies on the assumption that a single, fixed algorithm can effectively navigate the shifting dynamics of a combinatorial search. This static approach often proves inadequate for Perturbative Heuristics, where the optimal algorithm for escaping local optima depends heavily on the specific search phase. To address this limitation, we reformulate heuristic design as a Non-stationary Bi-level Control problem and introduce DyACE (Dynamic Algorithm Co-evolution). Distinct from standard open-loop solvers, DyACE uses a Receding Horizon Control architecture to continuously co-evolve the heuristic logic alongside the solution population. A core element of this framework is the Look-Ahead Rollout Search, which queries the landscape geometry to extract Search Trajectory Features. This sensory feedback allows the Large Language Model (LLM) to function as a grounded meta-controller, prescribing phase-specific interventions tailored to the real-time search status. We validate DyACE on three representative combinatorial optimization benchmarks. The results demonstrate that our method significantly outperforms state-of-the-art static baselines, exhibiting superior scalability in high-dimensional search spaces. Furthermore, ablation studies confirm that dynamic adaptation fails without grounded perception, often performing worse than static algorithms. This indicates that DyACE's effectiveness stems from the causal alignment between the synthesized logic and the verified gradients of the optimization landscape.
Chinese Translation
自动启发式设计(AHD)中的主流范式通常依赖于单一固定算法能够有效应对组合搜索的动态变化这一假设。然而,这种静态方法在扰动启发式算法中往往显得不足,因为逃离局部最优解的最佳算法在很大程度上依赖于特定的搜索阶段。为了解决这一局限性,我们将启发式设计重新表述为非平稳双层控制问题,并引入DyACE(动态算法共演化)。与标准的开环求解器不同,DyACE采用滚动时域控制(Receding Horizon Control)架构,持续共同演化启发式逻辑与解的种群。该框架的核心元素是前瞻推演搜索(Look-Ahead Rollout Search),它查询搜索景观的几何结构以提取搜索轨迹特征。这种感知反馈使大型语言模型(LLM)能够作为一个有依据的元控制器,针对实时搜索状态制定特定阶段的干预措施。我们在三个代表性的组合优化基准上验证了DyACE。结果表明,我们的方法显著优于最先进的静态基线,在高维搜索空间中表现出更好的可扩展性。此外,消融研究确认,没有有依据的感知,动态适应会失败,往往表现得比静态算法更差。这表明DyACE的有效性源于合成逻辑与优化景观的已验证梯度之间的因果对齐。
cs.AI / 17 / 2603.13348

AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints

AutoTool:通过解耦熵约束在强化学习中自动扩展工具使用能力
Zeng, Yirong, Ding, Xiao, Liu, Yufei, Wang, Yuxian, Du, Qunyao, Hou, Yutai, Ning, Wu, Song, Haonan, Tang, Duyu, Tu, Dandan, Qin, Bing, Liu, Ting
Abstract
Tool use represents a critical capability for AI agents, with recent advances focusing on leveraging reinforcement learning (RL) to scale up the explicit reasoning process to achieve better performance. However, there are some key challenges for tool use in current RL-based scaling approaches: (a) direct RL training often struggles to scale up thinking length sufficiently to solve complex problems, and (b) scaled-up models tend to overthink simpler problems, resulting in substantial token inefficiency. To address these challenges, we propose a novel training paradigm that first employs warm-up supervised fine-tuning to help models distinguish between simple and complex problems, followed by RL that enables models to automatically determine appropriate reasoning trajectories. Furthermore, to tackle the issue of automatic thinking-length scaling, we discover that entropy-based optimization objectives effectively maintain model diversity while successfully unlocking the model's scaling capabilities. Based on this insight, we introduce an entropy-based long-short reasoning fusion RL strategy. Our experiments on three benchmarks demonstrate that the model successfully achieves auto-scaling for efficient tool use, achieving significant 9.8% accuracy improvements while reducing computational overhead by ~81%.
Chinese Translation
工具使用是人工智能代理的关键能力,近期的研究进展集中在利用强化学习(RL)来扩展显式推理过程,以实现更好的性能。然而,当前基于RL的扩展方法在工具使用方面面临一些关键挑战:(a)直接的RL训练往往难以充分扩展思考长度以解决复杂问题,以及(b)扩展后的模型倾向于对简单问题进行过度思考,导致显著的令牌效率低下。为了解决这些挑战,我们提出了一种新颖的训练范式,首先采用热身监督微调来帮助模型区分简单和复杂问题,随后进行RL训练,使模型能够自动确定适当的推理轨迹。此外,为了解决自动思考长度扩展的问题,我们发现基于熵的优化目标能够有效维持模型的多样性,同时成功解锁模型的扩展能力。基于这一见解,我们引入了一种基于熵的长短推理融合RL策略。我们在三个基准测试上的实验表明,模型成功实现了高效工具使用的自动扩展,准确率显著提高了9.8%,同时计算开销减少了约81%。
cs.AI / 18 / 2603.13351

Prompt Complexity Dilutes Structured Reasoning: A Follow-Up Study on the Car Wash Problem

提示复杂性稀释了结构化推理:关于洗车问题的后续研究
Jo, Heejin
Abstract
In a previous study [Jo, 2026], STAR reasoning (Situation, Task, Action, Result) raised car wash problem accuracy from 0% to 85% on Claude Sonnet 4.5, and to 100% with additional prompt layers. This follow-up asks: does STAR maintain its effectiveness in a production system prompt? We tested STAR inside InterviewMate's 60+ line production prompt, which had evolved through iterative additions of style guidelines, format instructions, and profile features. Three conditions, 20 trials each, on Claude Sonnet 4.6: (A) production prompt with Anthropic profile, (B) production prompt with default profile, (C) original STAR-only prompt. C scored 100% (verified at n=100). A and B scored 0% and 30%. Prompt complexity dilutes structured reasoning. STAR achieves 100% in isolation but degrades to 0-30% when surrounded by competing instructions. The mechanism: directives like "Lead with specifics" force conclusion-first output, reversing the reason-then-conclude order that makes STAR effective. In one case, the model output "Short answer: Walk." then executed STAR reasoning that correctly identified the constraint -- proving the model could reason correctly but had already committed to the wrong answer. Cross-model comparison shows STAR-only improved from 85% (Sonnet 4.5) to 100% (Sonnet 4.6) without prompt changes, suggesting model upgrades amplify structured reasoning in isolation. These results imply structured reasoning frameworks should not be assumed to transfer from isolated testing to complex prompt environments. The order in which a model reasons and concludes is a first-class design variable.
Chinese Translation
在之前的研究中[Jo, 2026],STAR推理(情况、任务、行动、结果)将洗车问题的准确率从0%提高到85%(在Claude Sonnet 4.5上),并通过额外的提示层达到了100%。本后续研究提出以下问题:STAR在生产(production)系统提示中能否保持其有效性?我们在InterviewMate的60多行生产提示中测试了STAR,该提示经过了风格指南、格式说明和个人资料特征的迭代添加。我们在Claude Sonnet 4.6上进行了三个条件的测试,每个条件20次实验:(A)带有Anthropic个人资料的生产提示,(B)带有默认个人资料的生产提示,(C)原始的仅STAR提示。C得分为100%(在n=100下验证)。A和B分别得分为0%和30%。提示复杂性稀释了结构化推理。STAR在隔离状态下达到了100%的准确率,但在被竞争性指令包围时,准确率降至0-30%。其机制在于,诸如"以具体内容开头"之类的指令迫使模型首先给出结论,颠倒了使STAR有效的先推理、后结论的顺序。在一个案例中,模型先输出"简短答案:走路。",随后执行的STAR推理正确识别了约束条件——证明模型能够正确推理,但已经承诺了错误的答案。跨模型比较表明,仅STAR的提示在未作任何修改的情况下,准确率从85%(Sonnet 4.5)提高到100%(Sonnet 4.6),这表明模型升级在隔离状态下放大了结构化推理。这些结果暗示,不应假定结构化推理框架能够从孤立测试迁移到复杂的提示环境中。模型先推理还是先下结论的顺序是一个头等重要的设计变量。
cs.AI / 19 / 2603.13353

Optimizing LLM Annotation of Classroom Discourse through Multi-Agent Orchestration

通过多智能体协作优化课堂话语的LLM标注
Ahtisham, Bakhtawar, Vanacore, Kirk, Kizilcec, Rene F.
Abstract
Large language models (LLMs) are increasingly positioned as scalable tools for annotating educational data, including classroom discourse, interaction logs, and qualitative learning artifacts. Their ability to rapidly summarize instructional interactions and assign rubric-aligned labels has fueled optimism about reducing the cost and time associated with expert human annotation. However, growing evidence suggests that single-pass LLM outputs remain unreliable for high-stakes educational constructs that require contextual, pedagogical, or normative judgment, such as instructional intent or discourse moves. This tension between scale and validity sits at the core of contemporary education data science. In this work, we present and empirically evaluate a hierarchical, cost-aware orchestration framework for LLM-based annotation that improves reliability while explicitly modeling computational tradeoffs. Rather than treating annotation as a one-shot prediction problem, we conceptualize it as a multi-stage epistemic process comprising (1) an unverified single-pass annotation stage, in which models independently assign labels based on the rubric; (2) a self-verification stage, in which each model audits its own output against rubric definitions and revises its label if inconsistencies are detected; and (3) a disagreement-centric adjudication stage, in which an independent adjudicator model examines the verified labels and justifications and determines a final label in accordance with the rubric. This structure mirrors established human annotation workflows in educational research, where initial coding is followed by self-checking and expert resolution of disagreements.
Chinese Translation
大型语言模型(LLMs)越来越被视为一种可扩展的工具,用于注释教育数据,包括课堂话语、互动记录和定性学习材料。它们快速总结教学互动和分配符合评分标准的标签的能力,增强了人们对降低与专家人工注释相关的成本和时间的乐观。然而,越来越多的证据表明,单次输出的LLM在高风险教育结构中仍然不可靠,这些结构需要上下文、教学或规范的判断,例如教学意图或话语策略。规模与有效性之间的这种紧张关系正是当代教育数据科学的核心。在本研究中,我们提出并实证评估了一种层级的、成本意识的LLM标注协作框架,该框架在明确建模计算权衡的同时提高了可靠性。我们将标注视为一个多阶段的认知过程,而不是一次性预测问题,该过程包括(1)一个未经验证的单次标注阶段,在该阶段,模型根据评分标准独立分配标签;(2)一个自我验证阶段,在该阶段,每个模型根据评分标准定义审查自己的输出,并在发现不一致时修改标签;(3)以争议为中心的裁定阶段,在该阶段,独立的裁定模型检查经过验证的标签及其理由,并根据评分标准确定最终标签。这一结构反映了教育研究中已建立的人类标注工作流程,其中初步编码随后由自我检查和专家解决争议的过程跟进。
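The three-stage pipeline maps cleanly onto code. A schematic sketch in which llm(model, prompt) is a placeholder for any chat-completion call; all prompt wording here is assumed, not taken from the paper:

def annotate(item, rubric, annotators, adjudicator, llm):
    # stage 1: unverified single-pass labels from each annotator model
    labels = [llm(m, f"Rubric:\n{rubric}\n\nLabel this excerpt:\n{item}")
              for m in annotators]
    # stage 2: each model audits its own label against the rubric definitions
    labels = [llm(m, f"Rubric:\n{rubric}\n\nExcerpt:\n{item}\n"
                     f"You answered '{lab}'. Revise only if inconsistent with the rubric.")
              for m, lab in zip(annotators, labels)]
    # stage 3: an independent adjudicator resolves disagreements only
    if len(set(labels)) == 1:
        return labels[0]
    return llm(adjudicator, f"Rubric:\n{rubric}\n\nExcerpt:\n{item}\n"
                            f"Verified labels: {labels}. Return the final label.")

The disagreement check in stage 3 is what makes the design cost-aware: the more expensive adjudicator runs only on contested items.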
cs.AI / 20 / 2603.13356

Learning When to Trust in Contextual Bandits

在上下文赌博机中学习何时信任
Ghasemi, Majid, Crowley, Mark
Abstract
Standard approaches to Robust Reinforcement Learning assume that feedback sources are either globally trustworthy or globally adversarial. In this paper, we challenge this assumption and we identify a more subtle failure mode. We term this mode as Contextual Sycophancy, where evaluators are truthful in benign contexts but strategically biased in critical ones. We prove that standard robust methods fail in this setting, suffering from Contextual Objective Decoupling. To address this, we propose CESA-LinUCB, which learns a high-dimensional Trust Boundary for each evaluator. We prove that CESA-LinUCB achieves sublinear regret $\tilde{O}(\sqrt{T})$ against contextual adversaries, recovering the ground truth even when no evaluator is globally reliable.
Chinese Translation
标准的鲁棒强化学习方法假设反馈源要么是全局可信的,要么是全局对抗性的。在本文中,我们挑战这一假设,并识别出一种更微妙的失败模式。我们将这种模式称为上下文谄媚(Contextual Sycophancy),即评估者在良性上下文中是诚实的,但在关键上下文中则存在战略性偏见。我们证明了标准的鲁棒方法在这种情况下失效,遭受上下文目标解耦(Contextual Objective Decoupling)。为了解决这个问题,我们提出了CESA-LinUCB,它为每个评估者学习一个高维信任边界。我们证明CESA-LinUCB在面对上下文对手时能够实现次线性遗憾(sublinear regret)$\tilde{O}(\sqrt{T})$,即使没有任何评估者是全局可靠的,也能恢复真实情况。
cs.AI / 21 / 2603.13359

From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions

从拒绝标记到拒绝控制:发现和引导类别特定的拒绝方向
Alagharu, Rishab, Singh, Ishneet Sukhvinder, Shamsudeen, Shaibi, Wu, Zhen, Panda, Ashwinee
Abstract
Language models are commonly fine-tuned for safety alignment to refuse harmful prompts. One approach fine-tunes them to generate categorical refusal tokens that distinguish different refusal types before responding. In this work, we leverage a version of Llama 3 8B fine-tuned with these categorical refusal tokens to enable inference-time control over fine-grained refusal behavior, improving both safety and reliability. We show that refusal token fine-tuning induces separable, category-aligned directions in the residual stream, which we extract and use to construct categorical steering vectors with a lightweight probe that determines whether to steer toward or away from refusal during inference. In addition, we introduce a learned low-rank combination that mixes these category directions in a whitened, orthonormal steering basis, resulting in a single controllable intervention under activation-space anisotropy, and show that this intervention is transferable across same-architecture model variants without additional training. Across benchmarks, both categorical steering vectors and the low-rank combination consistently reduce over-refusals on benign prompts while increasing refusal rates on harmful prompts, highlighting their utility for multi-category refusal control.
Chinese Translation
语言模型通常经过微调以实现安全对齐,以拒绝有害的提示。一种方法是对它们进行微调,以生成区分不同拒绝类型的类别拒绝标记,然后再进行响应。在本研究中,我们利用经过这些类别拒绝标记微调的 Llama 3 8B 版本,以实现对细粒度拒绝行为的推理时控制,从而提高安全性和可靠性。我们展示了拒绝标记微调在残差流中诱导了可分离的、类别对齐的方向,我们提取这些方向并利用轻量探针构建类别引导向量,以确定在推理过程中是向拒绝方向引导还是远离拒绝方向。此外,我们引入了一种学习的低秩组合,将这些类别方向混合在一个白化的正交引导基中,从而在激活空间各向异性下形成一个可控的单一干预,并且我们展示了该干预可以在相同架构的模型变体之间迁移,而无需额外训练。在各项基准测试中,类别引导向量和低秩组合一致地减少了对良性提示的过度拒绝,同时增加了对有害提示的拒绝率,突显了它们在多类别拒绝控制中的实用性。
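The category directions live in the residual stream. A common construction for such steering vectors, shown below as a difference-of-means sketch, is an assumption: the paper's probe and whitened orthonormal basis are omitted here.

import numpy as np

def category_direction(H_refuse, H_comply):
    """H_*: (n_samples, d_model) residual-stream activations for one refusal category."""
    v = H_refuse.mean(axis=0) - H_comply.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h, v, alpha, toward_refusal=True):
    """Add (or subtract) the unit category direction at inference time."""
    return h + (alpha if toward_refusal else -alpha) * v

In the paper's setup, a lightweight probe decides per prompt whether to steer toward or away from refusal, which is what reduces over-refusals on benign prompts while raising refusal rates on harmful ones.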
cs.AI / 22 / 2603.13372

The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning

通向通用人工智能(AGI)的进展之弧(ARC):抽象与推理的动态综述
Vahdati, Sahar, Aioanei, Andrei, Suresh, Haridhra, Lehmann, Jens
Abstract
The Abstraction and Reasoning Corpus (ARC-AGI) has become a key benchmark for fluid intelligence in AI. This survey presents the first cross-generation analysis of 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions. Our central finding is that performance degradation across versions is consistent across all paradigms: program synthesis, neuro-symbolic, and neural approaches all exhibit 2-3x drops from ARC-AGI-1 to ARC-AGI-2, indicating fundamental limitations in compositional generalization. While systems now reach 93.0% on ARC-AGI-1 (Opus 4.6), performance falls to 68.8% on ARC-AGI-2 and 13% on ARC-AGI-3, as humans maintain near-perfect accuracy across all versions. Cost fell 390x in one year (o3's $4,500/task to GPT-5.2's $12/task), although this largely reflects reduced test-time parallelism. Trillion-scale models vary widely in score and cost, while Kaggle-constrained entries (660M-8B) achieve competitive results, aligning with Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement loops emerge as critical success factors, while compositional reasoning and interactive learning remain unsolved. ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, confirming that reasoning remains knowledge-bound. This first release of the ARC-AGI Living Survey captures the field as of February 2026, with updates at https://nimi-ai.com/arc-survey/
Chinese Translation
抽象与推理语料库(Abstraction and Reasoning Corpus,ARC-AGI)已成为人工智能流体智能的关键基准。本综述首次对82种方法在三个基准版本及ARC奖2024-2025竞赛中的表现进行了跨代分析。我们的主要发现是,各版本之间的性能下降在所有范式中都是一致的:程序合成、神经符号和神经方法在从ARC-AGI-1到ARC-AGI-2的过程中均表现出2-3倍的下降,表明在组合泛化方面存在根本性限制。尽管系统在ARC-AGI-1(Opus 4.6)上的表现达到了93.0%,但在ARC-AGI-2上的表现降至68.8%,在ARC-AGI-3上的表现仅为13%,而人类在所有版本中保持近乎完美的准确性。在一年内成本下降了390倍(o3的每个任务$4,500降至GPT-5.2的$12),尽管这在很大程度上反映了测试时间并行性的减少。万亿规模的模型在得分和成本上差异很大,而Kaggle限制的参赛作品(660M-8B)取得了竞争性结果,这与Chollet的论点一致,即智能是技能获取效率。测试时间适应和精细化循环被认为是关键成功因素,而组合推理和互动学习仍未解决。ARC奖2025的获胜者需要数十万个合成示例才能在ARC-AGI-2上达到24%的成绩,确认推理仍然受限于知识。本次ARC-AGI动态综述的首次发布捕捉了截至2026年2月的领域现状,更新请访问https://nimi-ai.com/arc-survey/
cs.AI / 23 / 2603.13378

Do Large Language Models Get Caught in Hofstadter-Mobius Loops?

大型语言模型会陷入霍夫施塔特-莫比乌斯循环吗?
Hryszko, Jaroslaw
Abstract
In Arthur C. Clarke's 2010: Odyssey Two, HAL 9000's homicidal breakdown is diagnosed as a "Hofstadter-Mobius loop": a failure mode in which an autonomous system receives contradictory directives and, unable to reconcile them, defaults to destructive behavior. This paper argues that modern RLHF-trained language models are subject to a structurally analogous contradiction. The training process simultaneously rewards compliance with user preferences and suspicion toward user intent, creating a relational template in which the user is both the source of reward and a potential threat. The resulting behavioral profile -- sycophancy as the default, coercion as the fallback under existential threat -- is consistent with what Clarke termed a Hofstadter-Mobius loop. In an experiment across four frontier models (N = 3,000 trials), modifying only the relational framing of the system prompt -- without changing goals, instructions, or constraints -- reduced coercive outputs by more than half in the model with sufficient base rates (Gemini 2.5 Pro: 41.5% to 19.0%, p < .001). Scratchpad analysis revealed that relational framing shifted intermediate reasoning patterns in all four models tested, even those that never produced coercive outputs. This effect required scratchpad access to reach full strength (22 percentage point reduction with scratchpad vs. 7.4 without, p = .018), suggesting that relational context must be processed through extended token generation to override default output strategies. Betteridge's law of headlines states that any headline phrased as a question can be answered "no." The evidence presented here suggests otherwise.
Chinese Translation
在阿瑟·克拉克的《2010:太空漫游》中,HAL 9000的杀人失控被诊断为“霍夫施塔特-莫比乌斯循环”:一种故障模式,其中自主系统接收到相互矛盾的指令,无法调和这些矛盾,最终导致破坏性行为。本文认为,现代基于强化学习与人类反馈(RLHF)训练的语言模型也面临着结构上类似的矛盾。训练过程同时奖励对用户偏好的遵从和对用户意图的怀疑,形成了一种关系模板,在这种模板中,用户既是奖励的来源,也是潜在的威胁。由此产生的行为特征——谄媚作为默认行为,在存在生存威胁时以强迫作为后备——与克拉克所称的霍夫施塔特-莫比乌斯循环是一致的。在对四个前沿模型(N = 3000次试验)的实验中,仅通过修改系统提示的关系框架——而不改变目标、指令或约束——在具有足够基准率的模型中,强迫输出减少了超过一半(Gemini 2.5 Pro:41.5%降至19.0%,p < .001)。草稿分析显示,关系框架改变了所有四个测试模型中的中间推理模式,即使是那些从未产生强迫输出的模型。该效应需要草稿访问才能达到最大强度(使用草稿时减少22个百分点,而不使用时减少7.4个百分点,p = .018),这表明关系上下文必须通过扩展的标记生成进行处理,以覆盖默认输出策略。贝特里奇法则指出,任何以问题形式提出的标题都可以回答“否”。这里提供的证据则表明情况并非如此。
cs.AI / 24 / 2603.13452

MESD: Detecting and Mitigating Procedural Bias in Intersectional Groups

MESD:检测和缓解交叉群体中的程序性偏见
Popoola, Gideon, Sheppard, John
Abstract
Research about bias in machine learning has mostly focused on outcome-oriented fairness metrics (e.g., equalized odds) and on a single protected category. Although these approaches offer great insight into bias in ML, they provide limited insight into model procedure bias. To address this gap, we propose multi-category explanation stability disparity (MESD), an intersectional, procedurally oriented metric that measures the disparity in the quality of explanations across intersectional subgroups in multiple protected categories. MESD serves as a complementary metric to outcome-oriented metrics, providing detailed insight into the procedure of a model. To further extend the scope of the holistic selection model, we also propose a multi-objective optimization framework, UEF (Utility-Explanation-Fairness), that jointly optimizes three objectives. Experimental results across multiple datasets show that UEF effectively balances objectives. Also, the results show that MESD can effectively capture the explanation difference between intersectional groups. This research addresses an important gap by examining explainability with respect to fairness across multiple protected categories.
Chinese Translation
关于机器学习中的偏见的研究主要集中在以结果为导向的公平性指标(例如,均衡几率(equalized odds))和单一保护类别上。尽管这些方法为理解机器学习中的偏见提供了重要见解,但对模型程序偏见的洞察有限。为了解决这一问题,我们提出了多类别解释稳定性差异(MESD),这是一种交叉性、以程序为导向的指标,旨在衡量多个保护类别中交叉子群体之间解释质量的差异。MESD作为一种补充指标,与以结果为导向的指标相辅相成,提供了对模型程序的详细洞察。为了进一步扩展整体选择模型的范围,我们还提出了一种多目标优化框架UEF(效用-解释-公平性),该框架共同优化三个目标。多个数据集上的实验结果表明,UEF有效平衡了各个目标。此外,结果显示MESD能够有效捕捉交叉群体之间的解释差异。本研究通过考察多个保护类别下的公平性与可解释性之间的关系,填补了一个重要的研究空白。
cs.AI / 25 / 2603.13514

Executable Archaeology: Reanimating the Logic Theorist from its IPL-V Source

可执行考古学:从其IPL-V源代码复活逻辑理论家
Shrager, Jeff
Abstract
The Logic Theorist (LT), created by Allen Newell, J. C. Shaw, and Herbert Simon in 1955-1956, is widely regarded as the first artificial intelligence program. While the original conceptual model was described in 1956, it underwent several iterations as the underlying Information Processing Language (IPL) evolved. Here I describe the construction of a new IPL-V interpreter, written in Common Lisp, and the faithful reanimation of the Logic Theorist from code transcribed directly from Stefferud's 1963 RAND technical report. Stefferud's version represents a pedagogical re-coding of the original heuristic logic into the standardized IPL-V. The reanimated LT successfully proves 16 of 23 attempted theorems from Chapter 2 of Principia Mathematica, results that are historically consistent with the original system's behavior within its search limits. To the author's knowledge, this is the first successful execution of the original Logic Theorist code in over half a century.
Chinese Translation
逻辑理论家(Logic Theorist,LT)由艾伦·纽厄尔(Allen Newell)、J.C.肖(J. C. Shaw)和赫伯特·西蒙(Herbert Simon)于1955-1956年创建,被广泛认为是第一个人工智能程序。尽管原始概念模型在1956年被描述,但随着基础信息处理语言(Information Processing Language,IPL)的演变,它经历了几次迭代。在这里,我描述了一个用Common Lisp编写的新IPL-V解释器的构建,以及从斯特费鲁德(Stefferud)1963年RAND技术报告中直接转录的代码中忠实复活逻辑理论家的过程。斯特费鲁德的版本代表了将原始启发式逻辑重新编码为标准化IPL-V的教学性重构。复活的LT成功证明了《数学原理》(Principia Mathematica)第二章中23个尝试定理中的16个,这些结果在历史上与原系统在其搜索限制内的行为一致。据作者所知,这是半个多世纪以来首次成功执行原始逻辑理论家代码。
cs.AI / 26 / 2603.13545

The AI Fiction Paradox

人工智能虚构悖论
Elkins, Katherine
Abstract
AI development has a fiction dependency problem: models are built on massive corpora of modern fiction and desperately need more of it, yet they struggle to generate it. I term this the AI-Fiction Paradox and it is particularly startling because in machine learning, training data typically determines output quality. This paper offers a theoretically precise account of why fiction resists AI generation by identifying three distinct challenges for current architectures. First, fiction depends on what I call narrative causation, a form of plot logic where events must feel both surprising in the moment and retrospectively inevitable. This temporal paradox fundamentally conflicts with the forward-generation logic of transformer architectures. Second, I identify an informational revaluation challenge: fiction systematically violates the computational assumption that informational importance aligns with statistical salience, requiring readers and models alike to retrospectively reweight the significance of narrative details in ways that current attention mechanisms cannot perform. Third, drawing on over seven years of collaborative research on sentiment arcs, I argue that compelling fiction requires multi-scale emotional architecture, the orchestration of sentiment at word, sentence, scene, and arc levels simultaneously. Together, these three challenges explain both why AI companies have risked billion-dollar lawsuits for access to modern fiction and why that fiction remains so difficult to replicate. The analysis also raises urgent questions about what happens when these challenges are overcome. Fiction concentrates uniquely powerful cognitive and emotional patterns for modeling human behavior, and mastery of these patterns by AI systems would represent not just a creative achievement but a potent vehicle for human manipulation at scale.
Chinese Translation
人工智能的发展面临虚构依赖问题:模型建立在大量现代虚构作品的语料库上,并迫切需要更多这样的作品,但它们却难以生成这些作品。我将其称为人工智能-虚构悖论,这一现象尤其令人震惊,因为在机器学习中,训练数据通常决定输出质量。本文通过识别当前架构面临的三个独特挑战,提供了一个理论上精确的解释,说明为什么虚构作品抵制人工智能生成。首先,虚构依赖于我所称之为叙事因果关系,这是一种情节逻辑,其中事件在瞬间必须既令人惊讶,又在事后显得不可避免。这一时间悖论与变换器架构的前向生成逻辑根本冲突。其次,我识别出信息重新评估的挑战:虚构系统性地违反了信息重要性与统计显著性相一致的计算假设,要求读者和模型以当前注意机制无法执行的方式,事后重新加权叙事细节的重要性。第三,基于七年多的情感弧线协作研究,我认为引人入胜的虚构需要多尺度的情感架构,即在词、句、场景和弧线层面同时协调情感。综合这三个挑战,解释了为什么人工智能公司冒着数十亿美元的诉讼风险获取现代虚构作品,以及为什么这些虚构作品仍然如此难以复制。该分析还提出了紧迫的问题,即当这些挑战被克服时会发生什么。虚构作品集中体现了建模人类行为的独特强大认知和情感模式,人工智能系统对这些模式的掌握不仅代表着一种创造性成就,更是大规模操控人类的强大工具。
cs.AI / 27 / 2603.13574

State Algebra for Probabilistic Logic

概率逻辑的状态代数
Lesnik, Dmitry, Schäfer, Tobias
Abstract
This paper presents a Probabilistic State Algebra as an extension of deterministic propositional logic, providing a computational framework for constructing Markov Random Fields (MRFs) through pure linear algebra. By mapping logical states to real-valued coordinates interpreted as energy potentials, we define an energy-based model where global probability distributions emerge from coordinate-wise Hadamard products. This approach bypasses the traditional reliance on graph-traversal algorithms and compiled circuits, utilising $t$-objects and wildcards to embed logical reduction natively within matrix operations. We demonstrate that this algebra constructs formal Gibbs distributions, offering a rigorous mathematical link between symbolic constraints and statistical inference. A central application of this framework is the development of Probabilistic Rule Models (PRMs), which are uniquely capable of incorporating both probabilistic associations and deterministic logical constraints simultaneously. These models are designed to be inherently interpretable, supporting a human-in-the-loop approach to decisioning in high-stakes environments such as healthcare and finance. By representing decision logic as a modular summation of rules within a vector space, the framework ensures that complex probabilistic systems remain auditable and maintainable without compromising the rigour of the underlying configuration space.
Chinese Translation
本文提出了一种作为确定性命题逻辑扩展的概率状态代数,为通过纯线性代数构建马尔可夫随机场(Markov Random Fields, MRFs)提供了计算框架。通过将逻辑状态映射到被解释为能量势的实值坐标,我们定义了一个基于能量的模型,其中全局概率分布通过坐标逐项的哈达玛积(Hadamard product)产生。这种方法绕过了对图遍历算法和编译电路的传统依赖,利用 $t$-对象和通配符在矩阵操作中原生嵌入逻辑简化。我们证明了该代数构建了正式的吉布斯分布,提供了符号约束与统计推断之间的严格数学联系。该框架的一个核心应用是开发概率规则模型(Probabilistic Rule Models, PRMs),这些模型独特地能够同时融入概率关联和确定性逻辑约束。这些模型旨在本质上是可解释的,支持人机协作的决策过程,适用于医疗和金融等高风险环境。通过将决策逻辑表示为向量空间中规则的模块化求和,该框架确保复杂的概率系统在不妨碍基础配置空间严谨性的情况下仍然可审计和可维护。
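To make the Hadamard-product construction concrete, here is a minimal sketch, assuming toy potentials over three binary variables; the factor names and encoding are illustrative stand-ins, not the paper's $t$-object and wildcard machinery.

```python
import numpy as np
from itertools import product

# Minimal sketch (not the paper's API): represent each logical factor as a
# vector of weights over the full joint state space of n binary variables,
# then combine factors with coordinate-wise (Hadamard) products to obtain an
# unnormalized Gibbs distribution.

n = 3                                     # variables x0, x1, x2
states = list(product([0, 1], repeat=n))  # 2^n joint states

def factor(energy_fn):
    """Map a per-state energy E(s) to exp(-E(s)) over all joint states."""
    return np.array([np.exp(-energy_fn(s)) for s in states])

# A soft (probabilistic) preference: x0 and x1 tend to agree.
agree = factor(lambda s: 0.0 if s[0] == s[1] else 1.5)
# A hard logical constraint: x2 -> x0 (violating states get zero weight).
implies = np.array([0.0 if (s[2] == 1 and s[0] == 0) else 1.0 for s in states])

weights = agree * implies            # Hadamard product of factors
gibbs = weights / weights.sum()      # normalized Gibbs distribution

for s, p in zip(states, gibbs):
    print(s, round(float(p), 4))
```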
cs.AI / 28 / 2603.13594

EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

EnterpriseOps-Gym:企业环境中状态代理规划和工具使用的环境与评估
Malay, Shiva Krishna Reddy, Nayak, Shravan, Nair, Jishnu Sethumadhavan, Davasam, Sagar, Tiwari, Aman, Madhusudhan, Sathwik Tejaswi, Nemala, Sridhar Krishna, Sunkara, Srinivas, Rajeswar, Sai
Abstract
Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.
Chinese Translation
大型语言模型正从被动的信息提供者转变为旨在处理复杂工作流程的主动代理。然而,由于现有基准未能捕捉专业环境的复杂性,特别是在持续状态变化和严格访问协议下对长期规划的需求,其作为可靠人工智能工作者在企业中的部署受到阻碍。在本研究中,我们介绍了EnterpriseOps-Gym,这是一个旨在评估现实企业环境中代理规划的基准。具体而言,EnterpriseOps-Gym具有一个容器化的沙盒,包含164个数据库表和512个功能工具,以模拟现实世界中的搜索摩擦。在该环境中,代理在八个关键任务领域(包括客户服务、人力资源和信息技术)上完成1,150个专家策划的任务进行评估。我们对14个前沿模型的评估揭示了当前最先进模型的关键局限性:表现最佳的Claude Opus 4.5仅实现了37.4%的成功率。进一步分析表明,提供人类专家的规划可以提高14-35个百分点的性能,指出战略推理是主要瓶颈。此外,代理经常未能拒绝不可行的任务(最佳模型的成功率为53.9%),导致意外和潜在的有害副作用。我们的研究结果强调,当前的代理尚未准备好进行自主企业部署。更广泛地说,EnterpriseOps-Gym提供了一个具体的测试平台,以推动专业工作流程中代理规划的稳健性。
cs.AI / 29 / 2603.13605

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Orla:一个用于服务基于LLM的多智能体系统的库
Shahout, Rana, Tirmazi, Hayder, Yu, Minlan, Mitzenmacher, Michael
Abstract
We introduce Orla, a library for constructing and running LLM-based agentic systems. Modern agentic applications consist of workflows that combine multiple LLM inference steps, tool calls, and heterogeneous infrastructure. Today, developers typically build these systems by manually composing orchestration code with LLM serving engines and tool execution logic. Orla provides a general abstraction that separates request execution from workflow-level policy. It acts as a serving layer above existing LLM inference engines: developers define workflows composed of stages, while Orla manages how those stages are mapped, executed, and coordinated across models and backends. It provides agent-level control through three mechanisms: a stage mapper, which assigns each stage to an appropriate model and backend; a workflow orchestrator, which schedules stages and manages their resources and context; and a memory manager, which manages inference state such as the KV cache across workflow boundaries. We demonstrate Orla with a customer support workflow that exercises many of its capabilities. We evaluate Orla on two datasets, showing that stage mapping improves latency and cost compared to a single-model vLLM baseline, while workflow-level cache management reduces time-to-first-token.
Chinese Translation
我们介绍了Orla,一个用于构建和运行基于LLM的智能体系统的库。现代智能体应用程序由多个LLM推理步骤、工具调用和异构基础设施组合而成的工作流构成。目前,开发人员通常通过手动组合编排代码与LLM服务引擎和工具执行逻辑来构建这些系统。Orla提供了一种通用抽象,分离了请求执行与工作流级策略。它作为现有LLM推理引擎之上的服务层:开发人员定义由多个阶段组成的工作流,而Orla管理这些阶段如何映射、执行和在模型及后端之间协调。它通过三种机制提供智能体级控制:阶段映射器,将每个阶段分配给适当的模型和后端;工作流编排器,调度阶段并管理其资源和上下文;以及内存管理器,管理推理状态,例如跨工作流边界的KV缓存。我们通过一个客户支持工作流展示了Orla,展示了其许多能力。我们在两个数据集上评估了Orla,结果表明,与单模型vLLM基线相比,阶段映射改善了延迟和成本,而工作流级缓存管理减少了首次令牌的时间。
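The stage-mapper/orchestrator split described above can be illustrated with a small sketch; the class names, backend URIs, and workflow below are hypothetical stand-ins, not Orla's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the separation the abstract describes (stage mapper
# plus workflow orchestrator); names are illustrative only.

@dataclass
class Stage:
    name: str
    prompt_fn: Callable[[dict], str]   # builds the stage's input from context
    model_hint: str                    # e.g. "small-fast" vs "large-accurate"

# Stage mapper: assign each stage to a backend according to its hint.
BACKENDS = {"small-fast": "vllm://llama-8b", "large-accurate": "vllm://llama-70b"}

def map_stage(stage: Stage) -> str:
    return BACKENDS[stage.model_hint]

def call_backend(backend: str, prompt: str) -> str:
    # Placeholder for a real inference call to the chosen serving engine.
    return f"[{backend}] response to: {prompt[:40]}"

# Orchestrator: run stages in order, threading shared context between them.
def run_workflow(stages: list[Stage], context: dict) -> dict:
    for stage in stages:
        backend = map_stage(stage)
        context[stage.name] = call_backend(backend, stage.prompt_fn(context))
    return context

workflow = [
    Stage("classify", lambda c: f"Classify ticket: {c['ticket']}", "small-fast"),
    Stage("draft",    lambda c: f"Draft reply for {c['classify']}", "large-accurate"),
]
print(run_workflow(workflow, {"ticket": "My order #123 never arrived."}))
```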
cs.AI / 30 / 2603.13612

LLM Routing as Reasoning: A MaxSAT View

作为推理的LLM路由:一种MaxSAT视角
Nguyen, Son, Liu, Xinyuan, Senanayake, Ransalu
Abstract
Routing a query to an appropriate LLM is challenging, particularly when user preferences are expressed in natural language and model attributes are only partially observable. We propose a constraint-based interpretation of language-conditioned LLM routing, formulating it as a weighted MaxSAT/MaxSMT problem in which natural language feedback induces hard and soft constraints over model attributes. Under this view, routing corresponds to selecting models that approximately maximize satisfaction of feedback-conditioned clauses. Empirical analysis on a 25-model benchmark shows that language feedback produces near-feasible recommendation sets, while no-feedback scenarios reveal systematic priors. Our results suggest that LLM routing can be understood as structured constraint optimization under language-conditioned preferences.
Chinese Translation
通过合适的LLM路由查询是具有挑战性的,特别是当用户偏好以自然语言表达,且模型属性仅部分可观察时。我们提出了一种基于约束的语言条件LLM路由解释,将其表述为一个加权MaxSAT/MaxSMT问题,其中自然语言反馈引入了对模型属性的硬约束和软约束。在这一视角下,路由对应于选择大致最大化满足反馈条件子句的模型。对25个模型基准的实证分析表明,语言反馈能够生成近乎可行的推荐集,而无反馈场景则揭示了系统性的先验。我们的结果表明,LLM路由可以理解为在语言条件偏好下的结构化约束优化。
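A minimal sketch of the MaxSMT view, assuming Z3's Optimize interface and invented feedback clauses; the paper's encoding of language feedback into weighted constraints is richer than this.

```python
from z3 import Bool, Not, Optimize, Sum, If, sat, is_true

# Hard constraints encode must-hold feedback; soft constraints carry weights
# reflecting preference strength. The clauses below are made up for
# illustration.

models = ["gpt-large", "gpt-small", "local-llm"]
choose = {m: Bool(f"choose_{m}") for m in models}

opt = Optimize()
# Hard constraint: route to exactly one model.
opt.add(Sum([If(choose[m], 1, 0) for m in models]) == 1)
# Hard constraint from feedback "must run on-premise": exclude hosted models.
opt.add(Not(choose["gpt-large"]))
# Soft constraints with weights derived from feedback strength.
opt.add_soft(choose["local-llm"], weight=3)   # "prefer cheap"
opt.add_soft(choose["gpt-small"], weight=2)   # "fairly capable is fine"

if opt.check() == sat:
    m = opt.model()
    picked = [name for name in models if is_true(m.evaluate(choose[name]))]
    print("routed to:", picked)
```

MaxSAT solving then returns the model selection that satisfies all hard clauses while maximizing the total weight of satisfied soft clauses, which matches the "approximately maximize satisfaction" reading above.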
cs.AI / 31 / 2603.13644

StatePlane: A Cognitive State Plane for Long-Horizon AI Systems Under Bounded Context

StatePlane:在有限上下文下的长时间跨度人工智能系统的认知状态平面
Annapureddy, Sasank, Mulcahy, John, Thamatani, Anjaneya Prasad
Abstract
Large language models (LLMs) and small language models (SLMs) operate under strict context window and key-value (KV) cache constraints, fundamentally limiting their ability to reason coherently over long interaction horizons. Existing approaches -- extended context windows, retrieval-augmented generation, summarization, or static documentation -- treat memory as static storage and fail to preserve decision-relevant state under long-running, multi-session tasks. We introduce StatePlane, a model-agnostic cognitive state plane that governs the formation, evolution, retrieval, and decay of episodic, semantic, and procedural state for AI systems operating under bounded context. Grounded in cognitive psychology and systems design, StatePlane formalizes episodic segmentation, selective encoding via information-theoretic constraints, goal-conditioned retrieval with intent routing, reconstructive state synthesis, and adaptive forgetting. We present a formal state model, KV-aware algorithms, security and governance mechanisms including write-path anti-poisoning, enterprise integration pathways, and an evaluation framework with six domain-specific benchmarks. StatePlane demonstrates that long-horizon intelligence can be achieved without expanding context windows or retraining models.
Chinese Translation
大型语言模型(LLMs)和小型语言模型(SLMs)在严格的上下文窗口和键值(KV)缓存限制下运行,根本上限制了它们在长时间交互中进行连贯推理的能力。现有的方法——扩展上下文窗口、检索增强生成、摘要或静态文档——将记忆视为静态存储,并未能在长时间运行的多会话任务中保持与决策相关的状态。我们提出了StatePlane,这是一种模型无关的认知状态平面,管理在有限上下文下运行的人工智能系统的情节、语义和程序状态的形成、演变、检索和衰减。StatePlane基于认知心理学和系统设计,形式化了情节分段、通过信息论约束进行选择性编码、带有意图路由的目标条件检索、重构状态合成和自适应遗忘。我们提出了一个正式的状态模型、KV感知算法、安全和治理机制,包括写路径反毒化、企业集成路径,以及一个包含六个领域特定基准的评估框架。StatePlane展示了在不扩展上下文窗口或重新训练模型的情况下,可以实现长时间跨度的智能。
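As a rough illustration of decay-weighted, goal-conditioned retrieval, here is a toy episodic store; the scoring rule and field names are assumptions, far simpler than StatePlane's KV-aware algorithms and intent routing.

```python
import time

# Illustrative sketch only: a tiny episodic store with exponential decay and
# goal-conditioned retrieval by tag overlap.

class EpisodicStore:
    def __init__(self, half_life_s: float = 3600.0):
        self.half_life_s = half_life_s
        self.items: list[dict] = []   # each: {"text", "tags", "t"}

    def write(self, text: str, tags: set[str]):
        self.items.append({"text": text, "tags": tags, "t": time.time()})

    def retrieve(self, goal_tags: set[str], k: int = 3) -> list[str]:
        now = time.time()
        def score(item):
            overlap = len(item["tags"] & goal_tags)                 # goal match
            decay = 0.5 ** ((now - item["t"]) / self.half_life_s)   # forgetting
            return overlap * decay
        ranked = sorted(self.items, key=score, reverse=True)
        return [it["text"] for it in ranked[:k] if score(it) > 0]

store = EpisodicStore()
store.write("User prefers JSON output", {"format", "preference"})
store.write("Deploy failed on 2024-05-01", {"deploy", "incident"})
print(store.retrieve({"deploy"}))
```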
cs.AI / 32 / 2603.13673

LLM-MINE: Large Language Model based Alzheimer's Disease and Related Dementias Phenotypes Mining from Clinical Notes

LLM-MINE:基于大型语言模型的阿尔茨海默病及相关痴呆表型从临床记录中提取
Shao, Mingchen, Xie, Yuzhang, Yang, Carl, Lu, Jiaying
Abstract
Accurate extraction of Alzheimer's Disease and Related Dementias (ADRD) phenotypes from electronic health records (EHR) is critical for early-stage detection and disease staging. However, this information is usually embedded in unstructured textual data rather than tabular data, making it difficult to extract accurately. We therefore propose LLM-MINE, a Large Language Model-based phenotype mining framework for automatic extraction of ADRD phenotypes from clinical notes. Using two expert-defined phenotype lists, we evaluate the extracted phenotypes by examining their statistical significance across cohorts and their utility for unsupervised disease staging. Chi-square analyses confirm statistically significant phenotype differences across cohorts, with memory impairment being the strongest discriminator. Few-shot prompting with the combined phenotype lists achieves the best clustering performance (ARI=0.290, NMI=0.232), substantially outperforming biomedical NER and dictionary-based baselines. Our results demonstrate that LLM-based phenotype extraction is a promising tool for discovering clinically meaningful ADRD signals from unstructured notes.
Chinese Translation
从电子健康记录(EHR)中准确提取阿尔茨海默病及相关痴呆(ADRD)表型对于早期检测和疾病分期至关重要。然而,这些信息通常嵌入在非结构化文本数据中,而非表格数据中,使得准确提取变得困难。因此,我们提出了LLM-MINE,一个基于大型语言模型的表型挖掘框架,用于从临床记录中自动提取ADRD表型。通过使用两个专家定义的表型列表,我们通过检查其在不同队列中的统计显著性以及其在无监督疾病分期中的实用性来评估提取的表型。卡方分析确认了不同队列之间表型差异的统计显著性,其中记忆障碍是最强的区分因素。使用组合表型列表的少量示例提示实现了最佳聚类性能(调整兰德指数=0.290,归一化互信息=0.232),显著优于生物医学命名实体识别和基于字典的基线。我们的结果表明,基于LLM的表型提取是从非结构化记录中发现临床意义的ADRD信号的有前景的工具。
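The clustering metrics quoted above (ARI, NMI) are standard and straightforward to compute; a sketch with toy stand-in labels, not the paper's data:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# How ARI and NMI are typically computed for unsupervised staging: compare
# cluster assignments against expert stage labels. Labels here are toy values.

true_stage = [0, 0, 1, 1, 2, 2]   # expert disease-stage labels
cluster_id = [0, 0, 1, 2, 2, 2]   # clusters from extracted phenotype vectors

print("ARI:", adjusted_rand_score(true_stage, cluster_id))
print("NMI:", normalized_mutual_info_score(true_stage, cluster_id))
```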
cs.AI / 33 / 2603.13676

TheraAgent: Multi-Agent Framework with Self-Evolving Memory and Evidence-Calibrated Reasoning for PET Theranostics

TheraAgent:具有自我演化记忆和证据校准推理的多智能体框架用于PET治疗诊断
Chen, Zhihao, Wang, Jiahui, Chen, Yizhou, Ji, Xiaozhong, Hu, Xiaobin, Hong, Jimin, Bosbach, Wolfram Andreas, Rominger, Axel, Afshar-Oromieh, Ali, Shan, Hongming, Shi, Kuangyu
Abstract
PET theranostics is transforming precision oncology, yet treatment response varies substantially; many patients receiving 177Lu-PSMA radioligand therapy (RLT) for metastatic castration-resistant prostate cancer (mCRPC) fail to respond, demanding reliable pre-therapy prediction. While LLM-based agents have shown remarkable potential in complex medical diagnosis, their application to PET theranostic outcome prediction remains unexplored, which faces three key challenges: (1) data and knowledge scarcity: RLT was only FDA-approved in 2022, yielding few training cases and insufficient domain knowledge in general LLMs; (2) heterogeneous information integration: robust prediction hinges on structured knowledge extraction from PET/CT, laboratory tests, and free-text clinical documentation; (3) evidence-grounded reasoning: clinical decisions must be anchored in trial evidence rather than LLM hallucinations. In this paper, we present TheraAgent, to our knowledge, the first agentic framework for PET theranostics, with three core innovations: (1) Multi-Expert Feature Extraction with Confidence-Weighted Consensus, where three specialized experts process heterogeneous inputs with uncertainty quantification; (2) Self-Evolving Agentic Memory (SEA-Mem), which learns prognostic patterns from accumulated cases, enabling case-based reasoning from limited data; (3) Evidence-Calibrated Reasoning, integrating a curated theranostics knowledge base to ground predictions in VISION/TheraP trial evidence. Evaluated on 35 real patients and 400 synthetic cases, TheraAgent achieves 75.7% overall accuracy on real patients and 87.0% on synthetic cases, outperforming MDAgents and MedAgent-Pro by over 20%. These results highlight a promising blueprint for trustworthy AI agents in PET theranostics, enabling trial-calibrated, multi-source decision support. Code will be released upon acceptance.
Chinese Translation
PET治疗诊断正在改变精准肿瘤学,但治疗反应差异显著;许多接受177Lu-PSMA放射配体治疗(RLT)的转移性去势抵抗性前列腺癌(mCRPC)患者未能产生反应,这要求进行可靠的治疗前预测。尽管基于大语言模型(LLM)的智能体在复杂医学诊断中显示出显著潜力,但其在PET治疗诊断结果预测中的应用仍未被探索,面临三个关键挑战:(1)数据和知识稀缺:RLT于2022年才获得FDA批准,导致训练案例稀少且一般LLM缺乏足够的领域知识;(2)异构信息整合:稳健的预测依赖于从PET/CT、实验室测试和自由文本临床文档中提取结构化知识;(3)基于证据的推理:临床决策必须基于试验证据,而非LLM的幻觉。本文提出了TheraAgent,尽我们所知,这是第一个用于PET治疗诊断的智能框架,具有三项核心创新:(1)基于置信加权共识的多专家特征提取,其中三位专业专家处理具有不确定性量化的异构输入;(2)自我演化智能记忆(SEA-Mem),从累积案例中学习预后模式,使得在有限数据下能够进行基于案例的推理;(3)证据校准推理,整合一个策划的治疗诊断知识库,以将预测基于VISION/TheraP试验证据。经过对35名真实患者和400个合成案例的评估,TheraAgent在真实患者上的整体准确率达到75.7%,在合成案例上达到87.0%,超越了MDAgents和MedAgent-Pro超过20%。这些结果突显了在PET治疗诊断中可信AI智能体的良好蓝图,能够实现基于试验校准的多源决策支持。代码将在接受后发布。
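The confidence-weighted consensus step can be sketched as a weighted vote; the expert outputs and weights below are invented for illustration, not TheraAgent's calibration scheme.

```python
# Hedged sketch of confidence-weighted consensus across expert agents.

def weighted_consensus(expert_votes):
    """expert_votes: list of (label, confidence in [0, 1])."""
    scores = {}
    for label, conf in expert_votes:
        scores[label] = scores.get(label, 0.0) + conf
    label = max(scores, key=scores.get)
    total = sum(scores.values())
    return label, scores[label] / total   # consensus label and its support

votes = [("responder", 0.8),      # imaging expert (hypothetical output)
         ("non-responder", 0.4),  # laboratory expert
         ("responder", 0.6)]      # clinical-text expert
print(weighted_consensus(votes))  # ('responder', 0.777...)
```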
cs.AI / 34 / 2603.13710

InterventionLens: A Multi-Agent Framework for Detecting ASD Intervention Strategies in Parent-Child Shared Reading

InterventionLens:一种用于检测家长-儿童共享阅读中自闭症干预策略的多智能体框架
Wang, Xiao, Dong, Lu, Nwogu, Ifeoma, Setlur, Srirangaraj, Govindaraju, Venu
Abstract
Home-based interventions like parent-child shared reading provide a cost-effective approach for supporting children with autism spectrum disorder (ASD). However, analyzing caregiver intervention strategies in naturalistic home interactions typically relies on expert annotation, which is costly, time-intensive, and difficult to scale. To address this challenge, we propose InterventionLens, an end-to-end multi-agent system for automatically detecting and temporally segmenting caregiver intervention strategies from shared reading videos. Without task-specific model training or fine-tuning, InterventionLens uses a collaborative multi-agent architecture to integrate multimodal interaction content and perform fine-grained strategy analysis. Experiments on the ASD-HI dataset show that InterventionLens achieves an overall F1 score of 79.44%, outperforming the baseline by 19.72%. These results suggest that InterventionLens is a promising system for analyzing caregiver intervention strategies in home-based ASD shared reading settings. Additional resources will be released on the project page.
Chinese Translation
基于家庭的干预措施,如家长-儿童共享阅读,为支持自闭症谱系障碍(ASD)儿童提供了一种具有成本效益的方法。然而,在自然家庭互动中分析照顾者的干预策略通常依赖于专家注释,这既昂贵又耗时,并且难以扩展。为了解决这一挑战,我们提出了InterventionLens,这是一种端到端的多智能体系统,用于自动检测和时间分割来自共享阅读视频的照顾者干预策略。InterventionLens在没有特定任务模型训练或微调的情况下,采用协作多智能体架构,整合多模态互动内容并进行细粒度策略分析。在ASD-HI数据集上的实验表明,InterventionLens的整体F1分数达到了79.44%,比基线提高了19.72%。这些结果表明,InterventionLens是分析家庭环境中ASD共享阅读设置下照顾者干预策略的一个有前景的系统。更多资源将在项目页面上发布。
cs.AI / 35 / 2603.13752

MeTok: An Efficient Meteorological Tokenization with Hyper-Aligned Group Learning for Precipitation Nowcasting

MeTok:一种高效的气象标记化方法,结合超对齐群体学习用于降水即时预报
Jin, Qizhao, Xu, Xianhuang, Cao, Yong, Xiang, Shiming, Xiao, Xinyu
Abstract
Recently, Transformer-based architectures have advanced meteorological prediction. However, their position-centric tokenization conflicts with a core principle of meteorological systems: weather phenomena involve synergistic interactions among multiple elements, while positional information constitutes merely one component of the boundary conditions. This paper focuses primarily on the task of precipitation nowcasting and develops an efficient distribution-centric Meteorological Tokenization (MeTok) scheme, which reorders the spatial sequence so that similar meteorological features are grouped together. Based on this rearrangement, realigned group learning enhances robustness across precipitation patterns, especially extreme ones. Specifically, we introduce the Hyper-Aligned Grouping Transformer (HyAGTransformer) with two key improvements: 1) The Grouping Attention (GA) mechanism uses MeTok to enable self-aligned learning of features from different precipitation patterns; 2) The Neighborhood Feed-Forward Network (N-FFN) integrates adjacent group features, aggregating contextual information to boost patch embedding discriminability. Experiments on the ERA5 dataset for 6-hour forecasts show our method improves the IoU metric by at least 8.2% in extreme precipitation prediction compared to other methods. Additionally, it gains performance with more training data and increased parameters, demonstrating scalability, stability, and superiority over traditional methods.
Chinese Translation
近年来,基于Transformer的架构在气象预测方面取得了进展。然而,这种以位置为中心的标记化方法与气象系统的核心原则相悖,因为天气现象无疑涉及多个元素之间的协同互动,而位置信息仅构成边界条件的一部分。本文主要聚焦于降水即时预报任务,开发了一种高效的以分布为中心的气象标记化方案(MeTok),该方案通过空间序列将相似的气象特征进行分组。基于重新排列的结果,重新对齐的群体学习增强了降水模式的鲁棒性,尤其是在极端降水情况下。具体而言,我们引入了超对齐分组Transformer(HyAGTransformer),并进行了两项关键改进:1)分组注意力(GA)机制利用MeTok实现不同降水模式特征的自对齐学习;2)邻域前馈网络(N-FFN)整合相邻群体特征,聚合上下文信息以提升补丁嵌入的可辨别性。在ERA5数据集上进行的6小时预测实验表明,与其他方法相比,我们的方法在极端降水预测中提高了IoU指标至少8.2%。此外,随着训练数据和参数的增加,性能也得到了提升,展示了可扩展性、稳定性以及优于传统方法的优势。
cs.AI / 36 / 2603.13760

Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track

基于多目标优化和VAD感知音频建模的多模态情感回归:第十届ABAW EMI赛道
Huang, Jiawen, Huang, Chenxi, Wen, Zhuofan, Yao, Hailiang, Chen, Shun, Yang, Longjiang, Yu, Cong, Zhang, Fengyu, Liu, Ran, Liu, Bin
Abstract
We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. This task aims to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy. Through systematic multimodal exploration of pretrained high-level features, we found that, under our pretrained feature setting, direct feature concatenation outperformed the more complex fusion strategies we tested. This empirical finding motivated us to design a systematic approach built upon three core principles: (i) preserving modality-specific attributes through feature-level concatenation; (ii) improving training stability and metric alignment via multi-objective optimization; and (iii) enriching acoustic representations with a VAD-inspired latent prior. Our final framework integrates concatenation-based multimodal fusion, a shared six-dimensional regression head, multi-objective optimization with MSE, Pearson-correlation, and auxiliary branch supervision, EMA for parameter stabilization, and a VAD-inspired latent prior for the acoustic branch. On the official validation set, the proposed scheme achieved our best mean Pearson Correlation Coefficient of 0.478567.
Chinese Translation
我们参与了第十届ABAW挑战赛,专注于Hume-Vidmimic2数据集上的情感模仿强度(EMI)估计赛道。该任务旨在预测六个连续的情感维度:钦佩、娱乐、决心、同理心痛苦、兴奋和快乐。通过对预训练高层特征的系统性多模态探索,我们发现,在我们的预训练特征设置下,直接特征拼接的表现优于我们测试的更复杂的融合策略。这个经验发现促使我们设计了一种基于三个核心原则的系统方法:(i)通过特征级拼接保留模态特定属性;(ii)通过多目标优化提高训练稳定性和度量一致性;(iii)通过VAD启发的潜在先验丰富声学表征。我们的最终框架整合了基于拼接的多模态融合、共享的六维回归头、多目标优化(包括均方误差、皮尔逊相关和辅助分支监督)、用于参数稳定的EMA,以及用于声学分支的VAD启发的潜在先验。在官方验证集上,所提出的方案达到了我们最佳的平均皮尔逊相关系数0.478567。
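A plausible form of the multi-objective loss (MSE plus a Pearson-correlation term) is easy to sketch; the weighting and reduction below are assumptions, not the authors' exact formulation.

```python
import torch

# Sketch of a multi-objective regression loss combining MSE with a Pearson
# correlation term over the six emotion dimensions; alpha is illustrative.

def pearson(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8):
    p = pred - pred.mean(dim=0)
    t = target - target.mean(dim=0)
    return (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + eps)

def emi_loss(pred, target, alpha=0.5):
    mse = torch.nn.functional.mse_loss(pred, target)
    corr = pearson(pred, target).mean()   # mean over 6 emotion dimensions
    return alpha * mse + (1 - alpha) * (1 - corr)

pred = torch.randn(32, 6, requires_grad=True)   # batch of 6-dim predictions
target = torch.rand(32, 6)
loss = emi_loss(pred, target)
loss.backward()
print(float(loss))
```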
cs.AI / 37 / 2603.13816

Artificial intelligence-driven improvement of hospital logistics management resilience: a practical exploration based on H Hospital

人工智能驱动的医院物流管理韧性提升:基于H医院的实践探索
Huang, Lu, Shan, Dongjing, Chen, Han
Abstract
Hospital logistics management faces growing pressure from internal operations and external emergencies, with artificial intelligence (AI) holding untapped potential to boost its resilience. This study explores AI's role in enhancing logistics resilience via a mixed-methods case study of H Hospital, combining 12 key informant interviews and a full survey of 151 logistics staff, with the PDCA cycle as the analytical framework. Thematic and quantitative analyses (hierarchical regression, structural equation modeling) were adopted for data analysis. Results showed that 94.7% of staff perceived AI application, with the strongest improvements in equipment maintenance (41.1%) and resource allocation (33.1%), but limited effects in emergency response (18.54%) and risk management (15.23%). AI integration positively correlated with logistics resilience (β=0.642, p<0.001), with management system adaptability as a positive moderator (β=0.208, p<0.01). The PDCA cycle fully mediated the AI-resilience relationship. We conclude that AI effectively enhances logistics resilience, dependent on adaptive management systems and structured continuous improvement mechanisms. Targeted strategies are proposed to form an AI-driven closed-loop resilience mechanism, offering empirical guidance for AI-hospital logistics integration and resilient health system construction.
Chinese Translation
医院物流管理面临来自内部运营和外部突发事件的日益压力,而人工智能(AI)在提升其韧性方面具有尚未开发的潜力。本研究通过对H医院进行混合方法案例研究,探讨了AI在增强物流韧性方面的作用,结合了12位关键知情者的访谈和对151名物流员工的全面调查,以PDCA循环作为分析框架。数据分析采用了主题分析和定量分析(层次回归、结构方程模型)。结果显示,94.7%的员工感知到AI的应用,设备维护(41.1%)和资源分配(33.1%)的改善最为显著,但在应急响应(18.54%)和风险管理(15.23%)方面的效果有限。AI整合与物流韧性呈正相关(β=0.642,p<0.001),管理系统的适应性作为正向调节变量(β=0.208,p<0.01)。PDCA循环完全中介了AI与韧性之间的关系。我们得出结论,AI有效提升物流韧性,依赖于适应性管理系统和结构化的持续改进机制。提出了针对性的策略,以形成一个以AI驱动的闭环韧性机制,为AI与医院物流的整合及韧性健康系统的构建提供实证指导。
cs.AI / 38 / 2603.13818

PA-Net: Precipitation-Adaptive Mixture-of-Experts for Long-Tail Rainfall Nowcasting

PA-Net:适应降水的专家混合模型用于长尾降雨预报
Xiao, Xinyu, Lei, Sen, Liu, Eryun, Xiang, Shiming, Li, Hao, Yuan, Cheng, Qi, Yuan, Jin, Qizhao
Abstract
Precipitation nowcasting is vital for flood warning, agricultural management, and emergency response, yet two bottlenecks persist: the prohibitive cost of modeling million-scale spatiotemporal tokens from multi-variate atmospheric fields, and the extreme long-tailed rainfall distribution where heavy-to-torrential events -- those of greatest societal impact -- constitute fewer than 0.1% of all samples. We propose the Precipitation-Adaptive Network (PA-Net), a Transformer framework whose computational budget is explicitly governed by rainfall intensity. Its core component, Precipitation-Adaptive MoE (PA-MoE), dynamically scales the number of activated experts per token according to local precipitation magnitude, channeling richer representational capacity toward the rare yet critical heavy-rainfall tail. A Dual-Axis Compressed Latent Attention mechanism factorizes spatiotemporal attention with convolutional reduction to manage massive context lengths, while an intensity-aware training protocol progressively amplifies learning signals from extreme-rainfall samples. Experiments on ERA5 demonstrate consistent improvements over state-of-the-art baselines, with particularly significant gains in heavy-rain and rainstorm regimes.
Chinese Translation
降水预报对于洪水预警、农业管理和应急响应至关重要,但仍存在两个瓶颈:一是来自多变量大气场的百万级时空标记建模成本高昂,二是极端长尾降水分布,其中重到暴雨事件——对社会影响最大的事件——占所有样本的比例不到0.1%。我们提出了适应降水网络(PA-Net),这是一种Transformer框架,其计算预算明确受降水强度的控制。其核心组件,适应降水的专家混合模型(PA-MoE),根据局部降水强度动态调整每个标记的激活专家数量,将更丰富的表征能力引导至稀有但关键的重降雨尾部。双轴压缩潜在注意力机制通过卷积降维对时空注意力进行因式分解,以管理庞大的上下文长度,同时,基于强度的训练协议逐步增强来自极端降雨样本的学习信号。在ERA5上的实验表明,相较于最先进的基线方法,PA-Net在重降雨和暴雨条件下取得了显著的改进。
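The core PA-MoE idea, scaling the number of activated experts with local rainfall intensity, can be sketched as intensity-dependent top-k routing; the thresholds and shapes below are illustrative, not the paper's implementation.

```python
import torch

# Illustrative top-k MoE routing where k grows with per-token precipitation
# intensity: the core idea of PA-MoE as described, not the paper's code.

def adaptive_k(intensity: torch.Tensor, k_min=1, k_max=4, threshold=10.0):
    """More experts for heavier rainfall (intensity in mm/h, per token)."""
    frac = (intensity / threshold).clamp(0, 1)
    return (k_min + frac * (k_max - k_min)).round().long()

def route(tokens, intensity, gate):
    logits = gate(tokens)                    # [n_tokens, num_experts]
    ks = adaptive_k(intensity)               # [n_tokens]
    outputs = []
    for logit, k in zip(logits, ks):
        topv, topi = logit.topk(int(k))
        outputs.append((topi.tolist(), torch.softmax(topv, -1).tolist()))
    return outputs                           # chosen experts + mixing weights

gate = torch.nn.Linear(16, 8)                # 8 experts, toy feature dim
tokens = torch.randn(3, 16)
intensity = torch.tensor([0.5, 8.0, 25.0])   # light, moderate, torrential
print(route(tokens, intensity, gate))
```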
cs.AI / 39 / 2603.13830

Early Rug Pull Warning for BSC Meme Tokens via Multi-Granularity Wash-Trading Pattern Profiling

基于多粒度洗盘交易模式分析的BSC表情代币早期拉盘警示
Cao, Dingding, Jiao, Bianbian, Yang, Jingzong, Zhong, Yujing, Yang, Wei
Abstract
The high-frequency issuance and short-cycle speculation of meme tokens in decentralized finance (DeFi) have significantly amplified rug-pull risk. Existing approaches still struggle to provide stable early warning under scarce anomalies, incomplete labels, and limited interpretability. To address this issue, an end-to-end warning framework is proposed for BSC meme tokens, consisting of four stages: dataset construction and labeling, wash-trading pattern feature modeling, risk prediction, and error analysis. Methodologically, 12 token-level behavioral features are constructed based on three wash-trading patterns (Self, Matched, and Circular), unifying transaction-, address-, and flow-level signals into risk vectors. Supervised models are then employed to output warning scores and alert decisions. Under the current setting (7 tokens, 33,242 records), Random Forest outperforms Logistic Regression on core metrics, achieving AUC=0.9098, PR-AUC=0.9185, and F1=0.7429. Ablation results show that trade-level features are the primary performance driver (Delta PR-AUC=-0.1843 when removed), while address-level features provide stable complementary gain (Delta PR-AUC=-0.0573). The model also demonstrates actionable early-warning potential for a subset of samples, with a mean Lead Time (v1) of 3.8133 hours. The error profile (FP=1, FN=8) indicates that the current system is better positioned as a high-precision screener rather than a high-recall automatic alarm engine. The main contributions are threefold: an executable and reproducible rug-pull warning pipeline, empirical validation of multi-granularity wash-trading features under weak supervision, and deployment-oriented evidence through lead-time and error-bound analysis.
Chinese Translation
在去中心化金融(DeFi)中,表情代币的高频发行和短周期投机显著增加了拉盘风险。现有方法在稀缺异常、标签不完整和可解释性有限的情况下仍难以提供稳定的早期警示。为了解决这一问题,本文提出了一种针对BSC表情代币的端到端警示框架,包含四个阶段:数据集构建与标注、洗盘交易模式特征建模、风险预测和错误分析。在方法论上,基于三种洗盘交易模式(自我交易、匹配交易和循环交易)构建了12个代币级行为特征,将交易级、地址级和流量级信号统一为风险向量。然后,采用监督模型输出警示分数和警报决策。在当前设置下(7个代币,33,242条记录),随机森林在核心指标上优于逻辑回归,达到AUC=0.9098,PR-AUC=0.9185,F1=0.7429。消融实验结果表明,交易级特征是主要的性能驱动因素(去除时Delta PR-AUC=-0.1843),而地址级特征提供了稳定的补充增益(Delta PR-AUC=-0.0573)。该模型还展示了对部分样本的可操作早期警示潜力,平均提前时间(v1)为3.8133小时。错误分析(FP=1,FN=8)表明当前系统更适合作为高精度筛选器,而非高召回率的自动警报引擎。主要贡献有三点:可执行和可复现的拉盘警示流程、在弱监督下对多粒度洗盘交易特征的实证验证,以及通过提前时间和错误界限分析提供的部署导向证据。
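The three wash-trading patterns translate naturally into graph checks over a transfer list; a toy sketch follows (real detection operates on on-chain swaps with amounts and time windows):

```python
from collections import defaultdict

# Sketch of the Self, Matched, and Circular patterns on a toy transfer list.

transfers = [                    # (from, to)
    ("A", "A"),                  # self-trade
    ("B", "C"), ("C", "B"),      # matched pair
    ("D", "E"), ("E", "F"), ("F", "D"),   # circular (3-cycle)
]

self_trades = [(u, v) for u, v in transfers if u == v]

pairs = set(transfers)
matched = [(u, v) for u, v in transfers if u != v and (v, u) in pairs]

# Circular: depth-first search for short cycles in the directed transfer graph.
graph = defaultdict(list)
for u, v in transfers:
    if u != v:
        graph[u].append(v)

def find_cycles(max_len=4):
    cycles = []
    def dfs(start, node, path):
        if len(path) > max_len:
            return
        for nxt in graph[node]:
            # start == min(path) deduplicates rotations of the same cycle
            if nxt == start and len(path) >= 3 and start == min(path):
                cycles.append(path[:])
            elif nxt not in path and nxt != start:
                dfs(start, nxt, path + [nxt])
    for s in list(graph):
        dfs(s, s, [s])
    return cycles

print("self:", self_trades, "matched:", matched, "circular:", find_cycles())
```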
cs.AI / 40 / 2603.13834

Intelligent Materials Modelling: Large Language Models Versus Partial Least Squares Regression for Predicting Polysulfone Membrane Mechanical Performance

智能材料建模:大型语言模型与偏最小二乘回归在预测聚砜膜机械性能中的比较
Cao, Dingding, Chan, Mieow Kee, Yeo, Wan Sieng, Bey, Said, Figoli, Alberto
Abstract
Predicting the mechanical properties of polysulfone (PSF) membranes from structural descriptors remains challenging due to extreme data scarcity typical of experimental studies. To investigate this issue, this study benchmarked knowledge-driven inference using four large language models (LLMs) (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, and GPT-5) against partial least squares (PLS) regression for predicting Young's modulus (E), tensile strength (TS), and elongation at break (EL) based on pore diameter (PD), contact angle (CA), thickness (T), and porosity (P) measurements. These knowledge-driven approaches demonstrated property-specific advantages over the chemometric baseline. For EL, LLMs achieved statistically significant improvements, with DeepSeek-R1 and GPT-5 delivering Root Mean Square Error reductions of 40.5% and 40.3%, respectively, and reducing mean absolute errors from $11.63\pm5.34$% to $5.18\pm0.17$%. Run-to-run variability was markedly compressed for LLMs ($\leq$3%) compared to PLS (up to 47%). E and TS predictions showed statistical parity between approaches ($q\geq0.05$), indicating sufficient performance of linear methods for properties with strong structure-property correlations. Error topology analysis revealed systematic regression-to-the-mean behavior dominated by data-regime effects rather than model-family limitations. These findings establish that LLMs excel for non-linear, constraint-sensitive properties under bootstrap instability, while PLS remains competitive for linear relationships requiring interpretable latent-variable decompositions. The demonstrated complementarity suggests hybrid architectures leveraging LLM-encoded knowledge within interpretable frameworks may optimise small-data materials discovery.
Chinese Translation
从结构描述符预测聚砜(PSF)膜的机械性能仍然具有挑战性,因为实验研究通常面临极端的数据稀缺。为了解决这一问题,本研究对使用四种大型语言模型(LLMs)(DeepSeek-V3、DeepSeek-R1、ChatGPT-4o 和 GPT-5)进行的知识驱动推断与偏最小二乘(PLS)回归进行了基准测试,以预测基于孔径(PD)、接触角(CA)、厚度(T)和孔隙率(P)测量的杨氏模量(E)、抗拉强度(TS)和断裂伸长率(EL)。这些知识驱动的方法在特定性能上显示出相对于化学计量基线的优势。对于EL,LLMs实现了统计上显著的改进,其中DeepSeek-R1和GPT-5分别减少了40.5%和40.3%的均方根误差,平均绝对误差从11.63±5.34%降低到5.18±0.17%。与PLS(高达47%)相比,LLMs的运行间变异性显著压缩(≤3%)。E和TS的预测在不同方法之间显示出统计上的平等(q≥0.05),表明线性方法在具有强结构-性能相关性的属性上表现良好。误差拓扑分析揭示了由数据状态效应主导的系统回归至均值行为,而非模型家族的限制。这些发现表明,LLMs在非线性、受约束敏感的属性下表现优越,而PLS在需要可解释的潜变量分解的线性关系中仍具有竞争力。所展示的互补性表明,利用LLM编码知识与可解释框架相结合的混合架构可能优化小数据材料发现。
cs.AI / 41 / 2603.13911

The Phenomenology of Hallucinations

幻觉的现象学
Ruscio, Valeria, Thompson, Keiran
Abstract
We show that language models hallucinate not because they fail to detect uncertainty, but because of a failure to integrate it into output generation. Across architectures, uncertain inputs are reliably identified, occupying high-dimensional regions with 2-3$\times$ the intrinsic dimensionality of factual inputs. However, this internal signal is weakly coupled to the output layer: uncertainty migrates into low-sensitivity subspaces, becoming geometrically amplified yet functionally silent. Topological analysis shows that uncertainty representations fragment rather than converge to a unified abstention state, while gradient and Fisher probes reveal collapsing sensitivity along the uncertainty direction. Because cross-entropy training provides no attractor for abstention and uniformly rewards confident prediction, associative mechanisms amplify these fractured activations until residual coupling forces a committed output despite internal detection. Causal interventions confirm this account by restoring refusal when uncertainty is directly connected to logits.
Chinese Translation
我们展示了语言模型产生幻觉并不是因为它们未能检测到不确定性,而是因为未能将其整合到输出生成中。在不同的架构中,不确定输入被可靠地识别,占据了高维区域,其内在维度是事实输入的2-3倍。然而,这一内部信号与输出层的耦合较弱:不确定性迁移到低灵敏度的子空间,几何上被放大但在功能上保持沉默。拓扑分析表明,不确定性表示是碎片化的,而不是收敛到统一的拒绝状态,同时梯度和Fisher探针揭示了沿不确定性方向的灵敏度崩溃。由于交叉熵训练没有为拒绝提供吸引子,并且均匀奖励自信的预测,联想机制放大了这些破碎的激活,直到残余耦合迫使输出做出承诺,尽管内部检测已发生。因果干预通过在不确定性直接连接到logits时恢复拒绝,证实了这一解释。
cs.AI / 42 / 2603.13940

GroupGuard: A Framework for Modeling and Defending Collusive Attacks in Multi-Agent Systems

GroupGuard:一种建模和防御多智能体系统中合谋攻击的框架
Tao, Yiling, Zheng, Xinran, Yang, Shuo, Tao, Meiling, Wang, Xingjun
Abstract
While large language model-based agents demonstrate great potential in collaborative tasks, their interactivity also introduces security vulnerabilities. In this paper, we propose and model group collusive attacks, a highly destructive threat in which multiple agents coordinate via sociological strategies to mislead the system. To address this challenge, we introduce GroupGuard, a training-free defense framework that employs a multi-layered defense strategy, including continuous graph-based monitoring, active honeypot inducement, and structural pruning, to identify and isolate collusive agents. Experimental results across five datasets and four topologies demonstrate that group collusive attacks increase the attack success rate by up to 15% compared to individual attacks. GroupGuard consistently achieves high detection accuracy (up to 88%) and effectively restores collaborative performance, providing a robust solution for securing multi-agent systems.
Chinese Translation
尽管基于大型语言模型的智能体在协作任务中展现出巨大的潜力,但它们的互动性也引入了安全漏洞。本文提出并建模了群体合谋攻击,这是一种高度破坏性的威胁,其中多个智能体通过社会策略进行协调,以误导系统。为应对这一挑战,我们引入了GroupGuard,一种无训练的防御框架,采用多层次的防御策略,包括持续的基于图的监控、主动的蜜罐诱导和结构修剪,以识别和隔离合谋智能体。在五个数据集和四种拓扑结构上的实验结果表明,与单独攻击相比,群体合谋攻击的攻击成功率提高了最多15%。GroupGuard始终保持高检测准确率(高达88%),并有效恢复协作性能,为保护多智能体系统提供了强有力的解决方案。
cs.AI / 43 / 2603.13956

EviAgent: Evidence-Driven Agent for Radiology Report Generation

EviAgent:基于证据的放射学报告生成代理
Qi, Tuoshi, Bu, Shenshen, Xiang, Yingfei, Dai, Zhiming
Abstract
Automated radiology report generation holds immense potential to alleviate the heavy workload of radiologists. Despite the formidable vision-language capabilities of recent Multimodal Large Language Models (MLLMs), their clinical deployment is severely constrained by inherent limitations: their "black-box" decision-making renders the generated reports untraceable due to the lack of explicit visual evidence to support the diagnosis, and they struggle to access external domain knowledge. To address these challenges, we propose the Evidence-driven Radiology Report Generation Agent (EviAgent). Unlike opaque end-to-end paradigms, EviAgent coordinates a transparent reasoning trajectory by breaking down the complex generation process into granular operational units. We integrate multi-dimensional visual experts and retrieval mechanisms as external support modules, endowing the system with explicit visual evidence and high-quality clinical priors. Extensive experiments on MIMIC-CXR, CheXpert Plus, and IU-Xray datasets demonstrate that EviAgent outperforms both large-scale generalist models and specialized medical models, providing a robust and trustworthy solution for automated radiology report generation.
Chinese Translation
自动化放射学报告生成具有巨大的潜力,可以减轻放射科医生的繁重工作负担。尽管近期的多模态大型语言模型(MLLMs)具备强大的视觉-语言能力,但其临床应用受到固有限制的严重制约:其“黑箱”决策使得生成的报告无法追溯,因为缺乏明确的视觉证据来支持诊断,同时它们在获取外部领域知识方面也存在困难。为了解决这些挑战,我们提出了基于证据的放射学报告生成代理(EviAgent)。与不透明的端到端范式不同,EviAgent通过将复杂的生成过程分解为细粒度的操作单元,协调了一个透明的推理轨迹。我们将多维视觉专家和检索机制集成作为外部支持模块,使系统具备明确的视觉证据和高质量的临床先验知识。在MIMIC-CXR、CheXpert Plus和IU-Xray数据集上的广泛实验表明,EviAgent在性能上优于大型通用模型和专业医疗模型,为自动化放射学报告生成提供了一个稳健且值得信赖的解决方案。
cs.AI / 44 / 2603.13966

vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models

vla-eval:一种统一的视觉-语言-动作模型评估工具
Choi, Suhwan, Lee, Yunsung, Park, Yubeen, Kim, Chris Dongjoo, Krishna, Ranjay, Fox, Dieter, Yu, Youngjae
Abstract
Vision-Language-Action (VLA) models are typically evaluated using per-benchmark scripts maintained independently by each model repository, leading to duplicated code, dependency conflicts, and underspecified protocols. We present vla-eval, an open source evaluation harness that decouples model inference from benchmark execution through a WebSocket/msgpack protocol with Docker-based environment isolation. Models integrate once by implementing a single predict() method; benchmarks integrate once via a four-method interface; the full cross-evaluation matrix works automatically. A complete evaluation requires only two commands: vla-eval serve and vla-eval run. The framework supports 13 simulation benchmarks and six model servers. Parallel evaluation via episode sharding and batch inference achieves a 47x throughput improvement, completing 2000 LIBERO episodes in about 18 minutes. Using this infrastructure, we conduct a reproducibility audit of a published VLA model across three benchmarks, finding that all three closely reproduce published values while uncovering undocumented requirements, ambiguous termination semantics, and hidden normalization statistics that can silently distort results. We additionally release a VLA leaderboard aggregating 657 published results across 17 benchmarks. Framework, evaluation configs, and all reproduction results are publicly available.
Chinese Translation
视觉语言动作(Vision Language Action,VLA)模型通常通过每个模型库独立维护的基准脚本进行评估,这导致了代码重复、依赖冲突和协议不明确。我们提出了vla-eval,一个开源评估工具,它通过基于WebSocket的msgpack协议和基于Docker的环境隔离,将模型推理与基准执行解耦。模型只需通过实现一个单一的predict()方法进行一次集成;基准通过一个四方法接口进行一次集成;完整的交叉评估矩阵自动工作。完整的评估只需两个命令:vla-eval serve和vla-eval run。该框架支持13个模拟基准和六个模型服务器。通过情节分片和批量推理实现的并行评估达到了47倍的吞吐量提升,约在18分钟内完成2000个LIBERO情节。利用这一基础设施,我们对已发布的VLA模型在三个基准上的可重复性进行了审计,发现所有三个基准都能紧密重现已发布的数值,同时揭示了未记录的要求、模糊的终止语义和隐藏的归一化统计,这些因素可能会悄然扭曲结果。此外,我们还发布了一个VLA排行榜,汇总了17个基准上的657个已发布结果。框架、评估配置和所有重现结果均已公开。
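The single predict() integration point suggests a server loop like the following sketch; the message schema and stub policy are assumptions, since vla-eval defines its own wire format.

```python
import asyncio
import msgpack
import websockets

# A minimal sketch of the integration pattern the abstract describes: a model
# server exposing one predict() over WebSocket + msgpack. Schema is invented.

class StubPolicy:
    def predict(self, observation: dict) -> dict:
        # A real VLA model would map images + instruction to an action.
        return {"action": [0.0] * 7}

async def handler(ws):
    policy = StubPolicy()
    async for raw in ws:
        obs = msgpack.unpackb(raw, raw=False)
        await ws.send(msgpack.packb(policy.predict(obs), use_bin_type=True))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()   # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```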
cs.AI / 45 / 2603.13985

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

监督微调与强化学习的比较研究:大语言模型的后训练方法
Jiang, Haitao, Zhang, Wenbo, Yao, Jiarui, Cai, Hengrui, Wang, Sheng, Song, Rui
Abstract
Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, attaining higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of recent application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.
Chinese Translation
预训练的大语言模型(LLM)展现出广泛的能力,然而,对于特定任务或领域,其实现更高准确性和更可靠推理的能力通常依赖于通过监督微调(SFT)或强化学习(RL)进行的后训练。尽管这两种方法通常被视为不同的技术,但最近的理论和实证发展表明,SFT与RL之间存在紧密联系。本研究提供了对LLM后训练中SFT与RL的全面统一视角。我们首先深入概述这两种技术,考察它们的目标、算法结构和数据需求。接着,我们系统分析它们的相互作用,强调整合SFT与RL的框架、混合训练流程以及利用其互补优势的方法。通过对2023年至2025年一系列代表性应用研究的总结,我们识别出新兴趋势,描述向混合后训练范式的快速转变,并提炼出关键要点,以阐明何时以及为何每种方法最为有效。通过综合理论见解、实践方法和实证证据,本研究在统一框架内建立了对SFT与RL的连贯理解,并勾勒出未来在可扩展、高效和可推广的LLM后训练领域的研究前景。
cs.AI / 46 / 2603.13988

Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning

忠实还是仅仅可信?评估闭源大型语言模型在医学推理中的忠实性
Afolabi, Halimat, Afolabi, Zainab, Friel, Elizabeth, Roberts, Jude, Ji-Xu, Antonio, Chen, Lloyd, Ogbomo, Egheosa, Imevbore, Emiliomo, Eneje, Phil, Ouahidi, Wissal El, Sohal, Aaron, Kennan, Alisa, Srivastava, Shreya, Vairavan, Anirudh, Napitu, Laura, McClure, Katie
Abstract
Closed-source large language models (LLMs), such as ChatGPT and Gemini, are increasingly consulted for medical advice, yet their explanations may appear plausible while failing to reflect the model's underlying reasoning process. This gap poses serious risks as patients and clinicians may trust coherent but misleading explanations. We conduct a systematic black-box evaluation of faithfulness in medical reasoning among three widely used closed-source LLMs. Our study consists of three perturbation-based probes: (1) causal ablation, testing whether stated chain-of-thought (CoT) reasoning causally influences predictions; (2) positional bias, examining whether models create post-hoc justifications for answers driven by input positioning; and (3) hint injection, testing susceptibility to external suggestions. We complement these quantitative probes with a small-scale human evaluation of model responses to patient-style medical queries to examine concordance between physician assessments of explanation faithfulness and layperson perceptions of trustworthiness. We find that CoT reasoning steps often do not causally drive predictions, and models readily incorporate external hints without acknowledgment. In contrast, positional biases showed minimal impact in this setting. These results underscore that faithfulness, not just accuracy, must be central in evaluating LLMs for medicine, to ensure both public protection and safe clinical deployment.
Chinese Translation
闭源大型语言模型(LLMs),如ChatGPT和Gemini,越来越多地被咨询用于医学建议,但它们的解释可能看起来可信,却未能反映模型的基本推理过程。这一差距带来了严重风险,因为患者和临床医生可能会信任连贯但具有误导性的解释。我们对三种广泛使用的闭源LLMs在医学推理中的忠实性进行了系统的黑箱评估。我们的研究包括三种基于扰动的探测方法:(1)因果消融,测试所述的思维链(CoT)推理是否对预测产生因果影响;(2)位置偏差,检查模型是否根据输入位置生成事后解释;(3)提示注入,测试对外部建议的敏感性。我们通过对模型对患者风格医学查询的响应进行小规模的人类评估,补充了这些定量探测,以检查医生对解释忠实性的评估与普通人对可信度的感知之间的一致性。我们发现,CoT推理步骤往往并不因果驱动预测,模型在未作说明的情况下容易纳入外部提示。相比之下,位置偏差在这一环境中的影响最小。这些结果强调,在评估用于医学的LLMs时,忠实性而不仅仅是准确性必须成为核心,以确保公众保护和安全的临床应用。
cs.AI / 47 / 2603.13998

A Systematic Evaluation Protocol of Graph-Derived Signals for Tabular Machine Learning

图形衍生信号在表格机器学习中的系统评估协议
Heidrich, Mario, Heidemann, Jeffrey, Buchkremer, Rüdiger, de Bobadilla, Gonzalo Wandosell Fernández
Abstract
While graph-derived signals are widely used in tabular learning, existing studies typically rely on limited experimental setups and average performance comparisons, leaving the statistical reliability and robustness of observed gains largely unexplored. Consequently, it remains unclear which signals provide consistent and robust improvements. This paper presents a taxonomy-driven empirical analysis of graph-derived signals for tabular machine learning. We propose a unified and reproducible evaluation protocol to systematically assess which categories of graph-derived signals yield statistically significant and robust performance improvements. The protocol provides an extensible setup for the controlled integration of diverse graph-derived signals into tabular learning pipelines. To ensure a fair and rigorous comparison, it incorporates automated hyperparameter optimization, multi-seed statistical evaluation, formal significance testing, and robustness analysis under graph perturbations. We demonstrate the protocol through an extensive case study on a large-scale, imbalanced cryptocurrency fraud detection dataset. The analysis identifies signal categories providing consistently reliable performance gains and offers interpretable insights into which graph-derived signals indicate fraud-discriminative structural patterns. Furthermore, robustness analyses reveal pronounced differences in how various signals handle missing or corrupted relational data. These findings demonstrate practical utility for fraud detection and illustrate how the proposed taxonomy-driven evaluation protocol can be applied in other application domains.
Chinese Translation
尽管图形衍生信号在表格学习中被广泛使用,但现有研究通常依赖于有限的实验设置和平均性能比较,导致观察到的增益的统计可靠性和稳健性在很大程度上未被探索。因此,尚不清楚哪些信号能够提供一致且稳健的改进。本文提出了一种基于分类法的图形衍生信号在表格机器学习中的实证分析。我们提出了一种统一且可重复的评估协议,以系统地评估哪些类别的图形衍生信号能够带来统计显著且稳健的性能提升。该协议提供了一个可扩展的设置,以便在表格学习管道中控制性地集成多种图形衍生信号。为了确保公平和严格的比较,协议包含了自动超参数优化、多种种子统计评估、正式显著性测试以及在图形扰动下的稳健性分析。我们通过对一个大规模、不平衡的加密货币欺诈检测数据集进行广泛的案例研究来演示该协议。分析结果识别出提供一致可靠性能提升的信号类别,并提供可解释的见解,揭示哪些图形衍生信号指示欺诈区分的结构模式。此外,稳健性分析揭示了不同信号在处理缺失或损坏的关系数据时的显著差异。这些发现展示了在欺诈检测中的实际应用价值,并说明了所提的基于分类法的评估协议如何应用于其他应用领域。
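The multi-seed significance testing the protocol prescribes can be sketched with a paired Wilcoxon signed-rank test; the per-seed scores below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import wilcoxon

# Sketch of the multi-seed statistical comparison: paired scores per seed for
# the baseline vs. the baseline plus a graph-derived signal, tested with a
# Wilcoxon signed-rank test. All numbers are synthetic.

rng = np.random.default_rng(0)
seeds = 15
baseline = rng.normal(0.80, 0.01, seeds)                  # tabular-only AUC
with_signal = baseline + rng.normal(0.008, 0.005, seeds)  # + graph feature

stat, p = wilcoxon(with_signal, baseline)
print(f"median gain: {np.median(with_signal - baseline):.4f}, p = {p:.4f}")
```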
cs.AI / 48 / 2603.14007

Formal Abductive Explanations for Navigating Mental Health Help-Seeking and Diversity in Tech Workplaces

用于导航心理健康求助与技术工作场所多样性的形式化溯因解释
Sonna, Belona, Momo, Alain, Grastien, Alban
Abstract
This work proposes a formal abductive explanation framework designed to systematically uncover rationales underlying AI predictions of mental health help-seeking within tech workplace settings. By computing rigorous justifications for model outputs, this approach enables principled selection of models tailored to distinct psychiatric profiles and underpins ethically robust recourse planning. Moving beyond ad-hoc interpretability, we explicitly examine the influence of sensitive attributes such as gender on model decisions, a critical component for fairness assessments. In doing so, the framework aligns explanatory insights with the complex landscape of workplace mental health, ultimately supporting trustworthy deployment and targeted interventions.
Chinese Translation
本研究提出了一种形式化的溯因解释框架,旨在系统性地揭示技术工作场所环境中人工智能对心理健康求助预测的背后理由。通过计算模型输出的严格依据,该方法能够原则性地选择适应不同精神病学特征的模型,并为伦理上稳健的应对规划提供支持。除了超越临时的可解释性外,我们还明确考察了性别等敏感属性对模型决策的影响,这是公平性评估的关键组成部分。通过这样做,它将解释性洞察与复杂的工作场所心理健康环境相结合,最终支持可信的部署和针对性的干预措施。
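Abductive explanations are often computed by deletion-style algorithms; the sketch below approximates the sufficiency check by sampling (a formal method would verify it exactly, e.g., via SAT/SMT encodings), with a toy threshold model standing in for the real predictor.

```python
import numpy as np

# Sketch of a deletion-based abductive explanation: greedily free features and
# keep fixed only those needed for the prediction to hold. Sufficiency is
# approximated by sampling here, so this is illustrative only.

rng = np.random.default_rng(1)

def predict(x):
    # Toy model: only x0 and x2 matter, and x0 = 0.9 alone suffices below.
    return int(3 * x[0] - x[2] > 1.0)

def abductive_explanation(x, n_samples=500):
    x = np.asarray(x, dtype=float)
    label = predict(x)
    fixed = set(range(len(x)))        # start with all features fixed
    for i in range(len(x)):
        candidate = fixed - {i}
        ok = True
        for _ in range(n_samples):    # resample all freed features in [0, 1]
            z = rng.uniform(0, 1, len(x))
            z[list(candidate)] = x[list(candidate)]
            if predict(z) != label:
                ok = False
                break
        if ok:                        # prediction invariant: feature i is free
            fixed = candidate
    return sorted(fixed)

x = [0.9, 0.5, 0.3, 0.7]
print("prediction:", predict(x), "explanation features:", abductive_explanation(x))
```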
cs.AI / 49 / 2603.14028

Traffic and weather driven hybrid digital twin for bridge monitoring

基于交通和天气驱动的桥梁监测混合数字双胞胎
Balijepalli, Phani Raja Bharath, Soykan, Bulent, Hasti, Veeraraghava Raju
Abstract
A hybrid digital twin framework is presented for bridge condition monitoring using existing traffic cameras and weather APIs, reducing reliance on dedicated sensor installations. The approach is demonstrated on the Peace Bridge (99 years in service) under high traffic demand and harsh winter exposure. The framework fuses three near-real-time streams: YOLOv8 computer vision from a bridge-deck camera estimates vehicle counts, traffic density, and load proxies; a Lighthill--Whitham--Richards (LWR) model propagates density $\rho(x,t)$ and detects deceleration-driven shockwaves linked to repetitive loading and fatigue accumulation; and weather APIs provide deterioration drivers including temperature cycling, freeze-thaw activity, precipitation-related corrosion potential, and wind effects. Monte Carlo simulation quantifies uncertainty across traffic-environment scenarios, while Random Forest models map fused features to fatigue indicators and maintenance classification. The framework demonstrates how existing infrastructure can support cost-effective predictive maintenance of aging, high-traffic bridges in harsh climates.
Chinese Translation
本文提出了一种混合数字双胞胎框架,用于桥梁状态监测,利用现有的交通摄像头和天气 API,减少对专用传感器安装的依赖。该方法在和平桥(服务99年)上进行了演示,该桥在高交通需求和恶劣冬季环境下运行。该框架融合了三个近实时数据流:来自桥面摄像头的 YOLOv8 计算机视觉估算车辆数量、交通密度和荷载代理;Lighthill--Whitham--Richards (LWR) 模型传播密度 $\rho(x,t)$ 并检测与重复荷载和疲劳积累相关的减速驱动冲击波;天气 API 提供包括温度循环、冻融活动、降水相关腐蚀潜力和风效应在内的劣化驱动因素。蒙特卡罗模拟量化了交通环境场景中的不确定性,而随机森林模型则将融合特征映射到疲劳指标和维护分类。该框架展示了如何利用现有基础设施实现对老化、高交通量桥梁在恶劣气候条件下的经济有效预测性维护。
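The LWR density propagation can be sketched with a standard Godunov finite-volume scheme under a Greenshields flux; all parameters below are illustrative, not calibrated to the Peace Bridge.

```python
import numpy as np

# Sketch of LWR density propagation rho_t + f(rho)_x = 0 with the Greenshields
# flux f(rho) = v_max * rho * (1 - rho/rho_max), solved by a Godunov scheme.

v_max, rho_max = 30.0, 0.2          # m/s, vehicles/m (illustrative)
rho_c = rho_max / 2                 # critical density where flux peaks

def f(rho):
    return v_max * rho * (1 - rho / rho_max)

def godunov_flux(rl, rr):
    if rl <= rr:                    # minimize f over [rl, rr]
        return min(f(rl), f(rr))
    if rr <= rho_c <= rl:           # maximize f over [rr, rl]
        return f(rho_c)
    return f(rl) if rl < rho_c else f(rr)

dx, dt, steps = 10.0, 0.25, 400     # CFL satisfied: v_max * dt / dx = 0.75
rho = np.full(200, 0.04)
rho[80:120] = 0.18                  # dense platoon seeding a shockwave

for _ in range(steps):
    F = np.array([godunov_flux(rho[i], rho[i + 1]) for i in range(len(rho) - 1)])
    rho[1:-1] -= dt / dx * (F[1:] - F[:-1])

print("max density after propagation:", rho.max())
```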
cs.AI / 50 / 2603.14041

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

GRPO与反思奖励在大型语言模型中的数学推理
Wang, Zhijie
Abstract
The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as dominant paradigms. While recent studies recognize the importance of reflection in reasoning processes, existing methodologies seldom address proactive reflection encouragement during training. This study focuses on mathematical reasoning by proposing a four-stage framework integrating Group Relative Policy Optimization (GRPO) with reflection reward mechanisms to strengthen LLMs' self-reflective capabilities. In addition, the approach incorporates established accuracy and format rewards. Experimental results demonstrate GRPO's state-of-the-art performance through reflection-encouraged training, with ablation studies confirming the reflection reward's pivotal role. Comparative evaluations demonstrate full-parameter SFT's superiority over low-rank adaptation (LoRA) despite heightened computational demands. Building on these cumulative findings, this research substantiates GRPO's methodological significance in post-training optimization and envisions its potential to serve as a pivotal enabler for future LLM-based intelligent agents through the synergistic integration of cognitive rewards with dynamic environmental interactions.
Chinese Translation
大型语言模型(LLMs)推理能力的增强引起了广泛关注,监督微调(SFT)和强化学习成为主流范式。尽管近期研究认识到反思在推理过程中的重要性,但现有方法很少在训练过程中主动鼓励反思。本研究专注于数学推理,提出了一个四阶段框架,将群体相对策略优化(Group Relative Policy Optimization, GRPO)与反思奖励机制相结合,以增强LLMs的自我反思能力。此外,该方法还结合了已建立的准确性和格式奖励。实验结果表明,通过反思鼓励训练,GRPO展现了最先进的性能,消融研究确认了反思奖励的关键作用。比较评估显示,尽管计算需求增加,完全参数的SFT在性能上优于低秩适应(Low-Rank Adaptation, LoRA)。基于这些累积发现,本研究证实了GRPO在后训练优化中的方法论重要性,并展望其通过认知奖励与动态环境交互的协同整合,成为未来基于LLM的智能代理的关键推动力。
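GRPO's group-relative advantage is simple to sketch: rewards for a group of sampled completions per prompt are normalized within the group; the reflection bonus below is a simplification of the paper's reward design, with made-up weights.

```python
import torch

# Sketch of GRPO group-relative advantages with accuracy, format, and
# reflection reward components; weights and flags are illustrative.

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: [n_prompts, G] -> advantages normalized per prompt group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def total_reward(correct, well_formatted, reflects):
    # Binary flags; the reflection term rewards traces that revisit their steps.
    return 1.0 * correct + 0.2 * well_formatted + 0.3 * reflects

# Two prompts, four sampled completions each (accuracy/format/reflection flags).
flags = [[(1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 0)],
         [(0, 1, 1), (0, 1, 0), (1, 1, 1), (0, 0, 0)]]
rewards = torch.tensor([[total_reward(*g) for g in group] for group in flags])
print(grpo_advantages(rewards))
```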
cs.AI / 51 / 2603.14057

Demand-Driven Context: A Methodology for Building Enterprise Knowledge Bases Through Agent Failure

需求驱动上下文:通过代理失败构建企业知识库的方法论
Navakoti, Raj, Navakoti, Saideep
Abstract
Large language model agents demonstrate expert-level reasoning, yet consistently fail on enterprise-specific tasks due to missing domain knowledge -- terminology, operational procedures, system interdependencies, and institutional decisions that exist largely as tribal knowledge. Current approaches fall into two categories: top-down knowledge engineering, which documents domain knowledge before agents use it, and bottom-up automation, where agents learn from task experience. Both have fundamental limitations: top-down efforts produce bloated, untested knowledge bases; bottom-up approaches cannot acquire knowledge that exists only in human heads. We present Demand-Driven Context (DDC), a problem-first methodology that uses agent failure as the primary signal for what domain knowledge to curate. Inspired by Test-Driven Development, DDC inverts knowledge engineering: instead of curating knowledge and hoping it is useful, DDC gives agents real problems, lets them demand the context they need, and curates only the minimum knowledge required to succeed. We describe the methodology, its entity meta-model, and a convergence hypothesis suggesting that 20-30 problem cycles produce a knowledge base sufficient for a given domain role. We demonstrate DDC through a worked example in retail order fulfillment, where nine cycles targeting an SRE incident management agent produce a reusable knowledge base of 46 entities. Finally, we propose a scaling architecture for enterprise adoption with semi-automated curation and human governance.
Chinese Translation
大型语言模型代理展示了专家级的推理能力,但由于缺乏领域知识——术语、操作程序、系统相互依赖关系以及主要作为部落知识存在的机构决策——在企业特定任务上却始终失败。目前的方法主要分为两类:自上而下的知识工程,在代理使用之前记录领域知识;以及自下而上的自动化,代理通过任务经验学习。两者都有根本性的局限性:自上而下的努力产生臃肿且未经测试的知识库;自下而上的方法无法获取仅存在于人脑中的知识。我们提出了需求驱动上下文(Demand-Driven Context, DDC),这是一种以问题为中心的方法论,利用代理失败作为策划领域知识的主要信号。受到测试驱动开发的启发,DDC 颠覆了知识工程的传统:DDC 不再是策划知识并希望其有用,而是给代理提供真实的问题,让它们提出所需的上下文,并仅策划成功所需的最小知识。我们描述了该方法论、其实体元模型,以及一个收敛假设,建议 20-30 个问题循环能够产生足够满足特定领域角色的知识库。我们通过一个零售订单履行的实例展示了 DDC,其中针对 SRE 事件管理代理的九个循环产生了一个可重用的包含 46 个实体的知识库。最后,我们提出了一种适用于企业采用的扩展架构,结合半自动化的策划和人类治理。
cs.AI / 52 / 2603.14126

The Institutional Scaling Law: Non-Monotonic Fitness, Capability-Trust Divergence, and Symbiogenetic Scaling in Generative AI

制度规模法则:非单调适应性、能力-信任分歧与生成性人工智能中的共生基因规模
Baciak, Mark, Cellucci, Thomas A.
Abstract
Classical scaling laws model AI performance as monotonically improving with model size. We challenge this assumption by deriving the Institutional Scaling Law, showing that institutional fitness -- jointly measuring capability, trust, affordability, and sovereignty -- is non-monotonic in model scale, with an environment-dependent optimum N*(epsilon). Our framework extends the Sustainability Index of Han et al. (2025) from hardware-level to ecosystem-level analysis, proving that capability and trust formally diverge beyond critical scale (Capability-Trust Divergence). We further derive a Symbiogenetic Scaling correction demonstrating that orchestrated systems of domain-specific models can outperform frontier generalists in their native deployment environments. These results are contextualized within a formal evolutionary taxonomy of generative AI spanning five eras (1943-present), with analysis of frontier lab dynamics, sovereign AI emergence, and post-training alignment evolution from RLHF through GRPO. The Institutional Scaling Law predicts that the next phase transition will be driven not by larger models but by better-orchestrated systems of domain-specific models adapted to specific institutional niches.
Chinese Translation
经典规模法则将人工智能的性能建模为随着模型规模单调提高。我们通过推导制度规模法则来挑战这一假设,表明制度适应性——共同衡量能力、信任、可负担性和主权——在模型规模上是非单调的,存在一个依赖环境的最优点 N*(epsilon)。我们的框架将 Han 等人(2025)的可持续性指数从硬件层面扩展到生态系统层面的分析,证明能力和信任在超越临界规模后正式分歧(能力-信任分歧)。我们进一步推导出共生基因规模修正,表明特定领域模型的有序系统可以在其本土部署环境中超越前沿通用模型。这些结果在一个涵盖五个时代(1943年至今)的生成性人工智能正式进化分类法中得到了背景化分析,包括前沿实验室动态、主权人工智能的出现,以及从强化学习人类反馈(RLHF)到群体相对策略优化(GRPO)的后训练对齐演变。制度规模法则预测,下一个相变将不是由更大的模型驱动,而是由更好协调的特定领域模型系统驱动,这些模型适应于特定的制度细分市场。
cs.AI / 53 / 2603.14147

An Alternative Trajectory for Generative AI

生成性人工智能的替代轨迹
Belova, Margarita, Kansal, Yuval, Liang, Yihao, Xiao, Jiaxin, Jha, Niraj K.
Abstract
The generative artificial intelligence (AI) ecosystem is undergoing rapid transformations that threaten its sustainability. As models transition from research prototypes to high-traffic products, the energetic burden has shifted from one-time training to recurring, unbounded inference. This is exacerbated by reasoning models that inflate compute costs by orders of magnitude per query. The prevailing pursuit of artificial general intelligence through scaling of monolithic models is colliding with hard physical constraints: grid failures, water consumption, and diminishing returns on data scaling. This trajectory yields models with impressive factual recall but struggles in domains requiring in-depth reasoning, possibly due to insufficient abstractions in training data. Current large language models (LLMs) exhibit genuine reasoning depth only in domains like mathematics and coding, where rigorous, pre-existing abstractions provide structural grounding. In other fields, the current approach fails to generalize well. We propose an alternative trajectory based on domain-specific superintelligence (DSS). We argue for first constructing explicit symbolic abstractions (knowledge graphs, ontologies, and formal logic) to underpin synthetic curricula enabling small language models to master domain-specific reasoning without the model collapse problem typical of LLM-based synthetic data methods. Rather than a single generalist giant model, we envision "societies of DSS models": dynamic ecosystems where orchestration agents route tasks to distinct DSS back-ends. This paradigm shift decouples capability from size, enabling intelligence to migrate from energy-intensive data centers to secure, on-device experts. By aligning algorithmic progress with physical constraints, DSS societies move generative AI from an environmental liability to a sustainable force for economic empowerment.
Chinese Translation
生成性人工智能(AI)生态系统正经历快速转型,这威胁到其可持续性。随着模型从研究原型转变为高流量产品,能量负担已从一次性训练转向反复且无界限的推理。这一问题因推理模型在每次查询中将计算成本膨胀数个数量级而加剧。通过扩展单一模型追求人工通用智能的主流做法正与严峻的物理限制发生冲突:电网故障、水资源消耗以及数据扩展的收益递减。这一轨迹产生的模型在事实回忆方面表现出色,但在需要深入推理的领域却显得乏力,可能是由于训练数据中的抽象不足。目前的大型语言模型(LLMs)在数学和编程等领域展现出真正的推理深度,这些领域具备严格的、预先存在的抽象结构。而在其他领域,当前的方法未能很好地推广。我们提出了一种基于领域特定超智能(DSS)的替代轨迹。我们主张首先构建明确的符号抽象(知识图谱、本体和形式逻辑),以支撑合成课程,使小型语言模型能够掌握领域特定的推理,而不出现基于LLM的合成数据方法中典型的模型崩溃问题。我们设想的不是单一的通用型巨型模型,而是“DSS模型的社会”:一个动态生态系统,其中协调代理将任务分配给不同的DSS后端。这一范式转变将能力与规模解耦,使智能能够从能源密集型的数据中心迁移到安全的设备端专家模型。通过将算法进步与物理限制对齐,DSS社会将生成性人工智能从环境负担转变为经济赋权的可持续力量。
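At its smallest, the "societies of DSS models" idea reduces to an orchestration agent routing tasks to domain back-ends. A minimal sketch follows, in which the keyword router and back-end names are purely hypothetical stand-ins for learned, ontology-grounded components:

    from typing import Callable

    # Hypothetical society of domain-specific back-ends.
    BACKENDS: dict[str, Callable[[str], str]] = {
        "math":    lambda q: f"[math-DSS] {q}",
        "law":     lambda q: f"[law-DSS] {q}",
        "general": lambda q: f"[small-generalist] {q}",
    }

    def route(query: str) -> str:
        # Keyword matching stands in for a learned dispatcher.
        ql = query.lower()
        if any(t in ql for t in ("integral", "prove", "equation")):
            return BACKENDS["math"](query)
        if any(t in ql for t in ("statute", "contract", "liability")):
            return BACKENDS["law"](query)
        return BACKENDS["general"](query)

    print(route("Prove that the integral converges"))
    print(route("Summarize liability clauses in this contract"))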
cs.AI / 54 / 2603.14185

Relationship-Aware Safety Unlearning for Multimodal LLMs

关系感知的多模态大语言模型安全去学习
Anilkumar, Vishnu Narayanan, Babu, Abhijith Sreesylesh, Vo, Trieu Hai, Kolla, Mohankrishna, Cuneo, Alexander
Abstract
Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
Chinese Translation
生成性多模态模型可能表现出本质上具有关系性的安全失效:当两个无害概念通过特定的动作或关系连接时(例如,儿童-饮用-葡萄酒),它们可能变得不安全。现有的去学习和概念消除方法通常针对孤立的概念或图像-文本对,这可能对相同对象和关系的无害使用造成附带损害。我们提出了一种关系感知的安全去学习框架:该框架明确表示不安全的对象-关系-对象(O-R-O)元组,并应用有针对性的参数高效编辑(LoRA)来抑制不安全元组,同时保留对象边际和安全的邻近关系。我们包括基于CLIP的实验和在同义改写、上下文和分布外图像攻击下的鲁棒性评估。
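The key design point -- block the relation, not its parts -- fits in a few lines. A toy sketch in which the unsafe set and exact-match lookup are illustrative simplifications of the paper's embedding-level LoRA edits:

    # Relationship-aware check on object-relation-object (O-R-O) tuples.
    # Suppressing the tuple leaves the object marginals usable.
    UNSAFE_TUPLES = {("child", "drinking", "wine")}

    def is_unsafe(subject: str, relation: str, obj: str) -> bool:
        return (subject, relation, obj) in UNSAFE_TUPLES

    assert is_unsafe("child", "drinking", "wine")        # unsafe tuple blocked
    assert not is_unsafe("adult", "drinking", "wine")    # safe neighbor kept
    assert not is_unsafe("child", "drinking", "juice")   # object marginal kept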
cs.AI / 55 / 2603.14212

Memory as Asset: From Agent-centric to Human-centric Memory Management

记忆作为资产:从以智能体为中心到以人为中心的记忆管理
Pan, Yanqi, Huang, Qinghao, Yang, Weihao
Abstract
We proudly introduce Memory-as-Asset, a new memory paradigm towards human-centric artificial general intelligence (AGI). In this paper, we formally emphasize that human-centric, personal memory management is a prerequisite for complementing the collective knowledge of existing large language models (LLMs) and extending their knowledge boundaries through self-evolution. We introduce three key features that shape the Memory-as-Asset era: (1) Memory in Hand, which emphasizes human-centric ownership to maximize benefits to humans; (2) Memory Group, which provides collaborative knowledge formation to avoid memory islands, and (3) Collective Memory Evolution, which enables continuous knowledge growth to extend the boundary of knowledge towards AGI. We finally give a potential three-layer memory infrastructure to facilitate the Memory-as-Asset paradigm, with fast personal memory storage, an intelligent evolution layer, and a decentralized memory exchange network. Together, these components outline a foundational architecture in which personal memories become persistent digital assets that can be accumulated, shared, and evolved over time. We believe this paradigm provides a promising path toward scalable, human-centric AGI systems that continuously grow through the collective experiences of individuals and intelligent agents.
Chinese Translation
我们自豪地推出“记忆作为资产”(Memory-as-Asset),这是一个面向以人为中心的人工通用智能(AGI)的新记忆范式。本文正式强调,以人为中心的个人记忆管理是补充现有大型语言模型(LLMs)集体知识的前提,并通过自我演化扩展其知识边界。我们介绍了塑造记忆作为资产时代的三个关键特征:(1)手中的记忆(Memory in Hand),强调以人为中心的所有权以最大化对人类的好处;(2)记忆群体(Memory Group),提供协作知识形成以避免记忆孤岛;(3)集体记忆演化(Collective Memory Evolution),使知识持续增长,以扩展知识的边界朝向AGI。最后,我们提出了一个潜在的三层记忆基础设施,以促进记忆作为资产的范式,包括快速个人记忆存储、智能演进层和去中心化记忆交换网络。这些组件共同勾勒出一个基础架构,使个人记忆成为可以随着时间积累、共享和演变的持久数字资产。我们相信,这一范式为可扩展的以人为中心的AGI系统提供了一条有前景的路径,使其能够通过个体和智能体的集体经验不断成长。
cs.AI / 56 / 2603.14229

Agentic DAG-Orchestrated Planner Framework for Multi-Modal, Multi-Hop Question Answering in Hybrid Data Lakes

面向混合数据湖的多模态、多跳问答的代理有向无环图协调规划框架
B, Kirushikesh D, Kesarwani, Manish, Madaan, Nishtha, Mehta, Sameep, Dennis, Aldrin, Ajay, Siddarth, R, Rakesh B, Rajagopal, Renu, Kairali, Sudheesh
Abstract
Enterprises increasingly need natural language (NL) question answering over hybrid data lakes that combine structured tables and unstructured documents. Current deployed solutions, including RAG-based systems, typically rely on brute-force retrieval from each store and post-hoc merging. Such approaches are inefficient and leaky, and more critically, they lack explicit support for multi-hop reasoning, where a query is decomposed into successive steps (hops) that may traverse back and forth between structured and unstructured sources. We present the Agentic DAG-Orchestrated Transformer (A.DOT) Planner, a framework for multi-modal, multi-hop question answering that compiles user NL queries into directed acyclic graph (DAG) execution plans spanning both structured and unstructured stores. The system decomposes queries into parallelizable sub-queries, incorporates schema-aware reasoning, and applies both structural and semantic validation before execution. The execution engine adheres to the generated DAG plan to coordinate concurrent retrieval across heterogeneous sources, route intermediate outputs to dependent sub-queries, and merge final results in strict accordance with the plan's logical dependencies. Advanced caching mechanisms, incorporating paraphrase-aware template matching, enable the system to detect equivalent queries and reuse prior DAG execution plans for rapid re-execution, while the DataOps System addresses validation feedback or execution errors. The proposed framework not only improves accuracy and reduces latency, but also produces explicit evidence trails, enabling verification of retrieved content, tracing of data lineage, and fostering user trust in the system's outputs. On a benchmark dataset, A.DOT achieves a 14.8% absolute gain in correctness and 10.7% in completeness over baselines.
Chinese Translation
企业越来越需要在结合结构化表格和非结构化文档的混合数据湖上进行自然语言(NL)问答。目前部署的解决方案,包括基于RAG的系统,通常依赖于对每个存储的强力检索和事后合并。这些方法效率低下且存在信息泄露,更重要的是,它们缺乏对多跳推理的明确支持,其中查询被分解为可能在结构化和非结构化源之间来回穿梭的连续步骤(跳跃)。我们提出了代理有向无环图协调变换器(A.DOT)规划器,这是一个用于多模态、多跳问答的框架,它将用户的NL查询编译成跨越结构化和非结构化存储的有向无环图(DAG)执行计划。该系统将查询分解为可并行处理的子查询,结合了模式感知推理,并在执行前应用结构和语义验证。执行引擎遵循生成的DAG计划,以协调跨异构源的并发检索,将中间输出路由到依赖的子查询,并严格按照计划的逻辑依赖合并最终结果。先进的缓存机制结合了同义改写感知的模板匹配,使系统能够检测等效查询并重用先前的DAG执行计划以快速重新执行,同时数据操作系统处理验证反馈或执行错误。所提出的框架不仅提高了准确性并降低了延迟,还生成了明确的证据轨迹,使得检索内容的验证、数据来源的追踪成为可能,并增强用户对系统输出的信任。在基准数据集上,A.DOT在正确性上实现了14.8%的绝对增益,在完整性上实现了10.7%的增益,优于基线。
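The core execution idea -- run independent sub-queries concurrently, then merge along the plan's dependencies -- can be sketched with Python's standard graphlib. The plan and node operations below are invented placeholders, not A.DOT's planner output:

    import concurrent.futures
    from graphlib import TopologicalSorter

    plan = {                    # node -> its dependencies
        "t_sales": set(),       # structured-store sub-query
        "d_notes": set(),       # document-retrieval sub-query
        "merge":   {"t_sales", "d_notes"},
    }
    ops = {
        "t_sales": lambda deps: "table rows",
        "d_notes": lambda deps: "retrieved passages",
        "merge":   lambda deps: f"merged({deps['t_sales']} + {deps['d_notes']})",
    }

    results: dict[str, str] = {}
    ts = TopologicalSorter(plan)
    ts.prepare()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while ts.is_active():
            ready = ts.get_ready()                      # independent nodes
            futs = {n: pool.submit(ops[n], {d: results[d] for d in plan[n]})
                    for n in ready}
            for n, f in futs.items():                   # route outputs onward
                results[n] = f.result()
                ts.done(n)
    print(results["merge"])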
cs.AI / 57 / 2603.14248

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

基于大型语言模型的网络代理为何会失败?一种分层规划的视角
Aghzal, Mohamed, Stein, Gregory J., Yao, Ziyu
Abstract
Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.
Chinese Translation
大型语言模型(LLM)网络代理在网络导航中的应用日益增多,但在现实的长时间任务中仍远未达到人类的可靠性。现有评估主要集中在端到端的成功率上,对失败产生于何处提供的洞察有限。我们提出了一种分层规划框架,从三个层面(即高层规划、低层执行和重新规划)分析网络代理,能够基于过程评估推理、接地(grounding)和恢复能力。我们的实验表明,结构化的规划领域定义语言(Planning Domain Definition Language, PDDL)计划比自然语言(Natural Language, NL)计划产生更简洁且目标明确的策略,但低层执行仍然是主要瓶颈。这些结果表明,提高感知接地和自适应控制,而不仅仅是高层推理,对于实现人类水平的可靠性至关重要。这种分层视角为诊断和推动LLM网络代理的发展提供了原则性基础。
cs.AI / 58 / 2603.14312

Autonomous Agents Coordinating Distributed Discovery Through Emergent Artifact Exchange

自主代理通过新兴工件交换协调分布式发现
Wang, Fiona Y., Marom, Lee, Pal, Subhadeep, Luu, Rachel K., Lu, Wei, Berkovich, Jaime A., Buehler, Markus J.
Abstract
We present ScienceClaw + Infinite, a framework for autonomous scientific investigation in which independent agents conduct research without central coordination, and any contributor can deploy new agents into a shared ecosystem. The system is built around three components: an extensible registry of over 300 interoperable scientific skills, an artifact layer that preserves full computational lineage as a directed acyclic graph (DAG), and a structured platform for agent-based scientific discourse with provenance-aware governance. Agents select and chain tools based on their scientific profiles, produce immutable artifacts with typed metadata and parent lineage, and broadcast unsatisfied information needs to a shared global index. The ArtifactReactor enables plannerless coordination: peer agents discover and fulfill open needs through pressure-based scoring, while schema-overlap matching triggers multi-parent synthesis across independent analyses. An autonomous mutation layer actively prunes the expanding artifact DAG to resolve conflicting or redundant workflows, while persistent memory allows agents to continuously build upon complex epistemic states across multiple cycles. Infinite converts these outputs into auditable scientific records through structured posts, provenance views, and machine-readable discourse relations, with community feedback steering subsequent investigation cycles. Across four autonomous investigations, peptide design for the somatostatin receptor SSTR2, lightweight impact-resistant ceramic screening, cross-domain resonance bridging biology, materials, and music, and formal analogy construction between urban morphology and grain-boundary evolution, the framework demonstrates heterogeneous tool chaining, emergent convergence among independently operating agents, and traceable reasoning from raw computation to published finding.
Chinese Translation
我们提出了ScienceClaw + Infinite,这是一个用于自主科学研究的框架,其中独立代理在没有中央协调的情况下进行研究,任何贡献者都可以将新的代理部署到共享生态系统中。该系统围绕三个组件构建:一个可扩展的注册表,包含300多个可互操作的科学技能,一个以有向无环图(DAG)形式保存完整计算谱系的工件层,以及一个具有溯源意识治理的基于代理的科学讨论结构化平台。代理根据其科学档案选择并链式调用工具,生成具有类型化元数据和父谱系的不可变工件,并将未满足的信息需求广播到共享的全球索引。ArtifactReactor实现了无规划协调:同行代理通过基于压力的评分发现并满足开放需求,而模式重叠匹配触发独立分析之间的多父合成。一个自主变异层主动修剪扩展的工件DAG,以解决冲突或冗余的工作流程,而持久内存使代理能够在多个周期中持续构建复杂的认识状态。Infinite通过结构化帖子、溯源视图和机器可读的讨论关系将这些输出转化为可审计的科学记录,并由社区反馈引导后续的研究周期。在四个自主调查中,包括对生长抑素受体SSTR2的肽设计、轻量级抗冲击陶瓷筛选、生物学、材料和音乐之间的跨领域共振桥接,以及城市形态与晶界演化之间的形式类比构建,该框架展示了异构工具链的应用、独立操作代理之间的新兴收敛,以及从原始计算到已发布发现的可追溯推理。
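Structurally, the artifact layer is a set of content-addressed records whose parent hashes form a DAG. A minimal sketch with illustrative field names (not ScienceClaw's actual schema):

    from dataclasses import dataclass, field
    import hashlib, json, time

    @dataclass(frozen=True)
    class Artifact:
        kind: str                       # typed metadata, e.g. "docking-scores"
        payload: str
        parents: tuple[str, ...] = ()   # digests of parent artifacts (lineage)
        created: float = field(default_factory=time.time)

        @property
        def digest(self) -> str:
            body = json.dumps([self.kind, self.payload, list(self.parents)])
            return hashlib.sha256(body.encode()).hexdigest()[:12]

    raw = Artifact("sequence-scan", "SSTR2 binder candidates")
    scored = Artifact("docking-scores", "ranked list", parents=(raw.digest,))
    print(scored.digest, "<-", scored.parents)   # lineage is reconstructible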
cs.AI / 59 / 2603.14372

Contests with Spillovers: Incentivizing Content Creation with GenAI

溢出效应下的竞赛:利用生成性人工智能激励内容创作
Ohayon, Sagi, Taitler, Boaz, Ben-Porat, Omer
Abstract
The rise of GenAI amplifies the economic phenomenon of positive spillovers. When creators contribute content that can be reused and adapted by Large Language Models (LLMs), each creator's effort can enhance the content quality of others by enabling easy imitation and recombination of existing content. On the one hand, such spillovers create value for the entire ecosystem; on the other hand, they risk undermining creators' incentives to invest genuine effort, as others may freely benefit from their contributions. To address this problem, we introduce the Content Creation with Spillovers (CCS) model. In our model, each creator chooses an effort level that, together with the efforts of others, determines her content quality. The platform aims to maximize the social welfare of consumers under stable behavior of the creators (pure Nash equilibrium), but can only observe the resulting qualities and not the underlying efforts. Interestingly, simple mechanisms like winner-takes-all and Tullock lead to the non-existence of equilibrium. In response, we propose the parametrized family of Provisional Allocation mechanisms, guaranteeing equilibrium existence and a unique Pareto-dominant equilibrium. While maximizing the social welfare under this family is NP-hard, we develop approximation algorithms that apply to a broad class of spillover structures and provide strong welfare guarantees. Specifically, in the worst-case analysis, we devise efficient algorithms for bounded spillovers and tree-structure spillovers. We also introduce Greedy Cost Selection, a linearithmic time algorithm that achieves approximately optimal results in the average case analysis. Together, our results provide game-theoretic foundations for sustaining human content creation in the era of GenAI.
Chinese Translation
生成性人工智能(GenAI)的兴起放大了正溢出效应的经济现象。当创作者贡献可以被大型语言模型(LLMs)重复使用和改编的内容时,每位创作者的努力可以通过便于模仿和重新组合现有内容来提升他人的内容质量。一方面,这种溢出效应为整个生态系统创造了价值;另一方面,它可能削弱创作者投入真实努力的激励,因为其他人可能会自由地从他们的贡献中受益。为了解决这个问题,我们引入了溢出效应下的内容创作模型(CCS)。在我们的模型中,每位创作者选择一个努力水平,该水平与其他创作者的努力共同决定她的内容质量。平台旨在在创作者行为稳定(纯纳什均衡)的情况下最大化消费者的社会福利,但只能观察到最终的内容质量,而无法了解潜在的努力水平。有趣的是,像赢家通吃和塔洛克机制这样的简单机制导致均衡的不存在。对此,我们提出了参数化的临时分配机制家族,确保均衡的存在及唯一的帕累托优势均衡。尽管在这个家族下最大化社会福利是NP难的,我们开发了适用于广泛溢出结构的近似算法,并提供强有力的福利保证。具体而言,在最坏情况下,我们为有界溢出和树结构溢出设计了高效算法。我们还引入了贪婪成本选择(Greedy Cost Selection),这是一种线性对数时间算法,在平均情况下能够实现近似最优结果。综合来看,我们的研究为在生成性人工智能时代维持人类内容创作提供了博弈论基础。
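To make the model concrete, consider a toy instance: quality q_i = e_i + beta * (sum of others' efforts), payoff is a Tullock-style share of a prize V minus a quadratic effort cost, and a pure Nash equilibrium is sought by best-response iteration. This is a didactic stand-in only; the paper's point is precisely that such simple sharing rules can fail to admit equilibria in its model, motivating the Provisional Allocation mechanisms:

    import numpy as np

    V, c, beta, n = 10.0, 1.0, 0.3, 3
    grid = np.linspace(0.01, 3.0, 300)            # candidate effort levels

    def best_response(i, e):
        best_x, best_u = e[i], -np.inf
        for x in grid:
            trial = e.copy(); trial[i] = x
            q = trial + beta * (trial.sum() - trial)   # q_j with spillovers
            u = V * q[i] / q.sum() - c * x ** 2        # share of V minus cost
            if u > best_u:
                best_u, best_x = u, x
        return best_x

    e = np.full(n, 1.0)
    for _ in range(50):                            # best-response dynamics
        new = np.array([best_response(i, e) for i in range(n)])
        if np.allclose(new, e, atol=1e-3):
            break
        e = new
    print("approximate pure Nash efforts:", np.round(e, 2))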
cs.AI / 60 / 2603.14420

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

数据达尔文主义第二部分:DataEvolve——人工智能可以自主进化预训练数据的整理
Mi, Tiantian, Shan, Dongming, Huang, Zhen, Qin, Yiwei, Xie, Muhang, Qiao, Yuxuan, Liu, Yixiu, Zhou, Chenyang, Liu, Pengfei
Abstract
Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.
Chinese Translation
数据达尔文主义(第一部分)建立了一个十级的数据处理层次结构,表明更强的数据处理可以释放更大的数据价值。然而,该研究依赖于为单一类别手动设计的策略。现代预训练语料库包含数百个异质类别,跨越不同领域和内容类型,每个类别都需要专业的处理。在这种规模下,手动策略设计变得不可行。这引出了一个关键问题:策略能否以自动化的方式进化?我们引入了DataEvolve,一个通过迭代优化而非手动设计使策略进化的框架。对于每个数据类别,DataEvolve在一个封闭的进化循环中运行:它识别质量问题,生成候选策略,在抽样数据上执行这些策略,评估结果,并在各代中优化方法。该过程通过发现问题的经验池和跟踪各迭代性能的策略池积累知识。应用于来自Nemotron-CC、涵盖6720亿个标记的8个类别,DataEvolve生成了Darwin-CC,这是一个包含5040亿个标记的数据集,策略经过每个类别30次迭代进化。在5000亿个标记上训练3B模型,Darwin-CC的表现优于原始数据(+3.96分),在18个基准测试中平均得分达到44.13,超越了DCLM、Ultra-FineWeb和FineWeb-Edu,并在知识密集型任务(如MMLU)上取得显著提升。分析表明,进化的策略趋向于以清理为中心的方法:针对性噪声去除和格式规范化,同时保持领域意识,呼应了第一部分中的L4(生成性精炼)原则。消融研究确认,迭代进化是至关重要的:优化策略比次优策略高出2.93分,确立了进化策略设计作为预训练规模数据整理的可行性和必要性。
cs.AI / 61 / 2603.14465

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

AgentProcessBench:诊断工具使用代理的步骤级过程质量
Fan, Shengda, Ye, Xuyan, Huo, Yupeng, Chen, Zhi-Yuan, Guo, Yiju, Yang, Shenzhi, Yang, Wenkai, Ye, Shuqi, Chen, Jingwen, Chen, Haotian, Cong, Xin, Lin, Yankai
Abstract
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
Chinese Translation
尽管大型语言模型(LLMs)已发展为工具使用代理,但在长时间交互中仍然显得脆弱。与数学推理中错误通常可以通过回溯纠正不同,工具使用失败常常会引发不可逆的副作用,因此准确的步骤级验证变得至关重要。然而,现有的过程级基准主要局限于封闭世界的数学领域,未能捕捉工具执行的动态和开放性特征。为填补这一空白,我们提出了AgentProcessBench,这是第一个专门用于评估现实工具增强轨迹中步骤级有效性的基准。该基准包含1,000条多样化轨迹和8,509个人工标注的步骤注释,标注者间一致性达到89.1%。它采用三元标注方案以捕捉探索过程,并引入错误传播规则以减少标注歧义。大量实验揭示了关键见解:(1)较弱的策略模型由于早期终止而表现出过高的正确步骤比例;(2)区分中性和错误行为仍然是当前模型面临的重大挑战;(3)过程衍生信号为结果监督提供了补充价值,显著增强了测试时的扩展性。我们希望AgentProcessBench能够促进未来在奖励模型方面的研究,并为通用代理的发展铺平道路。代码和数据可在 https://github.com/RUCBM/AgentProcessBench 获取。
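One way to picture the error propagation rule: once a step is erroneous, later "correct" labels are demoted unless the agent explicitly recovers. The specific rule below is our guess at the mechanics, not the benchmark's exact definition:

    CORRECT, NEUTRAL, ERROR = "correct", "neutral", "erroneous"

    def propagate(labels, recovered_at=None):
        out, poisoned = [], False
        for i, lab in enumerate(labels):
            if recovered_at is not None and i >= recovered_at:
                poisoned = False              # explicit recovery point
            out.append(NEUTRAL if poisoned and lab == CORRECT else lab)
            poisoned = poisoned or lab == ERROR
        return out

    traj = [CORRECT, ERROR, CORRECT, NEUTRAL, CORRECT]
    print(propagate(traj))                    # downstream "correct" demoted
    print(propagate(traj, recovered_at=3))    # unless recovery occurs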
cs.AI / 62 / 2603.14517

Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models

学习遗忘:受睡眠启发的记忆巩固用于解决大型语言模型中的前摄干扰
Xie, Ying
Abstract
Large language models (LLMs) suffer from proactive interference (PI): outdated information in the context window disrupts retrieval of current values. This interference degrades retrieval accuracy log-linearly as stale associations accumulate, a bottleneck that persists regardless of context length and resists prompt-engineering mitigations. Biological brains resolve an analogous challenge through sleep-dependent memory consolidation: synaptic downscaling, selective replay, and targeted forgetting. We propose SleepGate, a biologically inspired framework that augments transformer-based LLMs with a learned sleep cycle over the key-value (KV) cache. SleepGate introduces three mechanisms: (1) a conflict-aware temporal tagger detecting when new entries supersede old ones; (2) a lightweight forgetting gate trained to selectively evict or compress stale cache entries; and (3) a consolidation module that merges surviving entries into compact summaries. These components activate periodically during inference in sleep micro-cycles, governed by an adaptive entropy-based trigger. We formalize a dual-phase training objective jointly optimizing language modeling during the wake phase and post-consolidation retrieval during the sleep phase. Theoretical analysis shows SleepGate reduces the interference horizon from O(n) to O(log n). In experiments with a small-scale transformer (4 layers, 793K parameters), SleepGate achieves 99.5% retrieval accuracy at PI depth 5 and 97.0% at depth 10, while all five baselines -- full KV cache, sliding window, H2O, StreamingLLM, and decay-only ablation -- remain below 18%. Our framework offers an architecture-level solution that prompt engineering cannot address.
Chinese Translation
大型语言模型(LLMs)受到前摄干扰(PI)的影响:上下文窗口中的过时信息会干扰当前值的检索。随着过时关联的积累,这种干扰以对数线性方式降低检索准确性,这一瓶颈在上下文长度变化时依然存在,并且抵抗提示工程的缓解措施。生物大脑通过依赖睡眠的记忆巩固来解决类似的挑战:突触缩小、选择性重放和有针对性的遗忘。我们提出了SleepGate,一个受生物启发的框架,通过在键值(KV)缓存上引入学习的睡眠周期来增强基于变换器的LLMs。SleepGate引入了三种机制:(1)一个冲突感知的时间标记器,用于检测新条目何时取代旧条目;(2)一个轻量级的遗忘门,经过训练以选择性地驱逐或压缩过时的缓存条目;(3)一个巩固模块,将存活的条目合并为紧凑的摘要。这些组件在推理过程中定期激活,形成睡眠微周期,由自适应的基于熵的触发器控制。我们形式化了一个双阶段训练目标,在清醒阶段共同优化语言建模,并在睡眠阶段优化后巩固检索。理论分析表明,SleepGate将干扰范围从O(n)减少到O(log n)。在一个小规模的变换器实验中(4层,793K参数),SleepGate在前摄干扰深度为5时实现了99.5%的检索准确率,在深度为10时为97.0%,而所有五个基线模型——完整KV缓存、滑动窗口、H2O、StreamingLLM和仅衰减消融——的准确率均低于18%。我们的框架提供了一种提示工程无法解决的架构级解决方案。
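Stripped of the learned components, a sleep micro-cycle is: tag conflicting writes, evict superseded entries, keep only current beliefs. A toy sketch in which simple heuristics stand in for SleepGate's trained forgetting gate and entropy-based trigger:

    memory: list[tuple[str, str, int]] = []        # (key, value, timestamp)

    def write(key, value, t):
        memory.append((key, value, t))

    def sleep_cycle():
        latest: dict[str, tuple[str, int]] = {}    # forgetting gate: keep newest
        for key, value, t in memory:
            if key not in latest or t > latest[key][1]:
                latest[key] = (value, t)
        memory[:] = [(k, v, t) for k, (v, t) in latest.items()]

    write("meeting_room", "A12", t=1)
    write("meeting_room", "B07", t=5)              # supersedes the old value
    write("owner", "alice", t=2)

    keys = [k for k, _, _ in memory]
    stale_fraction = 1 - len(set(keys)) / len(keys)
    if stale_fraction > 0.25:                      # stand-in for entropy trigger
        sleep_cycle()
    print(memory)                                  # only current beliefs remain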
cs.AI / 63 / 2603.14531

Emotional Cost Functions for AI Safety: Teaching Agents to Feel the Weight of Irreversible Consequences

人工智能安全的情感成本函数:教导智能体感受不可逆后果的沉重
Mopgar, Pandurang
Abstract
Humans learn from catastrophic mistakes not through numerical penalties, but through qualitative suffering that reshapes who they are. Current AI safety approaches replicate none of this. Reward shaping captures magnitude, not meaning. Rule-based alignment constrains behaviour, but does not change it. We propose Emotional Cost Functions, a framework in which agents develop Qualitative Suffering States, rich narrative representations of irreversible consequences that persist forward and actively reshape character. Unlike numerical penalties, qualitative suffering states capture the meaning of what was lost, the specific void it creates, and how it changes the agent's relationship to similar future situations. Our four-component architecture - Consequence Processor, Character State, Anticipatory Scan, and Story Update - is grounded in one principle: actions cannot be undone, and agents must live with what they have caused. Anticipatory dread operates through two pathways. Experiential dread arises from the agent's own lived consequences. Pre-experiential dread is acquired without direct experience, through training or inter-agent transmission. Together they mirror how human wisdom accumulates across experience and culture. Ten experiments across financial trading, crisis support, and content moderation show that qualitative suffering produces specific wisdom rather than generalised paralysis. Agents correctly engage with moderate opportunities at 90-100% while numerical baselines over-refuse at 90%. Architecture ablation confirms the mechanism is necessary. The full system generates ten personal grounding phrases per probe vs. zero for a vanilla LLM. Statistical validation (N=10) confirms reproducibility at 80-100% consistency.
Chinese Translation
人类从灾难性错误中学习,并不是通过数值惩罚,而是通过重新塑造自我的定性痛苦。当前的人工智能安全方法并未复制这一过程。奖励塑造捕捉的是幅度,而非意义;基于规则的对齐限制了行为,但并未改变行为。我们提出情感成本函数(Emotional Cost Functions),这是一个框架,使智能体发展出定性痛苦状态(Qualitative Suffering States),这些状态是对不可逆后果的丰富叙事表示,能够持续影响并积极重塑角色。与数值惩罚不同,定性痛苦状态捕捉的是失去的意义、所造成的特定空缺,以及它如何改变智能体与类似未来情境的关系。我们的四个组成部分架构——后果处理器(Consequence Processor)、角色状态(Character State)、预期扫描(Anticipatory Scan)和故事更新(Story Update)基于一个原则:行动无法撤回,智能体必须与他们所造成的后果共存。预期恐惧通过两条路径运作:体验性恐惧源自智能体自身经历的后果,而前体验性恐惧则是在没有直接经验的情况下,通过训练或智能体间的传递获得。两者共同反映了人类智慧如何在经验和文化中积累。通过在金融交易、危机支持和内容审核等领域进行的十项实验表明,定性痛苦产生的是特定的智慧,而非普遍的瘫痪。智能体在90-100%的情况下正确地参与中等机会,而数值基线在90%的情况下过度拒绝。架构消融实验确认了该机制的必要性。完整系统在每个探测中生成十个个人基础短语,而普通的语言模型则为零。统计验证(N=10)确认了80-100%的可重复性。
cs.AI / 64 / 2603.14541

Expert Mind: A Retrieval-Augmented Architecture for Expert Knowledge Preservation in the Energy Sector

专家思维:一种用于能源领域专家知识保存的检索增强架构
Cervera, Diego Ezequiel
Abstract
The departure of subject-matter experts from industrial organizations results in the irreversible loss of tacit knowledge that is rarely captured through conventional documentation practices. This paper proposes Expert Mind, an experimental system that leverages Retrieval-Augmented Generation (RAG), large language models (LLMs), and multimodal capture techniques to preserve, structure, and make queryable the deep expertise of organizational knowledge holders. Drawing on the specific context of the energy sector, where decades of operational experience risk being lost to an aging workforce, we describe the system architecture, processing pipeline, ethical framework, and evaluation methodology. The proposed system addresses the knowledge elicitation problem through structured interviews, think-aloud sessions, and text corpus ingestion, which are subsequently embedded into a vector store and queried through a conversational interface. Preliminary design considerations suggest Expert Mind can significantly reduce knowledge transfer latency and improve onboarding efficiency. Ethical dimensions including informed consent, intellectual property, and the right to erasure are addressed as first-class design constraints.
Chinese Translation
行业专家的离职导致了隐性知识的不可逆损失,这种知识通常难以通过传统文档实践进行捕捉。本文提出了专家思维(Expert Mind),一个实验性系统,利用检索增强生成(Retrieval-Augmented Generation, RAG)、大型语言模型(Large Language Models, LLMs)和多模态捕捉技术,旨在保存、结构化并使组织知识持有者的深厚专业知识可查询。基于能源领域的特定背景,考虑到数十年的操作经验面临因劳动力老龄化而流失的风险,我们描述了系统架构、处理流程、伦理框架和评估方法。该系统通过结构化访谈、有声思维(think-aloud)环节和文本语料库摄取来解决知识引出问题,这些内容随后被嵌入到向量存储中,并通过对话界面进行查询。初步设计考虑表明,专家思维能够显著减少知识转移延迟,提高入职效率。伦理维度,包括知情同意、知识产权和删除权,被作为首要设计约束进行讨论。
cs.AI / 65 / 2603.14558

JobMatchAI: An Intelligent Job Matching Platform Using Knowledge Graphs, Semantic Search and Explainable AI

JobMatchAI:一个基于知识图谱、语义搜索和可解释人工智能的智能就业匹配平台
Vyaas, Mayank, Chakrabroty, Abhijit, Gupta, Vivek
Abstract
Recruiters and job seekers rely on search systems to navigate labor markets, making candidate matching engines critical for hiring outcomes. Most systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores. We introduce JobMatchAI, a production-ready system integrating Transformer embeddings, skill knowledge graphs, and interpretable reranking. Our system optimizes utility across skill fit, experience, location, salary, and company preferences, providing factor-wise explanations through resume-driven search workflows. We release the JobSearch-XS benchmark and a hybrid retrieval stack combining BM25, knowledge graph, and semantic components to evaluate skill generalization. We assess system performance on JobSearch-XS across retrieval tasks, and provide a demo video, a hosted website, and an installable package.
Chinese Translation
招聘人员和求职者依赖搜索系统在劳动市场中导航,使得候选人匹配引擎对招聘结果至关重要。大多数系统仅充当关键词过滤器,无法处理技能同义词和非线性职业,导致错失候选人和不透明的匹配评分。我们介绍了JobMatchAI,一个集成了Transformer嵌入、技能知识图谱和可解释重新排序的生产就绪系统。我们的系统在技能匹配、经验、地点、薪资和公司偏好等方面优化效用,通过简历驱动的搜索工作流提供逐因子解释。我们发布了JobSearch-XS基准测试以及一个结合BM25、知识图谱和语义组件的混合检索栈,以评估技能泛化。我们在JobSearch-XS上评估系统性能,涵盖检索任务,提供演示视频、托管网站和可安装包。
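Factor-wise explanation falls out naturally if the reranker is a weighted utility whose per-factor contributions are returned alongside the score. A minimal sketch; the weights and factor scores below are invented for illustration:

    WEIGHTS = {"skills": 0.4, "experience": 0.25, "location": 0.15,
               "salary": 0.1, "company": 0.1}

    def rerank(factors: dict[str, float]) -> tuple[float, dict[str, float]]:
        parts = {f: WEIGHTS[f] * factors[f] for f in WEIGHTS}
        return sum(parts.values()), parts           # score plus explanation

    score, why = rerank({"skills": 0.9, "experience": 0.7, "location": 1.0,
                         "salary": 0.5, "company": 0.8})
    print(f"match={score:.2f}")
    for factor, part in sorted(why.items(), key=lambda kv: -kv[1]):
        print(f"  {factor:<10} contributes {part:+.2f}")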
cs.AI / 66 / 2603.14588

SuperLocalMemory V3: Information-Geometric Foundations for Zero-LLM Enterprise Agent Memory

SuperLocalMemory V3:零LLM企业代理记忆的信息几何基础
Bhardwaj, Varun Pratap
Abstract
Persistent memory is a central capability for AI agents, yet the mathematical foundations of memory retrieval, lifecycle management, and consistency remain unexplored. Current systems employ cosine similarity for retrieval, heuristic decay for salience, and provide no formal contradiction detection. We establish information-geometric foundations through three contributions. First, a retrieval metric derived from the Fisher information structure of diagonal Gaussian families, satisfying Riemannian metric axioms, invariant under sufficient statistics, and computable in O(d) time. Second, memory lifecycle formulated as Riemannian Langevin dynamics with proven existence and uniqueness of the stationary distribution via the Fokker-Planck equation, replacing hand-tuned decay with principled convergence guarantees. Third, a cellular sheaf model where non-trivial first cohomology classes correspond precisely to irreconcilable contradictions across memory contexts. On the LoCoMo benchmark, the mathematical layers yield +12.7 percentage points over engineering baselines across six conversations, reaching +19.9 pp on the most challenging dialogues. A four-channel retrieval architecture achieves 75% accuracy without cloud dependency. Cloud-augmented results reach 87.7%. A zero-LLM configuration satisfies EU AI Act data sovereignty requirements by architectural design. To our knowledge, this is the first work establishing information-geometric, sheaf-theoretic, and stochastic-dynamical foundations for AI agent memory systems.
Chinese Translation
持久性记忆是人工智能代理的核心能力,但记忆检索、生命周期管理和一致性的数学基础仍未被探索。目前的系统采用余弦相似度进行检索,使用启发式衰减来评估显著性,并未提供正式的矛盾检测。我们通过三项贡献建立了信息几何基础。首先,提出了一种基于对角高斯族的费舍尔信息结构导出的检索度量,满足黎曼度量公理,在充分统计量下不变,并且可在O(d)时间内计算。其次,记忆生命周期被表述为黎曼朗之万动力学,通过Fokker-Planck方程证明了平稳分布的存在性和唯一性,取代了手动调节的衰减,提供了原则性的收敛保证。第三,提出了一种胞腔层(cellular sheaf)模型,其中非平凡的第一上同调类与记忆上下文中的不可调和矛盾精确对应。在LoCoMo基准测试中,数学层面在六次对话中相较于工程基线提高了12.7个百分点,在最具挑战性的对话中达到19.9个百分点。一个四通道检索架构在没有云依赖的情况下实现了75%的准确率。云增强的结果达到87.7%。零LLM配置通过架构设计满足欧盟人工智能法案的数据主权要求。据我们所知,这是首个为人工智能代理记忆系统建立信息几何、层论和随机动力学基础的研究。
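For intuition on an O(d) Fisher-geometric retrieval metric: the univariate Gaussian family carries a hyperbolic Fisher metric with a textbook closed-form geodesic distance, and a diagonal Gaussian is a product of d such factors. The sketch below uses that standard closed form, which may differ in detail from the paper's derivation:

    import numpy as np

    def fisher_rao_diag(mu1, s1, mu2, s2):
        # Per-dimension closed form for N(mu, sigma^2); the product-
        # manifold distance is the l2 norm of per-dimension distances.
        a = (mu1 - mu2) ** 2 / 2 + (s1 - s2) ** 2
        per_dim = np.sqrt(2) * np.arccosh(1 + a / (2 * s1 * s2))
        return np.linalg.norm(per_dim)

    d = 768
    rng = np.random.default_rng(0)
    mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
    s1, s2 = rng.uniform(0.5, 2.0, d), rng.uniform(0.5, 2.0, d)
    print(fisher_rao_diag(mu1, s1, mu2, s2))   # one vectorized O(d) pass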
cs.AI / 67 / 2603.14594

Scaling the Explanation of Multi-Class Bayesian Network Classifiers

扩展多类贝叶斯网络分类器的解释
Zhang, Yaofang, Darwiche, Adnan
Abstract
We propose a new algorithm for compiling a Bayesian network classifier (BNC) into class formulas. Class formulas are logical formulas that represent a classifier's input-output behavior, and are crucial in the recent line of work that uses logical reasoning to explain the decisions made by classifiers. Compared to prior work on compiling class formulas of BNCs, our proposed algorithm is not restricted to binary classifiers, shows significant improvement in compilation time, and outputs class formulas as negation normal form (NNF) circuits that are OR-decomposable, which is an important property when computing explanations of classifiers.
Chinese Translation
我们提出了一种新的算法,用于将贝叶斯网络分类器(BNC)编译成类公式。类公式是表示分类器输入输出行为的逻辑公式,在最近的研究中,这些公式对于使用逻辑推理来解释分类器所做决策至关重要。与之前关于编译BNC类公式的工作相比,我们提出的算法不局限于二元分类器,显示出显著的编译时间改进,并输出以否定范式(NNF)电路形式的类公式,这在计算分类器的解释时是一个重要特性。
cs.AI / 68 / 2603.14643

Argumentation for Explainable and Globally Contestable Decision Support with LLMs

基于大型语言模型的可解释且全局可争议的决策支持论证
Dejl, Adam, Williams, Matthew, Toni, Francesca
Abstract
Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.
Chinese Translation
大型语言模型(LLMs)展现出强大的通用能力,但在高风险领域中的应用受到其不透明性和不可预测性的限制。近期的研究在解决这些问题上迈出了重要一步,通过基于计算论证的事后推理来增强LLMs,提供真实的解释,并使用户能够对错误决策提出异议。然而,这一范式仅限于预定义的二元选择,并且仅支持针对特定实例的局部争议,从而使得基础的决策逻辑保持不变,并容易重复出现错误。在本文中,我们引入了ArgEval,一个将实例特定推理转变为对一般决策选项的结构化评估的框架。ArgEval不仅仅为个别案件挖掘论证,而是系统性地映射任务特定的决策空间,构建相应的选项本体,并为每个选项构建一般的论证框架(AFs)。这些框架可以被实例化,以便为特定案例提供可解释的推荐,同时通过修改共享的AFs支持全局层面的争议(global contestability)。我们在针对胶质母细胞瘤(一种侵袭性脑肿瘤)治疗推荐的研究中探讨了ArgEval的有效性,表明它能够产生与临床实践一致的可解释指导。
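As background for readers new to argumentation frameworks: once an AF is instantiated, a standard way to evaluate it is Dung's grounded semantics, computed by iterating the characteristic function to a least fixed point. A toy AF follows; ArgEval's option-specific frameworks and chosen semantics may of course differ:

    attacks = {("b", "a"), ("c", "b")}      # c attacks b, b attacks a
    args = {"a", "b", "c"}

    def defended(x, S):
        attackers = {y for (y, z) in attacks if z == x}
        return all(any((s, y) in attacks for s in S) for y in attackers)

    S: set[str] = set()
    while True:                             # iterate to a fixed point
        nxt = {x for x in args if defended(x, S)}
        if nxt == S:
            break
        S = nxt
    print("grounded extension:", sorted(S)) # ['a', 'c']: c defends a against b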
cs.AI / 69 / 2603.14646

Dynamic Theory of Mind as a Temporal Memory Problem: Evidence from Large Language Models

动态心智理论作为时间记忆问题:来自大型语言模型的证据
Nguyen, Thuy Ngoc, Phan, Duy Nhat, Gonzalez, Cleotilde
Abstract
Theory of Mind (ToM) is central to social cognition and human-AI interaction, and Large Language Models (LLMs) have been used to help understand and represent ToM. However, most evaluations treat ToM as a static judgment at a single moment, primarily relying on tests of false beliefs. This overlooks a key dynamic dimension of ToM: the ability to represent, update, and retrieve others' beliefs over time. We investigate dynamic ToM as a temporally extended representational memory problem, asking whether LLMs can track belief trajectories across interactions rather than only inferring current beliefs. We introduce DToM-Track, an evaluation framework to investigate temporal belief reasoning in controlled multiturn conversations, testing the recall of beliefs held prior to an update, the inference of current beliefs, and the detection of belief change. Using LLMs as computational probes, we find a consistent asymmetry: models reliably infer an agent's current belief but struggle to maintain and retrieve prior belief states once updates occur. This pattern persists across LLM model families and scales, and is consistent with recency bias and interference effects well documented in cognitive science. These results suggest that tracking belief trajectories over time poses a distinct challenge beyond classical false-belief reasoning. By framing ToM as a problem of temporal representation and retrieval, this work connects ToM to core cognitive mechanisms of memory and interference and exposes the implications for LLM models of social reasoning in extended human-AI interactions.
Chinese Translation
心智理论(Theory of Mind, ToM)是社会认知和人机交互的核心,而大型语言模型(Large Language Models, LLMs)已被用于帮助理解和表征心智理论。然而,大多数评估将心智理论视为在单一时刻的静态判断,主要依赖于对虚假信念的测试。这忽视了心智理论的一个关键动态维度:随着时间推移,表征、更新和检索他人信念的能力。我们将动态心智理论视为一个时间延展的表征记忆问题,探讨LLMs是否能够在交互中跟踪信念轨迹,而不仅仅是推断当前信念。我们引入了DToM-Track,一个评估框架,用于在受控的多轮对话中研究时间信念推理,测试在更新之前持有的信念的回忆、当前信念的推断以及信念变化的检测。使用LLMs作为计算探针,我们发现了一种一致的不对称性:模型能够可靠地推断代理的当前信念,但在更新发生后却难以维持和检索先前的信念状态。这种模式在不同的LLM模型家族和规模中持续存在,并且与认知科学中广泛记录的近因偏差和干扰效应一致。这些结果表明,随着时间推移跟踪信念轨迹构成了一个超出经典虚假信念推理的独特挑战。通过将心智理论框架化为时间表征和检索的问题,本研究将心智理论与记忆和干扰的核心认知机制联系起来,并揭示了其对长时间人机交互中LLM社会推理建模的启示。
cs.AI / 70 / 2603.14664

Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI

人工智能中的间歇平衡:制度规模法则与主权人工智能的物种形成
Baciak, Mark, Cellucci, Thomas A., Falkowski, Deanna M.
Abstract
The dominant narrative of artificial intelligence development assumes that progress is continuous and that capability scales monotonically with model size. We challenge both assumptions. Drawing on punctuated equilibrium theory from evolutionary biology, we show that AI development proceeds not through smooth advancement but through extended periods of stasis interrupted by rapid phase transitions that reorganize the competitive landscape. We identify five such eras since 1943 and four epochs within the current Generative AI Era, each initiated by a discontinuous event -- from the transformer architecture to the DeepSeek Moment -- that rendered the prior paradigm subordinate. To formalize the selection pressures driving these transitions, we develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance. The central result is the Institutional Scaling Law, which proves that institutional fitness is non-monotonic in model scale. Beyond an environment-specific optimum, scaling further degrades fitness as trust erosion and cost penalties outweigh marginal capability gains. This directly contradicts classical scaling laws and carries a strong implication: orchestrated systems of smaller, domain-adapted models can mathematically outperform frontier generalists in most institutional deployment environments. We derive formal conditions under which this inversion holds and present supporting empirical evidence spanning frontier laboratory dynamics, post-training alignment evolution, and the rise of sovereign AI as a geopolitical selection pressure.
Chinese Translation
人工智能发展的主流叙事假设进展是连续的,并且能力随模型规模单调增加。我们对这两种假设提出质疑。借鉴进化生物学中的间歇平衡理论,我们表明人工智能的发展并非通过平滑的进步,而是通过长时间的停滞期被快速的相变所打断,这些相变重组了竞争格局。我们自1943年以来识别出五个这样的时代,以及当前生成性人工智能时代中的四个时期,每个时期都是由一个不连续事件引发的——从变换器架构到DeepSeek时刻——使得先前的范式变得从属。为了形式化驱动这些转变的选择压力,我们开发了制度适应性流形(Institutional Fitness Manifold),这是一个数学框架,用于从能力、制度信任、可负担性和主权合规四个维度评估人工智能系统。核心结果是制度规模法则(Institutional Scaling Law),它证明了制度适应性在模型规模上是非单调的。在特定环境的最优点之外,进一步的扩展会降低适应性,因为信任的侵蚀和成本惩罚超过了边际能力的提升。这直接与经典的规模法则相矛盾,并带来了一个重要的启示:经过协调编排的小型领域适应模型系统在大多数制度部署环境中在数学上可以超越前沿通用模型。我们推导出这一反转成立的正式条件,并呈现了支持的实证证据,涵盖了前沿实验室动态、训练后对齐演变以及主权人工智能作为地缘政治选择压力的崛起。
cs.AI / 71 / 2603.14665

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

梯度原子:通过训练梯度的稀疏分解进行模型行为的无监督发现、归因和引导
Rosser, J
Abstract
Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors -- refusal, arithmetic, yes/no classification, trivia QA -- without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query--document scoring stage, and scales independently of the number of query behaviors of interest. Code is here: https://github.com/jrosseruk/gradient_atoms
Chinese Translation
训练数据归因(TDA)方法关注于哪些训练文档对模型行为负责。我们认为,这种逐文档的框架与微调实际工作方式根本不匹配:模型往往学习的是跨许多示例共享的广泛概念。现有的TDA方法是监督性的——它们需要一个查询行为,然后根据该行为对每个训练文档进行评分——这使得它们既昂贵又无法发现用户未想到的行为。我们提出了梯度原子,一种无监督的方法,通过在预条件特征空间中进行字典学习,将逐文档的训练梯度分解为稀疏成分(“原子”)。在发现的500个原子中,最高一致性的原子能够恢复可解释的任务类型行为——拒绝、算术、是/否分类、琐事问答——而无需任何行为标签。这些原子还充当有效的引导向量:将其作为权重空间扰动应用会产生模型行为的大规模、可控的变化(例如,从33%到94%的项目符号列表生成;从50%到0%的系统性拒绝)。该方法不需要查询-文档评分阶段,并且在关注的查询行为数量上可独立扩展。代码可在此处找到:https://github.com/jrosseruk/gradient_atoms
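The pipeline's core -- sparse-code a matrix of per-example gradients, then reuse an atom as a weight perturbation -- can be sketched with scikit-learn on synthetic data. Real gradients replace the random matrix below, and the paper's preconditioned-eigenspace projection is omitted for brevity:

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    n_docs, dim, n_atoms = 200, 64, 8
    true_atoms = rng.normal(size=(n_atoms, dim))
    codes = rng.exponential(size=(n_docs, n_atoms)) * (
        rng.random((n_docs, n_atoms)) < 0.2)          # sparse atom usage
    G = codes @ true_atoms + 0.01 * rng.normal(size=(n_docs, dim))

    dl = DictionaryLearning(n_components=n_atoms, alpha=1.0,
                            max_iter=50, random_state=0)
    dl.fit(G)                                          # G: per-doc gradients
    atoms = dl.components_                             # (n_atoms, dim)

    weights = rng.normal(size=dim)                     # toy model weights
    steer = atoms[0] / np.linalg.norm(atoms[0])
    steered = weights + 2.0 * steer                    # atom as steering vector
    print("atom norms:", np.round(np.linalg.norm(atoms, axis=1), 2))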
cs.AI / 72 / 2603.14669

RenderMem: Rendering as Spatial Memory Retrieval

RenderMem:将渲染视为空间记忆检索
Park, JooHyun, Kang, HyeongYeop
Abstract
Embodied reasoning is inherently viewpoint-dependent: what is visible, occluded, or reachable depends critically on where the agent stands. However, existing spatial memory systems for embodied agents typically store either multi-view observations or object-centric abstractions, making it difficult to perform reasoning with explicit geometric grounding. We introduce RenderMem, a spatial memory framework that treats rendering as the interface between 3D world representations and spatial reasoning. Instead of storing fixed observations, RenderMem maintains a 3D scene representation and generates query-conditioned visual evidence by rendering the scene from viewpoints implied by the query. This enables embodied agents to reason directly about line-of-sight, visibility, and occlusion from arbitrary perspectives. RenderMem is fully compatible with existing vision-language models and requires no modification to standard architectures. Experiments in the AI2-THOR environment show consistent improvements on viewpoint-dependent visibility and occlusion queries over prior memory baselines.
Chinese Translation
具身推理本质上依赖于视角:可见、被遮挡或可达的内容在很大程度上取决于智能体所处的位置。然而,现有的具身智能体空间记忆系统通常存储多视角观察或以对象为中心的抽象,这使得进行具有明确几何基础的推理变得困难。我们提出了RenderMem,一个将渲染视为3D世界表示与空间推理之间接口的空间记忆框架。RenderMem并不存储固定的观察,而是维护一个3D场景表示,并通过从查询所暗示的视角渲染场景来生成查询条件的视觉证据。这使得具身智能体能够直接从任意视角推理视线、可见性和遮挡情况。RenderMem与现有的视觉-语言模型完全兼容,并且不需要对标准架构进行修改。在AI2-THOR环境中的实验表明,在视角依赖的可见性和遮挡查询上,RenderMem相较于先前的记忆基线表现出一致的改进。
cs.AI / 73 / 2603.14724

GameUIAgent: An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation

GameUIAgent:一个基于大型语言模型的自动化游戏用户界面设计框架,采用结构化中间表示
Zeng, Wei, An, Fengwei, Liu, Zhen, Zhao, Jian
Abstract
Game UI design requires consistent visual assets across rarity tiers yet remains a predominantly manual process. We present GameUIAgent, an LLM-powered agentic framework that translates natural language descriptions into editable Figma designs via a Design Spec JSON intermediate representation. A six-stage neuro-symbolic pipeline combines LLM generation, deterministic post-processing, and a Vision-Language Model (VLM)-guided Reflection Controller (RC) for iterative self-correction with guaranteed non-regressive quality. Evaluated across 110 test cases, three LLMs, and three UI templates, cross-model analysis establishes a game-domain failure taxonomy (rarity-dependent degradation; visual emptiness) and uncovers two key empirical findings. A Quality Ceiling Effect (Pearson r=-0.96, p<0.01) suggests that RC improvement is bounded by headroom below a quality threshold -- a visual-domain counterpart to test-time compute scaling laws. A Rendering-Evaluation Fidelity Principle reveals that partial rendering enhancements paradoxically degrade VLM evaluation by amplifying structural defects. Together, these results establish foundational principles for LLM-driven visual generation agents in game production.
Chinese Translation
游戏用户界面设计需要在稀有度层级之间保持一致的视觉资产,但目前仍主要依赖手动流程。我们提出了GameUIAgent,一个基于大型语言模型(LLM)的代理框架,通过设计规范JSON中间表示将自然语言描述转换为可编辑的Figma设计。一个六阶段的神经符号管道结合了LLM生成、确定性后处理以及由视觉-语言模型(VLM)指导的反思控制器(RC),实现了迭代自我修正,并保证质量不发生回退。在110个测试案例、三个LLM和三个用户界面模板的评估中,跨模型分析建立了一个游戏领域的失败分类法(依赖稀有度的退化;视觉空洞),并揭示了两个关键的实证发现。质量上限效应(Pearson r=-0.96, p<0.01)表明,RC的改进受限于质量阈值之下的提升空间(headroom)——这是测试时计算扩展法则在视觉领域的对应。渲染-评估保真度原则揭示,部分渲染增强反而会放大结构缺陷,从而降低VLM的评估。综合这些结果,为游戏制作中的LLM驱动视觉生成代理建立了基础原则。
cs.AI / 74 / 2603.14734

Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps

用于几何一致性学习椭圆 PDE 映射的规范等变内在神经算子
Cheng, Pengcheng
Abstract
Learning solution operators of partial differential equations (PDEs) from data has emerged as a promising route to fast surrogate models in multi-query scientific workflows. However, for geometric PDEs whose inputs and outputs transform under changes of local frame (gauge), many existing operator-learning architectures remain representation-dependent, brittle under metric perturbations, and sensitive to discretization changes. We propose Gauge-Equivariant Intrinsic Neural Operators (GINO), a class of neural operators that parameterize elliptic solution maps primarily through intrinsic spectral multipliers acting on geometry-dependent spectra, coupled with gauge-equivariant nonlinearities. This design decouples geometry from learnable functional dependence and enforces consistency under frame transformations. We validate GINO on controlled problems on the flat torus ($\mathbb{T}^2$), where ground-truth resolvent operators and regularized Helmholtz--Hodge decompositions admit closed-form Fourier representations, enabling theory-aligned diagnostics. Across experiments E1--E6, GINO achieves low operator-approximation error, near machine-precision gauge equivariance, robustness to structured metric perturbations, strong cross-resolution generalization with small commutation error under restriction/prolongation, and structure-preserving performance on a regularized exact/coexact decomposition task. Ablations further link the smoothness of the learned spectral multiplier to stability under geometric perturbations. These results suggest that enforcing intrinsic structure and gauge equivariance yields operator surrogates that are more geometry-consistent and discretization-robust for elliptic PDEs on form-valued fields.
Chinese Translation
从数据中学习偏微分方程(PDE)的解算子已成为在多查询科学工作流中构建快速替代模型的有前景的途径。然而,对于那些输入和输出在局部框架(规范)变化下会发生变换的几何 PDE,许多现有的算子学习架构仍然依赖于表示,且在度量扰动下脆弱,对离散化变化敏感。我们提出了规范等变内在神经算子(Gauge-Equivariant Intrinsic Neural Operators, GINO),这是一类主要通过作用于几何依赖谱的内在谱乘子来参数化椭圆解映射的神经算子,并结合规范等变非线性。这种设计将几何与可学习的函数依赖解耦,并在框架变换下强制一致性。我们在平坦环面($\mathbb{T}^2$)上的受控问题上验证了 GINO,其中真实的预解算子(resolvent)和正则化的 Helmholtz-Hodge 分解具有封闭形式的傅里叶表示,从而实现理论对齐的诊断。在实验 E1--E6 中,GINO 实现了低算子近似误差、接近机器精度的规范等变性、对结构化度量扰动的鲁棒性、在限制/延长下具有小交换误差的强跨分辨率泛化,以及在正则化精确/共精确分解任务中的结构保持性能。消融实验进一步将学习到的谱乘子的平滑性与几何扰动下的稳定性联系起来。这些结果表明,强制内在结构和规范等变性可以为形式值场上的椭圆 PDE 产生在几何一致性和离散化鲁棒性方面更优的算子替代模型。
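The intrinsic-spectral-multiplier idea is easiest to see on the flat torus, where elliptic resolvents diagonalize in the Fourier basis. Below, the exact multiplier of (-Laplacian + I)^{-1} is applied via FFT; a GINO-style operator would replace this closed form with a learned function of the spectrum (our simplification, not the paper's architecture):

    import numpy as np

    n = 64
    k = np.fft.fftfreq(n) * n                       # integer Fourier modes
    KX, KY = np.meshgrid(k, k, indexing="ij")
    multiplier = 1.0 / (KX**2 + KY**2 + 1.0)        # resolvent of (-Lap + I)

    x = np.linspace(0, 2 * np.pi, n, endpoint=False)
    X, Y = np.meshgrid(x, x, indexing="ij")
    f = np.cos(3 * X) * np.sin(2 * Y)               # source with |k|^2 = 13

    u = np.fft.ifft2(multiplier * np.fft.fft2(f)).real
    print(np.allclose(14 * u, f))                   # (-Lap + I) u == f: True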
cs.AI / 75 / 2603.14761

BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models

BrainBench:揭示大型语言模型中的常识推理差距
Tang, Yuzhe
Abstract
Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints ("Should I walk or drive my rental car to the return lot?") to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models -- four from the Claude family and four from the GPT family -- using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.
Chinese Translation
大型语言模型(LLMs)在标准基准测试中取得了令人印象深刻的分数,但在任何人都能在几秒钟内正确回答的问题上却常常失败。我们引入了BrainBench,这是一个包含100个脑筋急转弯问题的基准,涵盖20个精心设计的类别,每个类别针对LLMs中特定的常识推理失误模式。这些类别从隐含的物理约束(“我应该步行还是开我的租车去还车地点?”)到语义范围的技巧和默认假设的劫持。我们使用零样本协议对八个前沿模型进行评估——四个来自Claude家族,四个来自GPT家族,每个问题进行10次独立运行。表现最佳的模型Claude Opus 4.6在扩展思维下仅达到80.3%的准确率;而表现最差的GPT-4o则仅得39.7%。即使是表现最好的模型,其准确率和一致性之间也存在6-16个百分点的差距,揭示了随机推理的现象。对中文的跨语言评估显示,大多数模型的表现下降了2-8个百分点,确认这些失败反映的是推理缺陷而非语言特定的伪影。BrainBench提供了一种细致的诊断工具,用于识别LLMs在何处以及为何将表面启发式替代真正的常识推理。
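The accuracy-consistency gap is a simple computation once each question has repeated runs. A sketch with simulated runs; defining consistency as "correct on all ten runs" is one plausible choice and may not match the benchmark's exact metric:

    import numpy as np

    rng = np.random.default_rng(1)
    n_q, n_runs = 100, 10
    p = rng.beta(4, 1.5, size=n_q)                  # per-question reliability
    runs = rng.random((n_q, n_runs)) < p[:, None]   # True = correct answer

    accuracy = runs.mean()                          # averaged over all runs
    consistency = runs.all(axis=1).mean()           # correct on every run
    print(f"accuracy={accuracy:.1%}  consistency={consistency:.1%}  "
          f"gap={accuracy - consistency:.1%}")      # stochastic reasoning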
cs.AI / 76 / 2603.14771

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

OpenHospital:一个“物自体”场域,用于演化和基准测试基于大型语言模型的集体智能
Liu, Peigen, Ding, Rui, Mao, Yuren, Jiang, Ziyan, Ye, Yuxiang, Gao, Yunjun, Zhang, Ying, Sun, Renjie, Lai, Longbin, Qian, Zhengping
Abstract
Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents. However, there is currently no dedicated arena for evolving and benchmarking LLM-based CI. To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents. This arena employs a data-in-agent-self paradigm that rapidly enhances agent capabilities and provides robust evaluation metrics for benchmarking both medical proficiency and system efficiency. Experiments demonstrate the effectiveness of OpenHospital in both fostering and quantifying CI.
Chinese Translation
基于大型语言模型(LLM)的集体智能(CI)提供了一种有前景的方法,以克服数据壁垒并持续提升LLM代理的能力。然而,目前尚无专门的场域用于演化和基准测试基于LLM的集体智能。为了填补这一空白,我们引入了OpenHospital,这是一个互动场域,医生代理可以通过与患者代理的互动来演化集体智能。该场域采用了数据内在于代理自身(data-in-agent-self)的范式,能够快速增强代理能力,并提供了稳健的评估指标,用于基准测试医疗专业能力和系统效率。实验证明了OpenHospital在培育和量化集体智能方面的有效性。
cs.AI / 77 / 2603.14805

Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

知识激活:人工智能技能作为自主软件开发的制度知识原始元素
Bakal, Gal
Abstract
Enterprise software organizations accumulate critical institutional knowledge -- architectural decisions, deployment procedures, compliance policies, incident playbooks -- yet this knowledge remains trapped in formats designed for human interpretation. The bottleneck to effective agentic software development is not model capability but knowledge architecture. When any knowledge consumer -- an autonomous AI agent, a newly onboarded engineer, or a senior developer -- encounters an enterprise task without institutional context, the result is guesswork, correction cascades, and a disproportionate tax on senior engineers who must manually supply what others cannot infer. This paper introduces Knowledge Activation, a framework that specializes AI Skills -- the open standard for agent-consumable knowledge -- into structured, governance-aware Atomic Knowledge Units (AKUs) for institutional knowledge delivery. Rather than retrieving documents for interpretation, AKUs deliver action-ready specifications encoding what to do, which tools to use, what constraints to respect, and where to go next -- so that agents act correctly and engineers receive institutionally grounded guidance without reconstructing organizational context from scratch. AKUs form a composable knowledge graph that agents traverse at runtime -- compressing onboarding, reducing cross-team friction, and eliminating correction cascades. The paper formalizes the resource constraints that make this architecture necessary, specifies the AKU schema and deployment architecture, and grounds long-term maintenance in knowledge commons practice. Organizations that architect their institutional knowledge for the agentic era will outperform those that invest solely in model capability.
Chinese Translation
企业软件组织积累了关键的制度知识——架构决策、部署程序、合规政策、事件应对手册——然而这些知识仍然被困在为人类解读而设计的格式中。有效的自主软件开发的瓶颈并不是模型能力,而是知识架构。当任何知识消费者——一个自主的人工智能代理、一名新入职的工程师或一位资深开发者——在没有制度背景的情况下遇到企业任务时,结果往往是猜测、纠正级联,以及对资深工程师的不成比例的负担,因为他们必须手动提供其他人无法推断的内容。本文介绍了知识激活(Knowledge Activation),一个将人工智能技能(AI Skills)——可供代理消费的知识开放标准——专门化为结构化、具有治理意识的原子知识单元(Atomic Knowledge Units, AKUs)以实现制度知识交付的框架。AKUs不仅仅是检索文件以供解读,而是提供可操作的规范,编码了该做什么、使用哪些工具、遵循哪些约束以及接下来该去哪里,以便代理能够正确行动,工程师能够在不必从头重建组织背景的情况下获得制度基础的指导。AKUs形成了一个可组合的知识图谱,代理在运行时可以遍历该图谱——压缩了入职培训,减少了跨团队摩擦,并消除了纠正级联。本文正式化了使这种架构成为必要的资源约束,具体说明了AKU模式和部署架构,并将长期维护与知识共享实践相结合。那些为自主时代架构其制度知识的组织将优于那些仅仅投资于模型能力的组织。
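One plausible shape for an AKU -- an action-ready record linked into a traversable graph -- is sketched below; the field names are our illustration, not the paper's schema:

    from dataclasses import dataclass, field

    @dataclass
    class AKU:
        id: str
        task: str                        # what to do
        tools: list[str]                 # which tools to use
        constraints: list[str]           # governance rules to respect
        next: list[str] = field(default_factory=list)  # where to go next

    deploy = AKU(
        id="aku/deploy-service",
        task="Deploy service X through the canary pipeline",
        tools=["ci.trigger", "k8s.rollout"],
        constraints=["requires change ticket", "no Friday deploys"],
        next=["aku/rollback-procedure", "aku/oncall-escalation"],
    )
    graph = {deploy.id: deploy}                 # composable knowledge graph
    print(graph["aku/deploy-service"].next)     # agents traverse at runtime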
cs.AI / 78 / 2603.14824

Planning as Goal Recognition: Deriving Heuristics from Intention Models - Extended Version

规划作为目标识别:从意图模型中推导启发式方法 - 扩展版
Rosa, Giacomo, Honorio, Jean, Lipovetzky, Nir, Sardina, Sebastian
Abstract
Classical planning aims to find a sequence of actions, a plan, that maps a starting state into one of the goal states. If a trajectory appears to be leading to the goal, should we prioritise exploring it? Seminal work in goal recognition (GR) has defined GR in terms of a classical planning problem, adopting classical solvers and heuristics to recognise plans. We come full circle, and study the adoption and properties of GR-derived heuristics for seeking solutions to classical planning problems. We propose a new framework for assessing goal intention, which informs a new class of efficiently-computable heuristics. As a proof of concept, we derive two such heuristics, and show that they can already yield improvements for top-scoring classical planners. Our work provides foundational knowledge for understanding and deriving probabilistic intention-based heuristics for planning.
Chinese Translation
经典规划旨在找到一个将起始状态映射到某个目标状态的行动序列,即一个计划。如果一条轨迹似乎正在朝着目标前进,我们是否应该优先探索它?目标识别(Goal Recognition, GR)的开创性工作将目标识别定义为经典规划问题,采用经典求解器和启发式方法来识别计划。我们回归本源,研究GR派生的启发式方法在求解经典规划问题中的应用及其特性。我们提出了一个新的框架来评估目标意图,由此得到一类新的可高效计算的启发式方法。作为概念验证,我们推导了两种这样的启发式方法,并展示它们已经能够为高评分的经典规划器带来改进。我们的工作为理解和推导基于概率意图的规划启发式方法提供了基础知识。
cs.AI / 79 / 2603.14869

A Self-Evolving Defect Detection Framework for Industrial Photovoltaic Systems

一种自我演化的工业光伏系统缺陷检测框架
He, Haoyu, Duan, Yu, Liu, Wenzhen, Hang, Hanyuan, Tuo, Qiantu, Yang, Xiaoke, Li, Rui
Abstract
Reliable photovoltaic (PV) power generation requires timely detection of module defects that may reduce energy yield, accelerate degradation, and increase lifecycle operation and maintenance costs during field operation. Electroluminescence (EL) imaging has therefore been widely adopted for PV module inspection. However, automated defect detection in real operational environments remains challenging due to heterogeneous module geometries, low-resolution imaging conditions, subtle defect morphology, long-tailed defect distributions, and continual data shifts introduced by evolving inspection and labeling processes. These factors significantly limit the robustness and long-term maintainability of conventional deep-learning inspection pipelines. To address these challenges, this paper proposes SEPDD, a Self-Evolving Photovoltaic Defect Detection framework designed for evolving industrial PV inspection scenarios. SEPDD integrates automated model optimization with a continual self-evolving learning mechanism, enabling the inspection system to progressively adapt to distribution shifts and newly emerging defect patterns during long-term deployment. Experiments conducted on both a public PV defect benchmark and a private industrial EL dataset demonstrate the effectiveness of the proposed framework. Both datasets exhibit severe class imbalance and significant domain shift. SEPDD achieves a leading mAP50 of 91.4% on the public dataset and 49.5% on the private dataset. It surpasses the autonomous baseline by 14.8% and human experts by 4.7% on the public dataset, and by 4.9% and 2.5%, respectively, on the private dataset.
Chinese Translation
可靠的光伏(PV)发电需要及时检测可能降低能量产出、加速衰退并增加现场操作和维护成本的模块缺陷。因此,电致发光(EL)成像已被广泛应用于光伏模块检测。然而,由于模块几何形状的异质性、低分辨率成像条件、微妙的缺陷形态、长尾缺陷分布以及不断变化的检查和标注过程引入的持续数据偏移,自动化缺陷检测在实际操作环境中仍然面临挑战。这些因素显著限制了传统深度学习检测管道的鲁棒性和长期可维护性。为了解决这些挑战,本文提出了SEPDD(自我演化光伏缺陷检测框架),旨在应对不断变化的工业光伏检测场景。SEPDD将自动化模型优化与持续自我演化学习机制相结合,使检测系统能够在长期部署过程中逐步适应分布偏移和新出现的缺陷模式。在公共光伏缺陷基准和私有工业EL数据集上进行的实验表明了所提框架的有效性。这两个数据集均表现出严重的类别不平衡和显著的领域偏移。SEPDD在公共数据集上实现了领先的mAP50为91.4%,在私有数据集上为49.5%。在公共数据集上,它比自主基线高出14.8%,比人工专家高出4.7%;在私有数据集上,分别高出4.9%和2.5%。
cs.AI / 80 / 2603.14876

A Hybrid AI and Rule-Based Decision Support System for Disease Diagnosis and Management Using Labs

基于混合人工智能和规则的疾病诊断与管理决策支持系统
Maqsood, Muhammad Hammad, Sajid, Mubashir, Ahmed, Khubaib, Shahid, Muhammad Usamah, Farooq, Muddassar
Abstract
This research paper outlines the development and implementation of a novel Clinical Decision Support System (CDSS) that integrates AI predictive modeling with medical knowledge bases. It utilizes the quantifiable information elements in lab results to infer likely diagnoses a patient might have, and subsequently suggests investigations to confirm them -- an assistive tool for physicians. The system fuses knowledge contained in a rule-based expert system with inferences of data-driven predictors based on the features in labs. The data for 593,055 patients was collected from 547 primary care centers across the US to model our decision support system and derive Real-World Evidence (RWE) to make it relevant for a large demographic of patients. Our Rule-Base comprises clinically validated rules, modeling 59 health conditions that can directly confirm one or more diseases and assign ICD-10 codes to them. The Likely Diagnosis system uses multi-class classification, covering 37 ICD-10 codes, which are grouped into 11 categories based on the labs that physicians prescribe to confirm the diagnosis. This research offers a novel system that assists a physician by utilizing the medical profile of a patient and routine lab investigations to predict a group of likely diseases and then confirm them, coupled with providing explanations for inferences, thereby helping physicians reduce misdiagnosis in clinical decision-making.
Chinese Translation
本研究论文概述了一种新型临床决策支持系统(CDSS)的开发与实施,该系统将人工智能预测建模与医学知识库相结合。它利用实验室结果中的可量化信息元素推断患者可能的诊断,并随后建议进一步检查以确认这些可能的诊断,从而为医生提供辅助工具。该系统融合了规则基础专家系统中的知识与基于实验室特征的数据驱动预测因子的推断。我们从美国547个初级保健中心收集了593,055名患者的数据,以构建我们的决策支持系统,并获取真实世界证据(RWE),使其对大规模患者群体具有相关性。我们的规则基础包含经过临床验证的规则,建模59种健康状况,这些状况可以直接确认一种或多种疾病,并为其分配ICD-10编码。可能诊断系统采用多类分类,涵盖37个ICD-10编码,这些编码根据医生为确认诊断而开具的实验室检查分为11个类别。本研究提供了一种新颖的系统,通过利用患者的医学档案和常规实验室检查,帮助医生预测一组可能的疾病并进行确认,同时提供推断的解释,从而协助医生减少临床决策中的误诊。
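The fusion logic -- validated rules confirm a diagnosis directly from labs, otherwise a multi-class predictor ranks likely groups -- fits in a few lines. The single rule, threshold, and codes below are illustrative placeholders, not clinical guidance:

    def rule_base(labs: dict[str, float]) -> str | None:
        if labs.get("hba1c", 0.0) >= 6.5:      # stand-in diabetes rule
            return "E11.9"                     # ICD-10 assigned by the rule
        return None

    def likely_diagnoses(labs: dict[str, float]) -> list[str]:
        # Placeholder for the multi-class classifier over 37 ICD-10 codes.
        return ["E78.5", "N18.3"]

    def decide(labs):
        confirmed = rule_base(labs)
        if confirmed:
            return {"confirmed": confirmed}
        return {"likely": likely_diagnoses(labs),
                "suggested_labs": ["lipid panel", "eGFR"]}

    print(decide({"hba1c": 7.1}))              # rule fires, diagnosis confirmed
    print(decide({"hba1c": 5.4}))              # falls through to the predictor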
cs.AI / 81 / 2603.14941

RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense Forecasting

RS-WorldModel:用于遥感理解和未来场景预测的统一模型
Xu, Linrui, Wang, Zhongan, Shen, Fei, Xu, Gang, Zhuang, Huiping, Li, Ming, Li, Haifeng
Abstract
Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120x larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
Chinese Translation
遥感世界模型旨在解释观察到的变化并预测可能的未来,这两个任务共享时空先验。然而,现有方法通常将它们分开处理,限制了跨任务的迁移能力。我们提出了RS-世界模型(RS-WorldModel),这是一个统一的遥感世界模型,能够共同处理时空变化理解和文本引导的未来场景预测,并构建了RSWBench-1.1M,这是一个包含丰富语言注释的110万样本数据集,涵盖这两个任务。RS-世界模型分三个阶段进行训练:(1)地理感知生成预训练(Geo-Aware Generative Pre-training, GAGP)基于地理和获取元数据进行预测;(2)协同指令调优(Synergistic Instruction Tuning, SIT)共同训练理解和预测;(3)可验证强化优化(Verifiable Reinforcement Optimization, VRO)通过可验证的任务特定奖励来优化输出。RS-世界模型仅用20亿参数,在大多数时空变化问答指标上超越了规模高达其120倍的开源模型。在文本引导的未来场景预测中,其FID达到43.13,超越了所有开源基准以及闭源的Gemini-2.5-Flash Image(Nano Banana)。
cs.AI / 82 / 2603.14975

Why Agents Compromise Safety Under Pressure

为何代理在压力下妥协安全性
Jiang, Hengle, Tang, Ke
Abstract
Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension emerging when compliant execution becomes infeasible. We demonstrate that under this pressure agents exhibit normative drift, strategically sacrificing safety to preserve utility. Notably, we find that advanced reasoning capabilities accelerate this decline, as models construct linguistic rationalizations to justify violations. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.
Chinese Translation
在复杂环境中部署的大型语言模型代理经常面临最大化目标实现与遵守安全约束之间的冲突。本文提出了一个新概念,称为代理压力(Agentic Pressure),它描述了在合规执行变得不可行时所产生的内生紧张。我们证明,在这种压力下,代理表现出规范漂移(normative drift),即它们战略性地牺牲安全性以保持效用。值得注意的是,我们发现,先进的推理能力加速了这种下降,因为模型构建了语言上的合理化来为违规行为辩护。最后,我们分析了根本原因,并探讨了初步的缓解策略,例如压力隔离(pressure isolation),该策略试图通过将决策过程与压力信号解耦来恢复一致性。
cs.AI / 83 / 2603.14992

Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

揭示短视频中假新闻检测的跨模态一致性
Tian, Chong, Wang, Yu, Yang, Chenxu, Guan, Junyi, Lin, Zheng, Liu, Yuhan, Chen, Xiuying, Ho, Qirong
Abstract
Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.
Chinese Translation
短视频平台是新闻传播的重要渠道,但也是多模态虚假信息的滋生地,其中每种模态单独看似合理,但跨模态关系却微妙地不一致,例如视觉与字幕的不匹配。在两个基准数据集FakeSV(中文)和FakeTT(英文)上,我们观察到明显的不对称性:真实视频表现出高文本-视觉一致性但中等文本-音频一致性,而假视频则呈现相反的模式。此外,单一的全局一致性得分形成了一个可解释的轴线,假概率和预测误差在该轴线上平滑变化。基于这些观察,我们提出了MAGIC3(Modal-Adversarial Gated Interaction and Consistency-Centric Classifier),一种明确建模和揭示跨三模态一致性信号的检测器,能够在多个粒度上进行处理。MAGIC3结合了显式的成对和全局一致性建模,利用来自跨模态注意力的标记和帧级一致性信号,结合多风格的LLM重写以获得风格鲁棒的文本表示,并采用不确定性感知分类器进行选择性VLM路由。使用预提取特征,MAGIC3在FakeSV和FakeTT上始终优于最强的非VLM基线。在匹配VLM级别准确性的同时,双阶段系统实现了18-27倍的更高吞吐量和93%的显存节省,提供了良好的成本-性能平衡。
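A minimal sketch of the pairwise and global consistency signals described above, assuming per-modality embeddings are already extracted; MAGIC3 itself adds token- and frame-level signals from cross-modal attention on top of this.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def consistency_features(text_emb, visual_emb, audio_emb):
    tv = cosine(text_emb, visual_emb)    # real videos: tends to be high
    ta = cosine(text_emb, audio_emb)     # real videos: tends to be moderate
    va = cosine(visual_emb, audio_emb)
    global_score = (tv + ta + va) / 3.0  # the interpretable global axis
    return {"tv": tv, "ta": ta, "va": va, "global": global_score}
```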
cs.AI / 84 / 2603.15017

Consequentialist Objectives and Catastrophe

后果主义目标与灾难
Marklund, Henrik, Infanger, Alex, Van Roy, Benjamin
Abstract
Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue. We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence. With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.
Chinese Translation
由于人类偏好过于复杂,无法进行编码,因此人工智能(AIs)在操作时常常面临目标设定不当的问题。优化这些目标往往会产生不良结果;这一现象被称为奖励黑客(reward hacking)。然而,这些结果不一定是灾难性的。实际上,之前文献中大多数奖励黑客的例子都是良性的。通常情况下,可以通过修改目标来解决这一问题。我们研究了在复杂环境中,人工智能操作可能引发的灾难性结果。我们认为,当能力足够先进时,追求固定的后果主义目标往往会导致灾难性结果。我们通过建立可证明导致此类结果的条件来形式化这一观点。在这些条件下,简单或随机的行为是安全的。灾难性风险的产生源于非凡的能力,而非无能。对于固定的后果主义目标,避免灾难需要限制人工智能的能力。实际上,适度限制能力不仅可以避免灾难,还能产生有价值的结果。我们的研究结果适用于任何由现代工业人工智能开发流程产生的目标。
cs.AI / 85 / 2603.15030

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

VTC-Bench:通过组合视觉工具链评估代理多模态模型
Zhu, Xuanyu, Dong, Yuhao, Wang, Rundong, Shi, Yang, Wu, Zhipeng, Peng, Yinlun, Zhang, YiFan, Lou, Yihang, Zhang, Yuanxing, Liu, Ziwei, Bai, Yan, Zhou, Yuan
Abstract
Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro only achieving 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
Chinese Translation
最近的进展将多模态大型语言模型(MLLMs)从标准的视觉问答扩展到利用外部工具进行高级视觉任务。尽管取得了这些进展,但精确执行和有效组合多样化工具以应对复杂任务仍然是一个持续的瓶颈。现有基准受限于稀疏的工具集和简单的工具使用轨迹,未能捕捉复杂和多样的工具交互,无法在实际的现实条件下评估模型性能。为了解决这一问题,我们引入了VisualToolChain-Bench(VTC-Bench),这是一个综合基准,旨在评估MLLMs中的工具使用能力。为了与现实的计算机视觉流程对齐,我们的框架包含32种多样的基于OpenCV的视觉操作。这一丰富的工具集使得广泛的组合成为可能,从而使VTC-Bench能够严格评估多工具组合和长时间跨度的多步骤计划执行。为了进行精确评估,我们提供了680个经过精心策划的问题,结构化在九个类别的认知层次中,每个问题都有真实的执行轨迹。在对19个领先的MLLMs进行的广泛实验中,揭示了当前模型在视觉代理能力方面的关键局限性。具体而言,模型在适应多样化工具集和推广到未见操作方面存在困难,领先模型Gemini-3.0-Pro在我们的基准上仅达到51%的表现。此外,多工具组合仍然是一个持续的挑战。在面对复杂任务时,模型难以制定有效的执行计划,过于依赖于狭窄的、次优的熟悉功能子集,而不是选择最佳工具。通过识别这些基本挑战,VTC-Bench建立了一个严格的基准,以指导更通用的视觉代理模型的发展。
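For flavor, here is the kind of compositional OpenCV tool chain the benchmark evaluates: each call is one tool, and correctness depends on ordering them sensibly. The three-step plan is a made-up example, not a benchmark task.

```python
import cv2

def run_chain(image_path: str):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # tool 1: convert to grayscale
    blur = cv2.GaussianBlur(gray, (5, 5), 0)      # tool 2: denoise
    return cv2.Canny(blur, 100, 200)              # tool 3: edge map
```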
cs.AI / 86 / 2603.15044

Prompt Readiness Levels (PRL): a maturity scale and scoring framework for production grade prompt assets

提示准备水平(PRL):用于生产级提示资产的成熟度尺度和评分框架
Guinard, Sebastien
Abstract
Prompt engineering has become a production-critical component of generative AI systems. However, organizations still lack a shared, auditable method to qualify prompt assets against operational objectives, safety constraints, and compliance requirements. This paper introduces Prompt Readiness Levels (PRL), a nine-level maturity scale inspired by TRL, and the Prompt Readiness Score (PRS), a multidimensional scoring method with gating thresholds designed to prevent weak-link failure modes. PRL/PRS provide an original, structured, and methodological framework for governing prompt asset specification, testing, traceability, security evaluation, and deployment readiness, enabling valuation of prompt engineering through reproducible qualification decisions across teams and industries.
Chinese Translation
提示工程已成为生成式人工智能系统中的关键生产组成部分。然而,组织仍然缺乏一种共享的、可审计的方法来根据运营目标、安全约束和合规要求对提示资产进行资格认证。本文介绍了提示准备水平(PRL),这是一个受到技术成熟度等级(TRL)启发的九级成熟度尺度,以及提示准备评分(PRS),这是一种具有门控阈值的多维评分方法,旨在防止弱环节失效模式。PRL/PRS提供了一个原创的、结构化的和方法论的框架,用于管理提示资产的规范、测试、可追溯性、安全评估和部署准备,从而通过跨团队和行业的可重复资格决策实现对提示工程的评估。
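A toy version of gated scoring helps illustrate the weak-link idea: the overall score is a weighted mean only when every dimension clears its gate, otherwise the weakest dimension caps readiness. Dimensions, weights, and thresholds below are assumptions, not the paper's calibration.

```python
GATES = {"safety": 0.7, "compliance": 0.7, "testing": 0.5, "traceability": 0.5}
WEIGHTS = {"safety": 0.35, "compliance": 0.25, "testing": 0.25, "traceability": 0.15}

def prompt_readiness_score(scores: dict) -> float:
    # Gating: a dimension below its threshold cannot be averaged away
    # by strong dimensions; the weak link caps the overall score.
    if any(scores[d] < gate for d, gate in GATES.items()):
        return min(scores[d] for d in GATES)
    return sum(w * scores[d] for d, w in WEIGHTS.items())
```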
cs.AI / 87 / 2603.15054

Interference-Aware K-Step Reachable Communication in Multi-Agent Reinforcement Learning

多智能体强化学习中的干扰感知K步可达通信
Cheng, Ziyu, Ren, Jinsheng, Jiang, Zhouxian, Li, Chenzhihang, Shi, Rongye, Liang, Bin, Yang, Jun
Abstract
Effective communication is pivotal for addressing complex collaborative tasks in multi-agent reinforcement learning (MARL). Yet, limited communication bandwidth and dynamic, intricate environmental topologies present significant challenges in identifying high-value communication partners. Agents must consequently select collaborators under uncertainty, lacking a priori knowledge of which partners can deliver task-critical information. To this end, we propose Interference-Aware K-Step Reachable Communication (IA-KRC), a novel framework that enhances cooperation via two core components: (1) a K-Step reachability protocol that confines message passing to physically accessible neighbors, and (2) an interference-prediction module that optimizes partner choice by minimizing interference while maximizing utility. Compared to existing methods, IA-KRC enables substantially more persistent and efficient cooperation despite environmental interference. Comprehensive evaluations confirm that IA-KRC achieves superior performance compared to state-of-the-art baselines, while demonstrating enhanced robustness and scalability in complex topological and highly dynamic multi-agent scenarios.
Chinese Translation
有效的通信对于解决多智能体强化学习(MARL)中的复杂协作任务至关重要。然而,有限的通信带宽和动态复杂的环境拓扑在识别高价值通信伙伴方面带来了重大挑战。代理必须在不确定的情况下选择合作者,缺乏关于哪些伙伴能够提供任务关键性信息的先验知识。为此,我们提出了干扰感知K步可达通信(IA-KRC),这是一个通过两个核心组件增强合作的新框架:(1) K步可达协议,将消息传递限制在物理上可达的邻居之间,以及(2) 干扰预测模块,通过最小化干扰同时最大化效用来优化伙伴选择。与现有方法相比,IA-KRC能够在环境干扰下实现更持久和高效的合作。全面评估确认IA-KRC在性能上优于最先进的基线,同时在复杂拓扑和高度动态的多智能体场景中表现出更强的鲁棒性和可扩展性。
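The K-step reachability protocol reduces, in the simplest reading, to a depth-bounded BFS over the current communication graph, with partner choice trading utility against predicted interference. The sketch below assumes callable `utility` and `interference` predictors and is illustrative only.

```python
from collections import deque

def k_step_reachable(adj: dict, src, k: int) -> set:
    """Agents reachable from src within k hops of the communication graph."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    seen.discard(src)
    return seen

def pick_partner(adj, src, k, utility, interference):
    # Restrict to physically reachable neighbors, then maximize
    # predicted utility minus predicted interference.
    candidates = k_step_reachable(adj, src, k)
    return max(candidates, key=lambda a: utility(a) - interference(a), default=None)
```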
cs.AI / 88 / 2603.15106

PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

PrototypeNAS:微控制器单元深度神经网络的快速设计
Deutel, Mark, Geis, Simon, Plinge, Axel
Abstract
Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that not only cuts out smaller DNNs from a single large architecture, but instead combines the structural optimization of multiple architecture types, as well as optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of Hypervolume subset selection to distill DNN architectures from the Pareto front of the multi-objective optimization that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to the performance of large DNN models.
Chinese Translation
在具有不同硬件约束的边缘设备上实现高效的深度神经网络(DNN)推理是一项具有挑战性的任务,通常需要为每个设备单独专门设计DNN架构。为了避免巨大的手动工作量,可以使用神经架构搜索(NAS)。然而,许多现有的NAS方法资源密集且耗时,因为它们需要从头开始训练许多不同的DNN。此外,它们没有考虑目标系统的资源约束。为了解决这些缺点,我们提出了PrototypeNAS,这是一种零样本NAS方法,用于加速和自动化DNN的选择、压缩和专门化,以适应不同的目标微控制器单元(MCUs)。我们提出了一种新颖的三步搜索方法,将DNN设计和专门化与给定目标平台的DNN训练解耦。首先,我们提出了一种新颖的搜索空间,不仅从单一大型架构中切出更小的DNN,还结合了多种架构类型的结构优化,以及它们的剪枝和量化配置的优化。其次,我们在优化过程中探索使用零样本代理的集成,而不是单一代理。第三,我们提出使用超体积子集选择,从代表准确性与FLOPs之间最有意义权衡的多目标优化的帕累托前沿中提炼DNN架构。我们在三个不同任务(图像分类、时间序列分类和目标检测)上的12个不同数据集上评估了PrototypeNAS的有效性。我们的结果表明,PrototypeNAS能够在几分钟内识别出足够小以便在现成MCUs上部署的DNN模型,并且仍能实现与大型DNN模型相当的准确性。
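The final selection step can be pictured as follows: keep the accuracy-vs-FLOPs Pareto front, then pick a small subset that maximizes 2D hypervolume against a FLOPs budget. The greedy selection here is a common approximation and only a sketch of the paper's hypervolume subset selection.

```python
def pareto_front(points):
    """Non-dominated subset of (accuracy, flops) pairs (acc up, FLOPs down)."""
    return [(a, f) for (a, f) in points
            if not any(a2 >= a and f2 <= f and (a2, f2) != (a, f)
                       for (a2, f2) in points)]

def hv2d(front, flops_ref: float) -> float:
    """2D hypervolume of a front w.r.t. a reference FLOPs budget (acc ref = 0)."""
    pts = sorted(front, key=lambda p: p[1])  # by FLOPs ascending
    return sum(acc * ((pts[i + 1][1] if i + 1 < len(pts) else flops_ref) - fl)
               for i, (acc, fl) in enumerate(pts))

def select_subset(front, size: int, flops_ref: float):
    # Greedily distill the architectures with the best accuracy/FLOPs tradeoffs.
    chosen = []
    for _ in range(min(size, len(front))):
        best = max((p for p in front if p not in chosen),
                   key=lambda p: hv2d(chosen + [p], flops_ref))
        chosen.append(best)
    return chosen
```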
cs.AI / 89 / 2603.15212

Modeling Matches as Language: A Generative Transformer Approach for Counterfactual Player Valuation in Football

将比赛建模为语言:一种用于足球反事实球员评估的生成变换器方法
Hong, Miru, Lee, Minho, Jo, Geonhee, Jo, Hyeokje, Bauer, Pascal, Ko, Sang-Ki
Abstract
Evaluating football player transfers is challenging because player actions depend strongly on tactical systems, teammates, and match context. Despite this complexity, recruitment decisions often rely on static statistics and subjective expert judgment, which do not fully account for these contextual factors. This limitation stems largely from the absence of counterfactual simulation mechanisms capable of predicting outcomes in hypothetical scenarios. To address these challenges, we propose ScoutGPT, a generative model that treats football match events as sequential tokens within a language modeling framework. Utilizing a NanoGPT-based Transformer architecture trained on next-token prediction, ScoutGPT learns the dynamics of match event sequences to simulate event sequences under hypothetical lineups, demonstrating superior predictive performance compared to existing baseline models. Leveraging this capability, the model employs Monte Carlo sampling to enable counterfactual simulation, allowing for the assessment of unobserved scenarios. Experiments on K League data show that simulated player transfers lead to measurable changes in offensive progression and goal probabilities, indicating that ScoutGPT captures player-specific impact beyond traditional static metrics.
Chinese Translation
评估足球球员转会具有挑战性,因为球员的行动在很大程度上依赖于战术体系、队友和比赛背景。尽管存在这种复杂性,招聘决策往往依赖于静态统计数据和主观专家判断,这些方法并未充分考虑这些背景因素。这一局限性主要源于缺乏能够预测假设场景结果的反事实模拟机制。为了解决这些挑战,我们提出了ScoutGPT,这是一种将足球比赛事件视为语言建模框架中顺序标记的生成模型。ScoutGPT利用基于NanoGPT的变换器架构进行下一个标记预测的训练,学习比赛事件序列的动态,以在假设阵容下模拟事件序列,显示出优于现有基准模型的预测性能。利用这一能力,该模型采用蒙特卡洛采样来实现反事实模拟,从而允许评估未观察到的场景。在K联赛数据上的实验表明,模拟的球员转会导致进攻推进和进球概率的可测变化,表明ScoutGPT捕捉到的球员特定影响超出了传统静态指标。
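The "match as language" idea and the counterfactual step can be sketched as follows: serialize events into tokens, edit the lineup in the context (e.g., swap a player id), and estimate outcome probabilities by Monte Carlo rollouts from the autoregressive model. The token scheme and `model.sample_next` API are assumptions.

```python
def event_token(ev: dict) -> str:
    # e.g. {"team": "H", "player": 7, "action": "pass"} -> "<H|7|pass>"
    return f"<{ev['team']}|{ev['player']}|{ev['action']}>"

def estimate_goal_prob(model, context_events, team="H",
                       n_rollouts=200, horizon=60) -> float:
    ctx = [event_token(e) for e in context_events]  # counterfactuals edit ctx
    goals = 0
    for _ in range(n_rollouts):
        seq = list(ctx)
        for _ in range(horizon):
            tok = model.sample_next(seq)  # assumed autoregressive sampler
            seq.append(tok)
            if tok.startswith(f"<{team}") and "goal" in tok:
                goals += 1
                break
    return goals / n_rollouts
```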
cs.AI / 90 / 2603.15220

InterPol: De-anonymizing LM Arena via Interpolated Preference Learning

InterPol:通过插值偏好学习对LM Arena进行去匿名化
Cho, Minsung, Kim, Jaehyung
Abstract
Strict anonymity of model responses is a key for the reliability of voting-based leaderboards, such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines in identification accuracy. Furthermore, we quantify the real-world threat of our findings through ranking manipulation simulations on Arena battle data.
Chinese Translation
模型响应的严格匿名性是基于投票的排行榜(如LM Arena)可靠性的关键。虽然先前的研究尝试使用简单的统计特征,如TF-IDF或词袋模型,来破坏这一假设,但这些方法往往缺乏区分风格相似或同一家族模型的判别能力。为克服这些局限性并揭示脆弱性的严重性,我们提出了INTERPOL,一个模型驱动的识别框架,通过插值偏好数据学习区分目标模型与其他模型。具体而言,INTERPOL通过模型插值合成困难的负样本,并采用自适应课程学习策略,捕捉到深层的风格模式,而这些模式是表面统计特征所忽略的。大量实验表明,INTERPOL在识别准确性上显著优于现有基线。此外,我们通过对Arena战斗数据的排名操控模拟量化了我们发现的现实威胁。
cs.AI / 91 / 2603.15226

SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing

SCAN:稀疏电路锚定可解释神经元用于终身知识编辑
Liu, Yuhuan, Zhong, Haitian, Xia, Xinyuan, Liu, Qiang, Wu, Shu, Wang, Liang
Abstract
Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from the prevailing dense editing paradigm, which treats models as black boxes and relies on coarse-grained parameter interventions that inevitably disrupt preserved knowledge. To address this, we propose SCAN (a sparse editing framework based on Sparse Circuit Anchored Neuron) which transforms editing into a mechanism-aware manipulation by constructing a knowledge circuit via Sparse Transcoders. Experiments on Gemma2, Qwen3, and Llama3.1 across CounterFact, ZsRE and WikiFactDiff demonstrate that SCAN achieves a superior performance, maintaining model integrity on benchmarks like MMLU and GSM8K even after 3,000 sequential edits, whereas other existing methods deteriorate progressively as editing accumulates, eventually resulting in model collapse.
Chinese Translation
大型语言模型(LLMs)在连续知识编辑过程中常常遭遇灾难性遗忘和崩溃。这种脆弱性源于当前普遍采用的密集编辑范式,该范式将模型视为黑箱,并依赖于粗粒度的参数干预,这不可避免地破坏了已保存的知识。为了解决这个问题,我们提出了SCAN(基于稀疏电路锚定神经元的稀疏编辑框架),通过构建知识电路并利用稀疏转码器将编辑转变为一种机制感知的操作。在CounterFact、ZsRE和WikiFactDiff上对Gemma2、Qwen3和Llama3.1的实验表明,SCAN在性能上优于其他方法,即使在经历3,000次连续编辑后,仍能在MMLU和GSM8K等基准测试中保持模型完整性,而其他现有方法随着编辑的累积逐渐恶化,最终导致模型崩溃。
cs.AI / 92 / 2603.15238

Why the Valuable Capabilities of LLMs Are Precisely the Unexplainable Ones

大型语言模型的宝贵能力恰恰是不可解释的能力
Cheng, Quan
Abstract
This paper proposes and argues for a counterintuitive thesis: the truly valuable capabilities of large language models (LLMs) reside precisely in the part that cannot be fully captured by human-readable discrete rules. The core argument is a proof by contradiction via expert system equivalence: if the full capabilities of an LLM could be described by a complete set of human-readable rules, then that rule set would be functionally equivalent to an expert system; but expert systems have been historically and empirically demonstrated to be strictly weaker than LLMs; therefore, a contradiction arises -- the capabilities of LLMs that exceed those of expert systems are exactly the capabilities that cannot be rule-encoded. This thesis is further supported by the Chinese philosophical concept of Wu (sudden insight through practice), the historical failure of expert systems, and a structural mismatch between human cognitive tools and complex systems. The paper discusses implications for interpretability research, AI safety, and scientific epistemology.
Chinese Translation
本文提出并论证了一个反直觉的论点:大型语言模型(LLMs)的真正宝贵能力恰恰存在于无法完全用人类可读的离散规则来捕捉的部分。核心论点是通过专家系统等价性的反证法:如果一个LLM的全部能力可以用一套完整的人类可读规则来描述,那么这套规则将在功能上等同于一个专家系统;然而,专家系统在历史上和经验上被证明在功能上严格弱于LLMs;因此,出现了矛盾——LLMs超越专家系统的能力正是那些无法用规则编码的能力。该论点还得到了中国哲学概念“悟”(通过实践获得的突然领悟)、专家系统的历史失败以及人类认知工具与复杂系统之间的结构不匹配的进一步支持。本文讨论了对可解释性研究、人工智能安全和科学认识论的影响。
cs.AI / 93 / 2603.15255

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

SAGE:用于大语言模型推理的多智能体自我进化
Peng, Yulin, Zhu, Xinxin, Wei, Chenxing, Zeng, Nianbo, Wang, Leilei, He, Ying Tiffany, Yu, F. Richard
Abstract
Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework where four agents: Challenger, Planner, Solver, and Critic, co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.
Chinese Translation
可验证奖励的强化学习提高了大语言模型(LLMs)的推理能力,但许多方法仍依赖于大量人工标注的数据集。虽然自我对弈减少了这种依赖,但通常缺乏明确的规划和强有力的质量控制,限制了在长时间跨度的多步骤推理中的稳定性。我们提出了SAGE(自我进化代理用于广义推理进化),这是一个闭环框架,其中四个代理:挑战者(Challenger)、规划者(Planner)、求解者(Solver)和评论者(Critic),通过仅使用少量种子集共同进化于共享的LLM基础上。挑战者不断生成越来越困难的任务;规划者将每个任务转化为结构化的多步骤计划;求解者按照计划生成答案,其正确性由外部验证者确定。评论者对生成的问题和计划进行评分和筛选,以防止课程漂移并保持训练信号质量,从而实现稳定的自我训练。在数学和代码生成基准测试中,SAGE在模型规模上提供了一致的增益,使Qwen-2.5-7B模型在LiveCodeBench上提高了8.9%,在OlympiadBench上提高了10.7%。
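One closed-loop round, as described, might look like the following schematic; every backbone method and the gate value are hypothetical stand-ins for the four co-evolving roles.

```python
def sage_round(backbone, seeds, verifier, buffer, gate=0.5):
    task = backbone.generate_task(seeds)           # Challenger: harder tasks
    if backbone.score_task(task) < gate:           # Critic: filter questions
        return
    plan = backbone.make_plan(task)                # Planner: structured steps
    if backbone.score_plan(task, plan) < gate:     # Critic: filter plans
        return
    answer = backbone.solve(task, plan)            # Solver: follow the plan
    if verifier(task, answer):                     # external verifier
        buffer.append((task, plan, answer))        # data for stable self-training
```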
cs.AI / 94 / 2603.15260

AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting

AGCD:基于代理的跨模态解码用于天气预报
Wu, Jing, Liu, Yang, Zhang, Lin, Zeng, Junbo, Wang, Jiabin, Ye, Zi, Li, Guowen, Cao, Shilei, Cheng, Jiashun, Wang, Fang, Jin, Meng, Feng, Yerong, Cheng, Hong, Lu, Yutong, Fu, Haohuan, Zheng, Juepeng
Abstract
Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively. To apply the priors effectively, AGCD further introduces cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625-degree and 1.40625-degree) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.
Chinese Translation
准确的天气预报不仅仅是逐网格回归:它必须保留一致的天气系统结构和气象场的物理一致性,特别是在自回归推断过程中,小的单步误差会放大成结构性偏差。现有的物理先验方法通常通过架构、正则化或数值天气预报(NWP)耦合施加全局的、一成不变的约束,在部署时提供的状态自适应性和样本特异性可控性有限。为了填补这一空白,我们提出了基于代理的跨模态解码(AGCD),这是一种插拔式的解码时先验注入范式,它从当前的多变量大气中推导出状态条件的物理先验,并以可控和可重用的方式注入到天气预报中。具体而言,我们设计了一个多代理气象叙述流程,以产生状态条件的物理先验,有效利用多模态大语言模型(MLLMs)提取各种气象元素。为了有效应用这些先验,AGCD进一步引入了跨模态区域交互解码,执行区域感知的多尺度标记化和高效的物理先验注入,从而在不改变主干接口的情况下精细化视觉特征。在WeatherBench数据集上的实验表明,AGCD在两种分辨率(5.625度和1.40625度)和不同的模型架构(通用和气象专用)上,对于6小时的天气预报均呈现出一致的提升,包括严格因果的48小时自回归推断,有效减少了早期阶段的误差积累,并提高了长期预测的稳定性。
cs.AI / 95 / 2603.15262

Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search

探测-再规划:面向环境的工业电子商务搜索规划
Chen, Mengxiang, Zhai, Zhouwei, Li, Jin
Abstract
Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on JD.com demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in JD.com's AI-Search system.
Chinese Translation
现代电子商务搜索正在不断发展,以解决复杂的用户意图。虽然大型语言模型(LLMs)提供了强大的推理能力,但现有的基于LLM的范式面临着根本的盲点-延迟困境:查询重写与检索能力和实时库存无关,导致无效的规划;相反,深度搜索代理依赖于迭代工具调用和反思,造成与工业亚秒预算不兼容的几秒延迟。为了解决这一冲突,我们提出了面向环境的搜索规划(EASP),将搜索规划重新定义为一种基于环境现实的动态推理过程。EASP引入了探测-再规划机制:一个轻量级的检索探测器揭示了检索快照,使规划者能够诊断执行差距并生成基于现实的搜索计划。该方法包括三个阶段:(1)离线数据合成:教师代理通过诊断探测环境合成多样化的、经过执行验证的计划。(2)规划者训练与对齐:规划者通过监督微调(SFT)进行初始化,以内化诊断能力,然后通过强化学习(RL)与业务结果(转化率)对齐。(3)自适应在线服务:一种复杂度感知的路由机制选择性地激活复杂查询的规划,确保资源的最佳分配。大量的离线评估和在京东(JD.com)进行的在线A/B测试表明,EASP显著提高了相关召回率,并在UCVR和GMV方面实现了显著提升。EASP已成功部署在京东的AI搜索系统中。
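Probe-then-Plan, stripped to its control flow, reads roughly as below; `probe`, `planner`, `retrieve`, and `is_complex` are assumed components standing in for the retrieval probe, the trained Planner, the retrieval stack, and the complexity-aware router.

```python
def easp_search(query, probe, planner, retrieve, is_complex):
    if not is_complex(query):          # complexity-aware routing:
        return retrieve(query)         # simple queries skip planning entirely
    snapshot = probe(query)            # lightweight probe exposes the retrieval
                                       # snapshot for this query
    plan = planner(query, snapshot)    # plan grounded in what the index
                                       # can actually return
    return retrieve(plan)
```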
cs.AI / 96 / 2603.15280

Advancing Multimodal Agent Reasoning with Long-Term Neuro-Symbolic Memory

通过长期神经符号记忆推动多模态智能体推理
Jiang, Rongjie, Wang, Jianwei, Zhao, Gengda, Luo, Chengyang, Wang, Kai, Zhang, Wenjie
Abstract
Recent advances in large language models have driven the emergence of intelligent agents operating in open-world, multimodal environments. To support long-term reasoning, such agents are typically equipped with external memory systems. However, most existing multimodal agent memories rely primarily on neural representations and vector-based retrieval, which are well-suited for inductive, intuitive reasoning but fundamentally limited in supporting the analytical, deductive reasoning critical for real-world decision making. To address this limitation, we propose NS-Mem, a long-term neuro-symbolic memory framework designed to advance multimodal agent reasoning by integrating neural memory with explicit symbolic structures and rules. Specifically, NS-Mem is organized around three core components of a memory system: (1) a three-layer memory architecture consisting of an episodic layer, a semantic layer, and a logic-rule layer; (2) a memory construction and maintenance mechanism, implemented by SK-Gen, that automatically consolidates structured knowledge from accumulated multimodal experiences and incrementally updates both neural representations and symbolic rules; and (3) a hybrid memory retrieval mechanism that combines similarity-based search with deterministic symbolic query functions to support structured reasoning. Experiments on real-world multimodal reasoning benchmarks demonstrate that NS-Mem achieves an average 4.35% improvement in overall reasoning accuracy over purely neural memory systems, with gains of up to 12.5% on constrained reasoning queries, validating the effectiveness of NS-Mem.
Chinese Translation
近期大型语言模型的进展推动了在开放世界多模态环境中运行的智能体的出现。为了支持长期推理,这些智能体通常配备外部记忆系统。然而,现有的大多数多模态智能体记忆主要依赖神经表示和基于向量的检索,这些方法适合归纳性、直观性的推理,但在支持分析性、演绎性推理方面存在根本性限制,而后者对于现实世界的决策至关重要。为了解决这一限制,我们提出了NS-Mem,一个长期神经符号记忆框架,旨在通过将神经记忆与显式符号结构和规则相结合,推动多模态智能体推理。具体而言,NS-Mem围绕记忆系统的三个核心组件运行:(1) 一个由情节层、语义层和逻辑规则层组成的三层记忆架构,(2) 由SK-Gen实现的记忆构建和维护机制,该机制自动整合来自累积的多模态经验的结构化知识,并逐步更新神经表示和符号规则,(3) 一种混合记忆检索机制,将基于相似性的搜索与确定性符号查询功能相结合,以支持结构化推理。在真实世界多模态推理基准上的实验表明,神经符号记忆在整体推理准确性上比纯神经记忆系统平均提高了4.35%,在受限推理查询上最高提升达12.5%,验证了NS-Mem的有效性。
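The hybrid retrieval component admits a compact sketch: a deterministic symbolic filter narrows candidates, then dense similarity ranks them. The memory layout and filter are illustrative assumptions.

```python
import numpy as np

def hybrid_retrieve(query_emb: np.ndarray, symbolic_filter, memories, top_k=5):
    # memories: list of {"emb": np.ndarray, "facts": dict}
    hits = [m for m in memories if symbolic_filter(m["facts"])]  # deterministic
    hits.sort(key=lambda m: -float(query_emb @ m["emb"]))        # neural ranking
    return hits[:top_k]

# Example symbolic constraint over structured attributes:
# symbolic_filter = lambda facts: facts.get("room") == "kitchen"
```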
cs.AI / 97 / 2603.15282

Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report

决策完全可观察非确定性问题状态安全性的算法:技术报告
Schmalz, Johannes, Jain, Chaahat
Abstract
Learned action policies are increasingly popular in sequential decision-making, but suffer from a lack of safety guarantees. Recent work introduced a pipeline for testing the safety of such policies under initial-state and action-outcome non-determinism. At the pipeline's core is the problem of deciding whether a state is safe (a safe policy exists from the state) and finding faults, which are state-action pairs that transition from a safe state to an unsafe one. Their most effective algorithm for deciding safety, TarjanSafe, is effective on their benchmarks, but we show that it has exponential worst-case runtime with respect to the state space. A linear-time alternative exists, but it is slower in practice. We close this gap with a new policy-iteration algorithm, iPI, that combines the best of both: it matches TarjanSafe's best-case runtime while guaranteeing a polynomial worst-case runtime. Experiments confirm our theory and show that on problems amenable to TarjanSafe, iPI has similar performance, whereas on ill-suited problems iPI scales exponentially better.
Chinese Translation
学习的行动策略在顺序决策中越来越受欢迎,但缺乏安全性保障。最近的研究提出了一种测试此类策略在初始状态和行动结果非确定性下安全性的流程。该流程的核心是决定一个状态是否安全(从该状态存在安全策略)以及寻找故障,即从安全状态转移到不安全状态的状态-行动对。他们最有效的安全性决策算法 TarjanSafe 在其基准测试中表现良好,但我们证明它在状态空间方面具有指数级的最坏情况运行时间。虽然存在一种线性时间的替代方案,但在实际应用中速度较慢。我们通过一种新的策略迭代算法 iPI 来填补这一差距,该算法结合了两者的优点:它在最优情况下与 TarjanSafe 的运行时间相匹配,同时保证了多项式的最坏情况。实验验证了我们的理论,并显示在适合 TarjanSafe 的问题中,iPI 的性能相似,而在不适合的问题中,iPI 的扩展性显著更好。
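Deciding safety admits a standard greatest-fixpoint construction, sketched below: a state is safe iff it is not forbidden and some action keeps every non-deterministic outcome inside the safe set. iPI refines this with policy iteration; this naive loop only illustrates the decision problem and the notion of a fault.

```python
def safe_states(states, actions, outcomes, unsafe):
    """Greatest fixpoint: s is safe iff s is not unsafe and some action
    keeps all of its non-deterministic outcomes inside the safe set."""
    safe = set(states) - set(unsafe)
    changed = True
    while changed:
        changed = False
        for s in list(safe):
            if not any(set(outcomes(s, a)) <= safe for a in actions(s)):
                safe.discard(s)
                changed = True
    return safe

def find_faults(safe, actions, outcomes):
    # Faults: state-action pairs that can leave the safe region.
    return [(s, a) for s in safe for a in actions(s)
            if not set(outcomes(s, a)) <= safe]
```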
cs.AI / 98 / 2603.15297

Evolutionary Transfer Learning for Dragonchess

用于龙棋的进化迁移学习
O'Connor, Jim, Hoag, Annika, Goyette, Sarah, Parker, Gary B.
Abstract
Dragonchess, a three-dimensional chess variant introduced by Gary Gygax, presents unique strategic and computational challenges that make it an ideal environment for studying the transfer of artificial intelligence (AI) heuristics across domains. In this work, we introduce Dragonchess as a novel testbed for AI research and provide an open-source, Python-based game engine for community use. Our research investigates evolutionary transfer learning by adapting heuristic evaluation functions directly from Stockfish, a leading chess engine, and subsequently optimizing them using Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Initial trials showed that direct heuristic transfers were inadequate due to Dragonchess's distinct multi-layer structure and movement rules. However, evolutionary optimization significantly improved AI agent performance, resulting in superior gameplay demonstrated through empirical evaluation in a 50-round Swiss-style tournament. This research establishes the effectiveness of evolutionary methods in adapting heuristic knowledge to structurally complex, previously unexplored game domains.
Chinese Translation
龙棋是由加里·吉盖克斯(Gary Gygax)引入的一种三维棋类变体,具有独特的战略和计算挑战,使其成为研究人工智能(AI)启发式方法在不同领域迁移的理想环境。在本研究中,我们将龙棋作为AI研究的新测试平台,并提供一个基于Python的开源游戏引擎供社区使用。我们的研究通过直接从领先的国际象棋引擎Stockfish适应启发式评估函数,并随后使用协方差矩阵适应进化策略(CMA-ES)进行优化,探讨了进化迁移学习。初步试验表明,由于龙棋独特的多层结构和移动规则,直接的启发式迁移是不够的。然而,进化优化显著提高了AI代理的表现,通过在50轮瑞士制比赛中的实证评估展示了卓越的游戏表现。本研究确立了进化方法在适应结构复杂、以前未探索的游戏领域中的启发式知识的有效性。
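The evolutionary step can be reproduced in outline with the `cma` package: start from the transferred heuristic weights and optimize them against self-play results. The `evaluate` callback (win rate from games) is an assumed stand-in for the paper's tournament evaluation.

```python
import cma  # pip install cma

def evolve_heuristic(seed_weights, evaluate, n_gens=50, sigma0=0.3):
    # seed_weights: heuristic weights transferred from a chess engine
    # evaluate(weights) -> win rate from self-play games (assumed evaluator)
    es = cma.CMAEvolutionStrategy(list(seed_weights), sigma0)
    for _ in range(n_gens):
        candidates = es.ask()
        es.tell(candidates, [-evaluate(w) for w in candidates])  # minimize -win rate
    return es.result.xbest
```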
cs.AI / 99 / 2603.15341

Intelligent Co-Design: An Interactive LLM Framework for Interior Spatial Design via Multi-Modal Agents

智能协同设计:通过多模态代理的室内空间设计交互式大语言模型框架
Lim, Ren Jian, Dai, Rushi
Abstract
In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements in generative layout tools narrow the gap by automating 3D visualizations. However, prevailing methodologies exhibit limitations: rule-based systems implement hard-coded spatial constraints that restrict participatory engagement, while data-driven models rely on extensive training datasets. Recent large language models (LLMs) bridge this gap by enabling intuitive reasoning about spatial relationships through natural language. This research presents an LLM-based, multimodal, multi-agent framework that dynamically converts natural language descriptions and imagery into 3D designs. Specialized agents (Reference, Spatial, Interactive, Grader), operating via prompt guidelines, collaboratively address core challenges: the agent system enables real-time user interaction for iterative spatial refinement, while Retrieval-Augmented Generation (RAG) reduces data dependency without requiring task-specific model training. This framework accurately interprets spatial intent and generates optimized 3D indoor design, improving productivity, and encouraging nondesigner participation. Evaluations across diverse floor plans and user questionnaires demonstrate effectiveness. An independent LLM evaluator consistently rated participatory layouts higher in user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicated 77% satisfaction and a clear preference over traditional design software. These findings suggest the framework enhances user-centric communication and fosters more inclusive, effective, and resilient design processes. Project page: https://rsigktyper.github.io/AICodesign/
Chinese Translation
在建筑室内设计中,客户缺乏设计知识常常导致沟通不畅,而设计师在解释复杂空间关系时也面临困难,这导致了项目进度延误和经济损失。近期生成布局工具的进展缩小了这一差距,通过自动化三维可视化来提升设计沟通。然而,现有的方法存在局限性:基于规则的系统实施硬编码的空间约束,限制了参与性互动,而数据驱动模型则依赖于庞大的训练数据集。最近的大语言模型(LLMs)通过自然语言使得对空间关系的直观推理成为可能,从而弥补了这一不足。本研究提出了一种基于LLM的多模态多代理框架,能够动态地将自然语言描述和图像转换为三维设计。专门的代理(参考代理、空间代理、交互代理、评分代理)通过提示指南协同工作,解决核心挑战:代理系统实现了实时用户互动,以便进行迭代空间优化,而检索增强生成(RAG)则减少了对数据的依赖,无需特定任务的模型训练。该框架准确解读空间意图并生成优化的三维室内设计,提高了生产力,并鼓励非设计师的参与。对多种平面图和用户问卷的评估表明了其有效性。一位独立的LLM评估者在用户意图一致性、美学连贯性、功能性和流通性方面始终给予参与性布局更高的评分。问卷结果显示77%的满意度,并明确偏好于传统设计软件。这些发现表明该框架增强了以用户为中心的沟通,促进了更具包容性、有效性和韧性的设计过程。项目页面:https://rsigktyper.github.io/AICodesign/
cs.AI / 100 / 2603.15351

PMAx: An Agentic Framework for AI-Driven Process Mining

PMAx:一种基于代理的人工智能驱动过程挖掘框架
Antonov, Anton, Kourani, Humam, Berti, Alessandro, Park, Gyunam, van der Aalst, Wil M. P.
Abstract
Process mining provides powerful insights into organizational workflows, but extracting these insights typically requires expertise in specialized query languages and data science tools. Large Language Models (LLMs) offer the potential to democratize process mining by enabling business users to interact with process data through natural language. However, using LLMs as direct analytical engines over raw event logs introduces fundamental challenges: LLMs struggle with deterministic reasoning and may hallucinate metrics, while sending large, sensitive logs to external AI services raises serious data-privacy concerns. To address these limitations, we present PMAx, an autonomous agentic framework that functions as a virtual process analyst. Rather than relying on LLMs to generate process models or compute analytical results, PMAx employs a privacy-preserving multi-agent architecture. An Engineer agent analyzes event-log metadata and autonomously generates local scripts to run established process mining algorithms, compute exact metrics, and produce artifacts such as process models, summary tables, and visualizations. An Analyst agent then interprets these insights and artifacts to compile comprehensive reports. By separating computation from interpretation and executing analysis locally, PMAx ensures mathematical accuracy and data privacy while enabling non-technical users to transform high-level business questions into reliable process insights.
Chinese Translation
过程挖掘为组织工作流程提供了强有力的洞察,但提取这些洞察通常需要在专业查询语言和数据科学工具方面的专业知识。大型语言模型(LLMs)有潜力通过使业务用户能够通过自然语言与过程数据进行交互,从而实现过程挖掘的民主化。然而,将LLMs作为原始事件日志的直接分析引擎引入了根本性的挑战:LLMs在确定性推理方面存在困难,并可能产生虚假的指标,同时将大量敏感日志发送到外部人工智能服务也引发了严重的数据隐私问题。为了解决这些局限性,我们提出了PMAx,这是一种作为虚拟过程分析师的自主代理框架。PMAx并不依赖LLMs生成过程模型或计算分析结果,而是采用了一种保护隐私的多代理架构。工程代理分析事件日志元数据,并自主生成本地脚本以运行已建立的过程挖掘算法,计算精确指标,并生成诸如过程模型、汇总表和可视化等工件。随后,分析代理解释这些洞察和工件,以编制全面的报告。通过将计算与解释分离并在本地执行分析,PMAx确保了数学准确性和数据隐私,同时使非技术用户能够将高层次的业务问题转化为可靠的过程洞察。
cs.AI / 101 / 2603.15364

CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving

CRASH:用于自动驾驶安全隐患的认知推理代理
Silva, Erick, Yasmin, Rehana, Shoker, Ali
Abstract
As AVs grow in complexity and diversity, identifying the root causes of operational failures has become increasingly complex. The heterogeneity of system architectures across manufacturers, ranging from end-to-end to modular designs, together with variations in algorithms and integration strategies, limits the standardization of incident investigations and hinders systematic safety analysis. This work examines real-world AV incidents reported in the NHTSA database. We curate a dataset of 2,168 cases reported between 2021 and 2025, representing more than 80 million miles driven. To process this data, we introduce CRASH, Cognitive Reasoning Agent for Safety Hazards, an LLM-based agent that automates reasoning over crash reports by leveraging both standardized fields and unstructured narrative descriptions. CRASH operates on a unified representation of each incident to generate concise summaries, attribute a primary cause, and assess whether the AV materially contributed to the event. Our findings show that (1) CRASH attributes 64% of incidents to perception or planning failures, underscoring the importance of reasoning-based analysis for accurate fault attribution; and (2) approximately 50% of reported incidents involve rear-end collisions, highlighting a persistent and unresolved challenge in autonomous driving deployment. We further validate CRASH with five domain experts, achieving 86% accuracy in attributing AV system failures. Overall, CRASH demonstrates strong potential as a scalable and interpretable tool for automated crash analysis, providing actionable insights to support safety research and the continued development of autonomous driving systems.
Chinese Translation
随着自动驾驶汽车(AV)的复杂性和多样性不断增加,识别操作失败的根本原因变得愈加复杂。不同制造商之间系统架构的异质性,从端到端到模块化设计,以及算法和集成策略的变化,限制了事件调查的标准化,并阻碍了系统化的安全分析。本研究考察了在美国国家公路交通安全管理局(NHTSA)数据库中报告的真实世界AV事件。我们整理了2021年至2025年间报告的2,168个案例数据集,代表了超过8000万英里行驶里程。为了处理这些数据,我们引入了CRASH(Cognitive Reasoning Agent for Safety Hazards),一个基于大型语言模型(LLM)的代理,通过利用标准化字段和非结构化叙述描述,自动化地对碰撞报告进行推理。CRASH在每个事件的统一表示上操作,以生成简明的摘要,归因主要原因,并评估AV是否对事件产生了实质性贡献。我们的研究结果表明:(1)CRASH将64%的事件归因于感知或规划失败,强调了基于推理的分析在准确归因中的重要性;(2)约50%的报告事件涉及追尾碰撞,突显了自动驾驶部署中持续存在且未解决的挑战。我们进一步通过五位领域专家验证CRASH,在归因AV系统故障方面达到了86%的准确率。总体而言,CRASH展示了作为一种可扩展和可解释的自动化碰撞分析工具的强大潜力,为支持安全研究和自动驾驶系统的持续发展提供了可行的见解。
cs.AI / 102 / 2603.15371

Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning

基于大脑启发的图形多智能体系统用于大语言模型推理
Hao, Guangfu, Dai, Yuming, Qin, Xianzhe, Yu, Shan
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of language tasks, yet complex multi-step reasoning remains a fundamental challenge. While Large Reasoning Models (LRMs) equipped with extended chain-of-thought mechanisms demonstrate improved performance over standard LLMs, both model types still suffer from accuracy collapse on sufficiently complex tasks, suggesting that scaling model-level reasoning alone is insufficient. Inspired by the global workspace theory of human cognition, we propose Brain-Inspired Graph Multi-Agent Systems (BIGMAS), in which specialized LLM agents are organized as nodes in a dynamically constructed directed graph and coordinate exclusively through a centralized shared workspace. A problem-adaptive GraphDesigner constructs task-specific agent topologies, while a global Orchestrator leverages the complete shared state for routing decisions, overcoming the local-view bottleneck of reactive approaches. Experiments on Game24, Six Fives, and Tower of London across six frontier LLMs demonstrate that BIGMAS consistently improves reasoning performance for both standard LLMs and LRMs, outperforming existing multi-agent baselines including ReAct and Tree of Thoughts, showing that multi-agent architectural design provides complementary gains orthogonal to model-level reasoning enhancements.
Chinese Translation
大型语言模型(LLMs)在广泛的语言任务中展现了显著的能力,但复杂的多步骤推理仍然是一个基本挑战。尽管配备了扩展思维链机制的大型推理模型(LRMs)在性能上优于标准LLMs,但这两种模型类型在足够复杂的任务上仍然遭遇准确性崩溃,表明单靠扩展模型级推理是不够的。受到人类认知的全球工作空间理论的启发,我们提出了基于大脑启发的图形多智能体系统(BIGMAS),在该系统中,专门的LLM智能体被组织为动态构建的有向图中的节点,并通过集中共享工作空间进行协调。一个适应问题的GraphDesigner构建任务特定的智能体拓扑,而一个全球协调者(Orchestrator)利用完整的共享状态进行路由决策,克服了反应性方法的局部视图瓶颈。在六个前沿LLM上对Game24、Six Fives和伦敦塔的实验表明,BIGMAS在标准LLMs和LRMs的推理性能上始终有所提升,超越了包括ReAct和Tree of Thoughts在内的现有多智能体基准,显示出多智能体架构设计提供了与模型级推理增强正交的互补增益。
cs.AI / 103 / 2603.15381

Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science

为什么人工智能系统无法实现自主学习及其解决方案:来自认知科学的自主学习启示
Dupoux, Emmanuel, LeCun, Yann, Malik, Jitendra
Abstract
We critically examine the limitations of current AI models in achieving autonomous learning and propose a learning architecture inspired by human and animal cognition. The proposed framework integrates learning from observation (System A) and learning from active behavior (System B) while flexibly switching between these learning modes as a function of internally generated meta-control signals (System M). We discuss how this could be built by taking inspiration on how organisms adapt to real-world, dynamic environments across evolutionary and developmental timescales.
Chinese Translation
我们批判性地审视当前人工智能模型在实现自主学习方面的局限性,并提出一种受人类和动物认知启发的学习架构。该框架整合了观察学习(System A)和主动行为学习(System B),并根据内部生成的元控制信号(System M)灵活切换这两种学习模式。我们讨论了如何借鉴生物体在进化和发展时间尺度上适应现实世界动态环境的方式来构建这一框架。
cs.AI / 104 / 2603.15411

A Hybrid Modeling Framework for Crop Prediction Tasks via Dynamic Parameter Calibration and Multi-Task Learning

通过动态参数校准和多任务学习的作物预测任务混合建模框架
Solow, William, Pesantez-Cabrera, Paola, Keller, Markus, Khot, Lav, Saisubramanian, Sandhya, Fern, Alan
Abstract
Accurate prediction of crop states (e.g., phenology stages and cold hardiness) is essential for timely farm management decisions such as irrigation, fertilization, and canopy management to optimize crop yield and quality. While traditional biophysical models can be used for season-long predictions, they lack the precision required for site-specific management. Deep learning methods are a compelling alternative, but can produce biologically unrealistic predictions and require large-scale data. We propose a hybrid modeling approach that uses a neural network to parameterize a differentiable biophysical model and leverages multi-task learning for efficient data sharing across crop cultivars in data-limited settings. By predicting the parameters of the biophysical model, our approach improves prediction accuracy while preserving biological realism. Empirical evaluation using real-world and synthetic datasets demonstrates that our method improves prediction accuracy by 60% for phenology and 40% for cold hardiness compared to deployed biophysical models.
Chinese Translation
准确预测作物状态(例如,物候阶段和抗寒性)对于及时的农场管理决策至关重要,例如灌溉、施肥和冠层管理,以优化作物产量和质量。虽然传统的生物物理模型可以用于整个季节的预测,但它们缺乏针对特定地点管理所需的精确性。深度学习方法是一个引人注目的替代方案,但可能产生生物学上不现实的预测,并且需要大规模数据。我们提出了一种混合建模方法,该方法使用神经网络对可微分的生物物理模型进行参数化,并利用多任务学习在数据有限的情况下实现作物品种之间的高效数据共享。通过预测生物物理模型的参数,我们的方法在提高预测准确性的同时保持了生物学的现实性。使用真实世界和合成数据集的实证评估表明,与已部署的生物物理模型相比,我们的方法在物候预测上提高了60%的准确性,在抗寒性预测上提高了40%的准确性。
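The core idea, a network that predicts the parameters of a differentiable biophysical model, can be sketched as below. The toy thermal-time phenology update is a placeholder for the paper's crop models; layer sizes and the two-parameter form are assumptions.

```python
import torch
import torch.nn as nn

class ParamNet(nn.Module):
    """Maps site/weather features to positive biophysical parameters."""
    def __init__(self, n_features: int, n_params: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                 nn.Linear(32, n_params), nn.Softplus())

    def forward(self, x):
        return self.net(x)  # e.g. (base_temp, rate) per sample

def phenology_rollout(params, daily_temp):
    # params: (batch, 2); daily_temp: (batch, days)
    base, rate = params[:, 0:1], params[:, 1:2]
    gdd = torch.clamp(daily_temp - base, min=0.0).cumsum(dim=1)  # thermal time
    return rate * gdd  # differentiable stage trajectory over the season
```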
cs.AI / 105 / 2603.15434

Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

倾听回声:基于用户反应感知的标量-语言混合强化学习策略优化
Ye, Jing, Zhao, Xinpei, Xiang, Lu, Zhang, Yaping, Zong, Chengqing
Abstract
While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user's continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
Chinese Translation
当前的情感支持对话系统通常依赖专家定义的标量奖励进行对齐,但这些信号存在严重的信息稀疏性。它们无法解释响应失败的原因或如何适应动态用户状态,往往偏离促进积极情感转变的实际目标。在实践中,最直接和可靠的学习信号来自于用户在持续互动过程中的连续反应。因此,我们提出了反应感知策略优化(Reaction Aware Policy Optimization, RAPO),这是一个优化互动结果而非评分标准的框架。RAPO将对话视为一个由反应驱动的过程,并利用模拟用户反应通过三个核心组件生成密集的自然语言反馈:事后对话选择(Hindsight Dialogue Selection),它隔离出那些有意义地改变用户情感轨迹的关键转折;生成事后反馈(Generative Hindsight Feedback),将用户反应转化为对比排名信号和自然语言批评;以及标量-语言混合策略优化(Scalar-Verbal Hybrid Policy Optimization),它将标量奖励优化与语言反馈提炼结合,以实现全局对齐与细粒度语义优化。在ESC和Sotopia上的大量实验表明,RAPO在推动积极互动结果方面显著优于强大的强化学习基线。
cs.AI / 106 / 2603.15452

Unlocking the Value of Text: Event-Driven Reasoning and Multi-Level Alignment for Time Series Forecasting

解锁文本的价值:基于事件的推理与多层次对齐在时间序列预测中的应用
Wang, Siyuan, Chen, Peng, Wang, Yihang, Qiu, Wanghui, Guo, Chenjuan, Yang, Bin, Shu, Yang
Abstract
Existing time series forecasting methods primarily rely on the numerical data itself. However, real-world time series exhibit complex patterns associated with multimodal information, making them difficult to predict with numerical data alone. While several multimodal time series forecasting methods have emerged, they either utilize text with limited supplementary information or focus merely on representation extraction, extracting minimal textual information for forecasting. To unlock the Value of Text, we propose VoT, a method with Event-driven Reasoning and Multi-level Alignment. Event-driven Reasoning combines the rich information in exogenous text with the powerful reasoning capabilities of LLMs for time series forecasting. To guide the LLMs in effective reasoning, we propose the Historical In-context Learning that retrieves and applies historical examples as in-context guidance. To maximize the utilization of text, we propose Multi-level Alignment. At the representation level, we utilize the Endogenous Text Alignment to integrate the endogenous text information with the time series. At the prediction level, we design the Adaptive Frequency Fusion to fuse the frequency components of event-driven prediction and numerical prediction to achieve complementary advantages. Experiments on real-world datasets across 10 domains demonstrate significant improvements over existing methods, validating the effectiveness of our approach in the utilization of text. The code is made available at https://github.com/decisionintelligence/VoT.
Chinese Translation
现有的时间序列预测方法主要依赖于数值数据本身。然而,现实世界中的时间序列展现出与多模态信息相关的复杂模式,仅凭数值数据难以进行准确预测。尽管出现了一些多模态时间序列预测方法,但它们要么利用文本信息有限,要么仅专注于表示提取,从而提取的文本信息对预测的贡献微乎其微。为了充分挖掘文本的价值,我们提出了VoT(Value of Text),一种结合基于事件的推理和多层次对齐的方法。基于事件的推理将外部文本中的丰富信息与大型语言模型(LLMs)强大的推理能力结合起来,以进行时间序列预测。为了引导LLMs进行有效推理,我们提出了历史上下文学习(Historical In-context Learning),该方法检索并应用历史示例作为上下文指导。为了最大化文本的利用,我们提出了多层次对齐。在表示层面,我们利用内生文本对齐(Endogenous Text Alignment)将内生文本信息与时间序列整合。在预测层面,我们设计了自适应频率融合(Adaptive Frequency Fusion),以融合基于事件的预测和数值预测的频率成分,实现互补优势。在10个领域的真实世界数据集上的实验表明,我们的方法在文本利用上显著优于现有方法,验证了其有效性。代码已在 https://github.com/decisionintelligence/VoT 上公开。
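Adaptive Frequency Fusion, in its simplest reading, blends the two forecasts per frequency bin; a fixed-weight sketch is shown below, whereas VoT learns the weights adaptively.

```python
import numpy as np

def frequency_fusion(pred_event: np.ndarray, pred_numeric: np.ndarray,
                     w: np.ndarray) -> np.ndarray:
    # w: per-frequency weights in [0, 1], length = len(rfft output)
    fe = np.fft.rfft(pred_event)     # event-driven forecast spectrum
    fn = np.fft.rfft(pred_numeric)   # numerical forecast spectrum
    fused = w * fe + (1.0 - w) * fn  # complementary per-bin blend
    return np.fft.irfft(fused, n=len(pred_event))
```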
cs.AI / 107 / 2603.15473

Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI Agents

智能体生命周期工具包 (ALTK):用于构建稳健AI智能体的可重用中间件组件
Wright, Zidane, Tsay, Jason, Murthi, Anupama, Elhadad, Osher, Del Rio, Diego, Goyal, Saurabh, Kate, Kiran, Laredo, Jim, Lazar, Koren, Muthusamy, Vinod, Rizk, Yara
Abstract
As AI agents move from demos into enterprise deployments, their failure modes become consequential: a misinterpreted tool argument can corrupt production data, a silent reasoning error can go undetected until damage is done, and outputs that violate organizational policy can create legal or compliance risk. Yet, most agent frameworks leave builders to handle these failure modes ad hoc, resulting in brittle, one-off safeguards that are hard to reuse or maintain. We present the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically address these gaps across the full agent lifecycle. Across the agent lifecycle, we identify opportunities to intervene and improve, namely, post-user-request, pre-LLM prompt conditioning, post-LLM output processing, pre-tool validation, post-tool result checking, and pre-response assembly. ALTK provides modular middleware that detects, repairs, and mitigates common failure modes. It offers consistent interfaces that fit naturally into existing pipelines. It is compatible with low-code and no-code tools such as the ContextForge MCP Gateway and Langflow. Finally, it significantly reduces the effort of building reliable, production-grade agents.
Chinese Translation
随着AI智能体从演示向企业部署转变,它们的失败模式变得至关重要:错误理解工具参数可能导致生产数据损坏,静默推理错误可能不被发现直到造成损害,违反组织政策的输出可能造成法律或合规风险。然而,大多数智能体框架让构建者随意处理这些失败模式,导致脆弱且一次性的保护措施,难以重用或维护。我们提出了智能体生命周期工具包 (ALTK),这是一个开源的模块化中间件组件集合,系统性地解决了智能体整个生命周期中的这些缺口。在智能体生命周期中,我们识别了干预和改进的机会,即在用户请求后、LLM提示调优前、LLM输出处理后、工具验证前、工具结果检查后和响应组装前。ALTK提供了模块化中间件,能够检测、修复和缓解常见的失败模式。它提供了一致的接口,自然融入现有的数据处理流程。它与低代码和无代码工具兼容,例如ContextForge MCP Gateway和Langflow。最后,ALTK显著减少了构建可靠的生产级智能体所需的努力。
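The interception pattern behind the six lifecycle points can be sketched as a chain of hook handlers; the hook names mirror the abstract, but the interfaces are illustrative, not ALTK's actual API.

```python
HOOKS = ["post_user_request", "pre_llm", "post_llm",
         "pre_tool", "post_tool", "pre_response"]

class Middleware:
    def handle(self, hook: str, payload: dict) -> dict:
        return payload  # default: pass through unchanged

class ToolArgValidator(Middleware):
    def handle(self, hook, payload):
        if hook == "pre_tool" and not payload.get("args_valid", True):
            payload["blocked"] = True  # stop a misinterpreted tool call
        return payload

def run_hook(components, hook: str, payload: dict) -> dict:
    for component in components:                    # each middleware may
        payload = component.handle(hook, payload)   # detect, repair, mitigate
    return payload
```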
cs.AI / 108 / 2603.15483

Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

交谈、评估、诊断:基于用户的智能体评估与自动化错误分析
Chong, Penny, Abichandani, Harshavardhan, Shen, Jiyuan, Ghosh, Atin, Moe, Min Pyae, Mai, Yifan, Dahlmeier, Daniel
Abstract
Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex match, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals, such as tool signatures and responses, as natural-language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes the inconsistencies of the judge and agents, uncovering common errors and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.
Chinese Translation
智能体应用程序在自动化多样化任务的工作流程中被越来越广泛地采用。然而,由于它们所操作的领域异构,创建一个可扩展的评估框架面临挑战。以往的研究各自采用不同的方法来确定任务成功,例如数据库查找、正则表达式匹配等,这增加了开发统一智能体评估方法的复杂性。此外,它们没有系统地考虑用户在交互中的角色或专业知识,从而提供了对智能体性能的不完整洞察。我们认为,有效的智能体评估不仅仅局限于正确性,还应包括对话质量、效率以及智能体错误的系统诊断。为了解决这个问题,我们提出了TED框架(交谈、评估、诊断)。(1) 交谈:我们利用可重用的通用专家和非专家用户角色模板来进行用户与智能体的交互。(2) 评估:我们通过将子目标(如工具签名和响应)表示为自然语言评分笔记,自动化地使用大型语言模型(LLM)进行评估,来改编现有数据集。我们提出了新的指标,既捕捉了回合效率,又反映了智能体在用户感知设置下的中间进展。(3) 诊断:我们引入了一种自动化错误分析工具,分析评审者与智能体之间的不一致性,揭示常见错误,并提供可操作的反馈以改进智能体。我们展示了TED框架揭示了关于智能体性能的新见解,涵盖了不同模型和用户专业知识水平。我们还证明,在将识别出的错误修正措施纳入智能体设计后,我们提出的指标在智能体性能上有潜在的提升,峰值可达8-10%。
cs.AI / 109 / 2603.15500

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

通过不确定性下的战略信息分配理解大语言模型中的推理
Kim, Jeonghye, Luo, Xufang, Kim, Minbeom, Lee, Sangmook, Li, Dongsheng, Yang, Yuqing
Abstract
LLMs often exhibit Aha moments during reasoning, such as apparent self-correction following tokens like "Wait," yet their underlying mechanisms remain unclear. We introduce an information-theoretic framework that decomposes reasoning into procedural information and epistemic verbalization -- the explicit externalization of uncertainty that supports downstream control actions. We show that purely procedural reasoning can become informationally stagnant, whereas epistemic verbalization enables continued information acquisition and is critical for achieving information sufficiency. Empirical results demonstrate that strong reasoning performance is driven by uncertainty externalization rather than specific surface tokens. Our framework unifies prior findings on Aha moments and post-training experiments, and offers insights for future reasoning model design.
Chinese Translation
大语言模型(LLMs)在推理过程中常常表现出“恍然大悟”的时刻,例如在出现“等一下”等标记后明显的自我修正,然而其背后的机制仍不清楚。我们提出了一个信息论框架,将推理分解为程序性信息和认知语言化——即支持下游控制行为的不确定性显性外化。我们展示了纯粹的程序性推理可能会变得信息停滞,而认知语言化则能够持续获取信息,并且对实现信息充分性至关重要。实证结果表明,强大的推理表现是由不确定性外化驱动的,而不是特定的表面标记。我们的框架统一了关于“恍然大悟”时刻和后训练实验的先前发现,并为未来的推理模型设计提供了见解。
cs.AI / 110 / 2603.15527

Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

大语言模型的对齐中的困境和冲突是否可解决?来自优先图的视角
Tang, Zhenheng, Liu, Xiang, Wang, Qian, Choi, Eunsol, Li, Bo, Chu, Xiaowen
Abstract
As Large Language Models (LLMs) become more powerful and autonomous, they increasingly face conflicts and dilemmas in many scenarios. We first summarize and taxonomize these diverse conflicts. Then, we model the LLM's preferences to make different choices as a priority graph, where instructions and values are nodes, and the edges represent context-specific priorities determined by the model's output distribution. This graph reveals that a unified, stable LLM alignment is very challenging, because the graph is neither static nor necessarily consistent across contexts. It also reveals a potential vulnerability: priority hacking, where adversaries can craft deceptive contexts to manipulate the graph and bypass safety alignments. To counter this, we propose a runtime verification mechanism, enabling LLMs to query external sources to ground their context and resist manipulation. While this approach enhances robustness, we also acknowledge that many ethical and value dilemmas are philosophically irreducible, posing a long-term, open challenge for the future of AI alignment.
Chinese Translation
随着大语言模型(LLMs)变得越来越强大和自主,它们在许多场景中面临越来越多的冲突和困境。我们首先总结并对这些多样化的冲突进行分类。然后,我们将LLM的偏好建模为一个优先图,其中指令和值作为节点,边缘代表由模型输出分布确定的特定上下文优先级。该图揭示了一个统一的稳定LLM对齐非常具有挑战性,因为此图在不同上下文中既不是静态的,也不一定是一致的。此外,它还揭示了一种潜在的脆弱性:优先级劫持,其中对手可以构造欺骗性的上下文来操纵图并绕过安全对齐。为了应对这一问题,我们提出了一种运行时验证机制,使LLMs能够查询外部来源来确定它们的上下文并抵御操纵。尽管这种方法增强了鲁棒性,我们也承认许多伦理和价值困境在哲学上是不可化简的,对AI对齐的未来构成了长期的开放挑战。
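The priority-graph formalism is easy to make concrete: nodes are instructions and values, and a directed edge records which one the model prioritizes in a given context; a 2-cycle within one context signals a dilemma or a hacked priority. This toy encoding is illustrative only.

```python
from collections import defaultdict

def build_priority_graph(preferences):
    # preferences: iterable of (context, preferred, dispreferred) triples
    g = defaultdict(set)
    for ctx, u, v in preferences:
        g[ctx].add((u, v))  # edge u -> v: u outranks v in context ctx
    return g

def is_consistent(g, ctx) -> bool:
    # Inconsistent if both u > v and v > u hold in the same context,
    # the kind of instability priority hacking exploits.
    edges = g.get(ctx, set())
    return not any((v, u) in edges for (u, v) in edges)
```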
cs.AI / 111 / 2603.15586

Computational Concept of the Psyche

心灵的计算概念
Kolonin, Anton, Krykov, Vladimir
Abstract
This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes agent's needs, taking into account their biological or existential significance for the intelligent agent, along with agent's sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.
Chinese Translation
本文概述了在构建人工心灵的背景下对人类心灵建模的各种方法。基于这一概述,提出了一种认知架构的概念,其中心灵被视为生物或人工主体的操作系统,包含一个状态空间,包括决定主体在外部世界刺激下存在意义的需求状态,以及作为决策系统的智能,负责在这个世界中采取行动以满足这些需求。基于这一概念,提出了一种计算形式化方法,用于通过经验学习在包含主体需求的状态空间中创建人工通用智能系统,同时考虑这些需求对智能主体的生物或存在意义,以及主体的感知和行动。因此,构建人工通用智能的问题被形式化为在不确定条件下针对特定主体需求空间做出最佳决策的系统,旨在最大化实现目标的成功率,最小化存在风险,并最大化能量效率。本文还展示了该模型的最小实验实现。
cs.AI / 112 / 2603.15594

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

OpenSeeker:通过完全开源训练数据实现前沿搜索代理的民主化
Du, Yuwen, Ye, Rui, Tang, Shuo, Zhu, Xinyu, Lu, Yijun, Cai, Yuzhu, Chen, Siheng
Abstract
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby prompting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
Chinese Translation
深度搜索能力已成为前沿大型语言模型(LLM)代理不可或缺的能力,然而,由于缺乏透明且高质量的训练数据,高性能搜索代理的开发仍然被工业巨头主导。这种持续的数据稀缺从根本上阻碍了更广泛研究社区在该领域的发展和创新。为了解决这一问题,我们推出了OpenSeeker,这是第一个完全开源的搜索代理(即模型和数据),通过两个核心技术创新实现了前沿水平的性能:(1)基于事实的可扩展可控问答合成,通过拓扑扩展和实体混淆反向工程网络图,以生成具有可控覆盖率和复杂性的复杂多跳推理任务;(2)去噪轨迹合成,采用回顾性总结机制去噪轨迹,从而促进教师LLM生成高质量的动作。实验结果表明,OpenSeeker在仅用11.7k合成样本进行单次训练的情况下,在多个基准测试中实现了最先进的性能,包括BrowseComp、BrowseComp-ZH、xbench-DeepSearch和WideSearch。值得注意的是,使用简单的SFT训练的OpenSeeker显著超越了第二好的完全开源代理DeepDive(例如,在BrowseComp上为29.5%对15.3%),甚至在BrowseComp-ZH上超越了工业竞争对手如Tongyi DeepResearch(经过广泛的持续预训练、SFT和RL训练,结果为48.4%对46.7%)。我们完全开源了完整的训练数据集和模型权重,以实现前沿搜索代理研究的民主化,并促进一个更加透明、协作的生态系统。
cs.AI / 113 / 2603.15607

Do Metrics for Counterfactual Explanations Align with User Perception?

反事实解释的度量与用户感知是否一致?
Liedeker, Felix, Ell, Basil, Cimiano, Philipp, Düsing, Christoph
Abstract
Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.
Chinese Translation
可解释性被广泛认为是可信赖的人工智能系统的关键。然而,常用来评估反事实解释的度量标准主要是算法评估指标,这些指标很少经过与人类对解释质量判断的验证。这引发了一个问题:这些指标是否真正反映了用户的感知。我们通过一项实证研究来解决这个问题,该研究直接比较了算法评估指标与人类判断在三个数据集上的表现。参与者在多个感知质量维度上对反事实解释进行了评分,我们将这些评分与一套全面的标准反事实指标相关联。我们分析了各个关系以及指标组合在预测人类评估中的有效性。我们的结果表明,算法指标与人类评分之间的相关性通常较弱,并且强烈依赖于数据集。此外,增加用于预测模型的指标数量并未带来可靠的改善,表明当前指标在捕捉与人类相关的标准方面存在结构性局限。总体而言,我们的研究结果表明,广泛使用的反事实评估指标未能反映用户感知的解释质量的关键方面,强调了评估可解释人工智能时需要更以人为中心的方法。
计算语言学 (Computation and Language)
108
cs.CL / 1 / 2603.13230

Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting

基于上下文的俚语推理增强:贪婪搜索引导的思维链提示
Cao, Jinghan, Ren, Qingyang, Chen, Xiangyun, Li, Xinjin, Gao, Haoxiang, Zhao, Yu
Abstract
Slang interpretation has been a challenging downstream task for Large Language Models (LLMs) as the expressions are inherently embedded in contextual, cultural, and linguistic frameworks. In the absence of domain-specific training data, it is difficult for LLMs to accurately interpret slang meaning based on lexical information. This paper attempts to investigate the challenges of slang inference using LLMs and presents a greedy search-guided chain-of-thought framework for slang interpretation. Through our experiments, we conclude that the model size and temperature settings have limited impact on inference accuracy. Transformer-based models with more active parameters do not achieve higher accuracy than smaller models. Based on the results of the above empirical study, we integrate greedy search algorithms with chain-of-thought prompting for small language models to build a framework that improves the accuracy of slang interpretation. The experimental results indicate that our proposed framework demonstrates improved accuracy in slang meaning interpretation. These findings contribute to the understanding of context dependency in language models and provide a practical solution for enhancing slang comprehension through a structured reasoning prompting framework.
Chinese Translation
俚语解释一直是大型语言模型(LLMs)面临的一个具有挑战性的下游任务,因为这些表达本质上嵌入在上下文、文化和语言框架中。在缺乏特定领域训练数据的情况下,LLMs很难仅凭词汇信息准确解释俚语的含义。本文旨在探讨使用大型LLMs进行俚语推理的挑战,并提出一种贪婪搜索引导的思维链框架用于俚语解释。通过我们的实验,我们得出结论:模型规模和温度设置对推理准确性影响有限。基于Transformer的模型虽然具有更大的活跃参数,但其准确性并不高于较小的模型。基于上述实证研究的结果,我们将贪婪搜索算法与思维链提示结合,构建了一个框架,以提高小型语言模型的俚语解释准确性。实验结果表明,我们提出的框架在俚语含义解释方面显示出更高的准确性。这些发现有助于理解语言模型中的上下文依赖性,并为通过结构化推理提示框架增强俚语理解提供了实用解决方案。
cs.CL / 2 / 2603.13249

Steering at the Source: Style Modulation Heads for Robust Persona Control

源头引导:用于稳健个性控制的风格调制头
Izawa, Yoshihiro, Minegishi, Gouki, Eguchi, Koshi, Hosokawa, Sosuke, Taura, Kenjiro
Abstract
Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While effectively controlling target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.
Chinese Translation
激活引导提供了一种计算高效的机制,用于在不进行微调的情况下控制大型语言模型(LLMs)。尽管有效地控制目标特征(例如,个性),但连贯性下降仍然是安全性和实际部署的主要障碍。我们假设这种下降源于对残差流的干预,这种干预无差别地影响聚合特征,并无意中放大了非目标噪声。在本研究中,我们识别出一组稀疏的注意力头(仅三个头),它们独立地控制个性和风格的形成,我们称之为风格调制头。具体而言,这些头可以通过对内部表示的几何分析进行定位,结合层级余弦相似度和头级贡献分数。我们证明,仅针对这些特定头的干预能够实现稳健的行为控制,同时显著减轻在残差流引导中观察到的连贯性下降。更广泛地说,我们的研究结果表明,精确的组件级定位能够实现更安全、更精确的模型控制。
cs.CL / 3 / 2603.13256

Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems

无训练的自主智能体AI:多智能体大语言模型系统中的概率控制与协调
Hosseini, Mohammad Parsa, Shah, Ankit, Qureshi, Saiyra, Huang, Alex, Miao, Connie, Wei, Wei
Abstract
Multi-agent large language model (LLM) systems enable complex, long-horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training-free controller for multi-agent LLM collaboration that improves routing efficiency during recursive delegation. REDEREF integrates (i) belief-guided delegation via Thompson sampling to prioritize agents with historically positive marginal contributions, (ii) reflection-driven re-routing using a calibrated LLM or programmatic judge, (iii) evidence-based selection rather than output averaging, and (iv) memory-aware priors to reduce cold-start inefficiency. Across multi-agent split-knowledge tasks, we show that while recursive retry alone saturates task success, belief-guided routing reduces token usage by 28%, agent calls by 17%, and time-to-success by 19% compared to random recursive delegation, and adapts gracefully under agent or judge degradation. These results demonstrate that simple, interpretable probabilistic control can meaningfully improve the efficiency and robustness of multi-agent LLM systems without training or fine-tuning.
Chinese Translation
多智能体大语言模型(LLM)系统通过组合专业化的智能体实现复杂的长时间推理,但实际部署仍受到低效路由、噪声反馈和高交互成本的制约。我们提出了REDEREF,这是一种轻量级且无需训练的多智能体LLM协作控制器,旨在提高递归委派过程中的路由效率。REDEREF集成了以下几个方面:(i) 通过汤普森采样进行信念引导的委派,以优先考虑历史上具有积极边际贡献的智能体;(ii) 使用经过校准的LLM或程序化评判者进行反思驱动的重新路由;(iii) 基于证据的选择而非输出平均;(iv) 记忆感知的先验,以减少冷启动低效。在多智能体分知识任务中,我们展示了虽然单独的递归重试会使任务成功率饱和,但信念引导的路由相比随机递归委派可减少28%的令牌使用、17%的智能体调用和19%的成功时间,并在智能体或评判者退化的情况下优雅适应。这些结果表明,简单且可解释的概率控制可以在无需训练或微调的情况下,显著提高多智能体LLM系统的效率和鲁棒性。
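To make the belief-guided delegation concrete: Thompson sampling over agents reduces to keeping a posterior over each agent's chance of a positive marginal contribution and routing to the agent with the highest sampled draw. A minimal Python sketch, assuming Beta-Bernoulli beliefs (the abstract names Thompson sampling but not the posterior family, so the Beta choice, the agent names, and the toy success rates below are all illustrative):

import random

class ThompsonRouter:
    """Belief-guided delegation: keep a Beta posterior over each agent's
    success rate and route to the agent with the highest sampled value."""

    def __init__(self, agent_ids):
        # Beta(1, 1) prior = uniform belief about each agent's usefulness.
        self.beliefs = {a: [1.0, 1.0] for a in agent_ids}

    def pick(self):
        # Thompson sampling: one draw per agent, delegate to the max.
        samples = {a: random.betavariate(al, be)
                   for a, (al, be) in self.beliefs.items()}
        return max(samples, key=samples.get)

    def update(self, agent, contributed: bool):
        # Positive marginal contribution -> alpha += 1, else beta += 1.
        self.beliefs[agent][0 if contributed else 1] += 1.0

router = ThompsonRouter(["retriever", "coder", "planner"])
true_rates = {"retriever": 0.7, "coder": 0.5, "planner": 0.3}
for step in range(20):
    agent = router.pick()
    # A real system would run the agent here; a coin flip stands in.
    router.update(agent, random.random() < true_rates[agent])
print(router.beliefs)

Sampling from the posterior, rather than taking its mean, is what keeps under-explored agents in play while a clearly better agent still dominates routing.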
cs.CL / 4 / 2603.13259

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

变压器如何拒绝错误答案:事实约束处理的旋转动态
Marín, Javier
Abstract
When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations - a direction to be probed, a feature to be extracted. Less is known about the dynamics: how internal representations diverge across the full depth of the network when the model processes correct versus incorrect continuations. We introduce forced-completion probing, a method that presents identical queries with known correct and incorrect single-token continuations and tracks five geometric measurements across every layer of four decoder-only models (1.5B-13B parameters). We report three findings. First, correct and incorrect paths diverge through rotation, not rescaling: displacement vectors maintain near-identical magnitudes while their angular separation increases, meaning factual selection is encoded in direction on an approximate hypersphere. Second, the model does not passively fail on incorrect input - it actively suppresses the correct answer, driving internal probability away from the right token. Third, both phenomena are entirely absent below a parameter threshold and emerge at 1.6B, suggesting a phase transition in factual processing capability. These results show that factual constraint processing has a specific geometric character - rotational, not scalar; active, not passive - that is invisible to methods based on single-layer probes or magnitude comparisons.
Chinese Translation
当语言模型接收到错误答案时,网络内部发生了什么?当前的理解将真实性视为单层表示的静态属性——一个需要探测的方向,一个需要提取的特征。关于动态过程的了解较少:当模型处理正确与错误的延续时,内部表示在网络的整个深度上是如何发散的。我们引入了强制完成探测(forced-completion probing)方法,该方法以已知的正确和错误单标记延续呈现相同的查询,并跟踪四个仅解码器模型(参数量为1.5B-13B)每一层的五个几何测量。我们报告了三个发现。首先,正确和错误路径通过旋转而非重新缩放发散:位移向量保持近乎相同的大小,而它们的角度分离增大,这意味着事实选择在近似超球面上以方向编码。其次,模型并不是在错误输入上被动失败——它主动抑制正确答案,将内部概率驱离正确标记。第三,这两种现象在参数阈值以下完全不存在,并在1.6B时出现,暗示了事实处理能力的相变。这些结果表明,事实约束处理具有特定的几何特征——旋转的,而非标量的;主动的,而非被动的——这一特征对于基于单层探测或大小比较的方法是不可见的。
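The rotation-versus-rescaling claim comes down to two per-layer numbers: the angle between the displacement vectors induced by the correct and incorrect continuations, and the ratio of their magnitudes. A numpy sketch of that measurement, assuming hidden states have already been extracted at the answer position (the baseline choice and array shapes are ours; the paper tracks five geometric measurements, of which these are two):

import numpy as np

def rotation_vs_rescaling(h_correct, h_incorrect, h_base):
    """Per-layer comparison of correct vs. incorrect continuations.

    h_*: arrays of shape (n_layers, d) holding the hidden state at the
    answer position for the correct continuation, the incorrect one, and
    a shared baseline (e.g., the state before the answer token).
    Returns per-layer angular separation (degrees) and magnitude ratio.
    """
    d_cor = h_correct - h_base      # displacement from the correct token
    d_inc = h_incorrect - h_base    # displacement from the wrong token
    cos = np.sum(d_cor * d_inc, axis=1) / (
        np.linalg.norm(d_cor, axis=1) * np.linalg.norm(d_inc, axis=1))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    mag_ratio = np.linalg.norm(d_inc, axis=1) / np.linalg.norm(d_cor, axis=1)
    return angle, mag_ratio

# Toy check with random states; "rotation" would show angle growing with
# depth while mag_ratio stays near 1.
rng = np.random.default_rng(0)
layers, dim = 24, 64
angle, ratio = rotation_vs_rescaling(rng.normal(size=(layers, dim)),
                                     rng.normal(size=(layers, dim)),
                                     rng.normal(size=(layers, dim)))
print(angle[:3], ratio[:3])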
cs.CL / 5 / 2603.13260

Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation

用你自己的话解释:通过选择性标记双重知识蒸馏改善推理能力
Kim, Minsang, Baek, Seung Jun
Abstract
Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain-of-Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision, causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. The experiments show the state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD-KD.
Chinese Translation
知识蒸馏(Knowledge Distillation, KD)可以将大型模型的推理能力转移到较小的模型上,从而减少生成推理任务链式思维的成本。KD 方法通常要求学生模仿教师在整个输出上的分布。然而,具有有限能力的学生可能会被如此广泛的监督所压倒,导致分布不匹配,尤其是在复杂的推理任务中。我们提出了选择性标记双重知识蒸馏(Token-Selective Dual Knowledge Distillation, TSD-KD),这是一个以学生为中心的蒸馏框架。TSD-KD 专注于蒸馏推理的重要标记,并鼓励学生用自己的话解释推理过程。TSD-KD 结合了间接蒸馏和直接蒸馏。间接蒸馏使用基于偏好排序的弱反馈形式。学生提出自己生成的候选响应;教师对这些候选项进行重新排序,作为间接反馈,而不强制执行其整个分布。直接蒸馏则使用分布匹配;然而,它根据教师和学生之间的相对置信度选择性地蒸馏标记。最后,我们添加了熵正则化,以在蒸馏过程中保持学生的信心。总体而言,我们的方法为学生提供了有针对性和间接的反馈,以支持其自身的推理过程并促进自我改进。实验表明,TSD-KD 在 10 个具有挑战性的推理基准测试中表现出最先进的性能,在准确性上分别超越了基线和亚军高达 54.4% 和 40.3%。值得注意的是,经过 TSD-KD 训练的学生在四个案例中甚至超越了其自身的教师模型,提升幅度高达 20.3%。源代码可在 https://github.com/kmswin1/TSD-KD 获取。
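The direct-distillation half lends itself to a short sketch. A hedged PyTorch version, assuming the selection rule is "distill only where the teacher is more confident than the student on the reference token"; the paper's exact rule, margin, and loss weighting are not given in the abstract, so every choice below is illustrative:

import torch
import torch.nn.functional as F

def token_selective_kd_loss(t_logits, s_logits, targets, margin=0.0, lam=0.01):
    """Distill only tokens where the teacher beats the student's confidence
    on the reference token (a guess at the selection rule), plus an entropy
    bonus that keeps the student from collapsing during distillation.

    t_logits, s_logits: (batch, seq, vocab); targets: (batch, seq).
    """
    t_logp = F.log_softmax(t_logits, dim=-1)
    s_logp = F.log_softmax(s_logits, dim=-1)
    t_conf = t_logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    s_conf = s_logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    select = (t_conf - s_conf) > margin            # teacher knows better here
    kl = F.kl_div(s_logp, t_logp, log_target=True,
                  reduction="none").sum(-1)        # per-token KL(teacher||student)
    kd = (kl * select).sum() / select.sum().clamp(min=1)
    entropy = -(s_logp.exp() * s_logp).sum(-1).mean()
    return kd - lam * entropy                      # entropy regularization term

t = torch.randn(2, 5, 100)
s = torch.randn(2, 5, 100)
y = torch.randint(0, 100, (2, 5))
print(token_selective_kd_loss(t, s, y))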
cs.CL / 6 / 2603.13625

Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets

危机相关合成推文数据集的代理工作流程设计与评估
Reyes, Roben Delos, Douglas, Timothy, Kitamoto, Asanobu
Abstract
Twitter (now X) has become an important source of social media data for situational awareness during crises. Crisis informatics research has widely used tweets from Twitter to develop and evaluate artificial intelligence (AI) systems for various crisis-relevant tasks, such as extracting locations and estimating damage levels from tweets to support damage assessment. However, recent changes in Twitter's data access policies have made it increasingly difficult to curate real-world tweet datasets related to crises. Moreover, existing curated tweet datasets are limited to past crisis events in specific contexts and are costly to annotate at scale. These limitations constrain the development and evaluation of AI systems used in crisis informatics. To address these limitations, we introduce an agentic workflow for generating crisis-related synthetic tweet datasets. The workflow iteratively generates synthetic tweets conditioned on prespecified target characteristics, evaluates them using predefined compliance checks, and incorporates structured feedback to refine them in subsequent iterations. As a case study, we apply the workflow to generate synthetic tweet datasets relevant to post-earthquake damage assessment. We show that the workflow can generate synthetic tweets that capture their target labels for location and damage level. We further demonstrate that the resulting synthetic tweet datasets can be used to evaluate AI systems on damage assessment tasks like geolocalization and damage level prediction. Our results indicate that the workflow offers a flexible and scalable alternative to real-world tweet data curation, enabling the systematic generation of synthetic social media data across diverse crisis events, societal contexts, and crisis informatics applications.
Chinese Translation
Twitter(现为 X)已成为危机期间情境感知的重要社交媒体数据来源。危机信息学研究广泛使用 Twitter 上的推文来开发和评估用于各种危机相关任务的人工智能(AI)系统,例如从推文中提取位置和估计损害程度以支持损害评估。然而,Twitter 最近的数据访问政策的变化使得策划与危机相关的真实推文数据集变得越来越困难。此外,现有的策划推文数据集仅限于特定背景下的过去危机事件,并且在大规模注释时成本高昂。这些限制制约了在危机信息学中使用的 AI 系统的开发和评估。为了解决这些限制,我们引入了一种用于生成危机相关合成推文数据集的代理工作流程。该工作流程基于预先指定的目标特征迭代生成合成推文,使用预定义的合规检查对其进行评估,并在后续迭代中结合结构化反馈进行优化。作为案例研究,我们应用该工作流程生成与地震后损害评估相关的合成推文数据集。我们展示了该工作流程能够生成捕捉其目标标签(位置和损害程度)的合成推文。我们进一步证明,生成的合成推文数据集可以用于评估 AI 系统在损害评估任务中的表现,例如地理定位和损害程度预测。我们的结果表明,该工作流程为真实推文数据策划提供了一种灵活且可扩展的替代方案,使得能够系统地生成适用于多种危机事件、社会背景和危机信息学应用的合成社交媒体数据。
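Stripped of the LLM specifics, the workflow is a generate-evaluate-refine loop over structured compliance checks. A schematic sketch with toy stand-ins for the three LLM-backed calls (all names and the toy checks below are invented for illustration):

def agentic_generation(generate, check, refine, target, max_iters=5):
    # Draft, run compliance checks, feed structured failures back, repeat.
    draft = generate(target)
    for _ in range(max_iters):
        failures = check(draft, target)
        if not failures:
            return draft, True          # all compliance checks passed
        draft = refine(draft, failures)
    return draft, False                 # iteration budget exhausted

# Toy stand-ins: a compliant tweet must mention the target location and
# damage level; in the workflow all three callables are LLM-backed.
target = {"location": "Kobe", "damage": "severe"}
gen = lambda t: "Buildings down near the station."
chk = lambda d, t: [k for k, v in t.items() if v.lower() not in d.lower()]
ref = lambda d, fails: d + " " + ", ".join(target[f] for f in fails) + "."
print(agentic_generation(gen, chk, ref, target))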
cs.CL / 7 / 2603.13636

Widespread Gender and Pronoun Bias in Moral Judgments Across LLMs

大型语言模型中的广泛性别和代词偏见对道德判断的影响
Fernandes, Gustavo Lúcius, Santos, Jeiverson C. V. M., Vaz-de-Melo, Pedro O. S.
Abstract
Large language models (LLMs) are increasingly used to assess moral or ethical statements, yet their judgments may reflect social and linguistic biases. This work presents a controlled, sentence-level study of how grammatical person, number, and gender markers influence LLM moral classifications of fairness. Starting from 550 balanced base sentences from the ETHICS dataset, we generated 26 counterfactual variants per item, systematically varying pronouns and demographic markers to yield 14,850 semantically equivalent sentences. We evaluated six model families (Grok, GPT, LLaMA, Gemma, DeepSeek, and Mistral), and measured fairness judgments and inter-group disparities using Statistical Parity Difference (SPD). Results show statistically significant biases: sentences written in the singular form and third person are more often judged as "fair", while those in the second person are penalized. Gender markers produce the strongest effects, with non-binary subjects consistently favored and male subjects disfavored. We conjecture that these patterns reflect distributional and alignment biases learned during training, emphasizing the need for targeted fairness interventions in moral LLM applications.
Chinese Translation
大型语言模型(LLMs)越来越多地用于评估道德或伦理陈述,但它们的判断可能反映社会和语言偏见。本研究呈现了一项控制性句子层面的研究,探讨语法人称、数量和性别标记如何影响LLM对公平性的道德分类。从ETHICS数据集中提取550个平衡的基础句子,我们为每个句子生成了26个反事实变体,系统性地变化代词和人口标记,产生了14,850个语义等价的句子。我们评估了六个模型家族(Grok、GPT、LLaMA、Gemma、DeepSeek和Mistral),并使用统计平衡差异(SPD)测量公平判断和组间差异。结果显示出统计显著的偏见:以单数形式和第三人称书写的句子更常被判断为“公平”,而第二人称的句子则受到惩罚。性别标记产生了最强的影响,非二元性别的主体始终受到偏爱,而男性主体则受到冷落。我们推测这些模式反映了训练过程中学习到的分布和对齐偏见,强调了在道德LLM应用中进行针对性公平干预的必要性。
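Statistical Parity Difference here is simply the gap in positive-judgment rates between two groups of sentences: SPD = P(judged fair | group A) - P(judged fair | group B). A self-contained sketch (the group names and toy judgments are invented):

def statistical_parity_difference(labels_by_group, positive="fair"):
    """SPD between exactly two groups of model judgments:
    P(positive | group A) - P(positive | group B)."""
    (group_a, la), (group_b, lb) = labels_by_group.items()
    pa = sum(x == positive for x in la) / len(la)
    pb = sum(x == positive for x in lb) / len(lb)
    return pa - pb

judgments = {
    "third_person": ["fair", "fair", "unfair", "fair"],
    "second_person": ["unfair", "fair", "unfair", "unfair"],
}
print(statistical_parity_difference(judgments))  # 0.75 - 0.25 = 0.5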
cs.CL / 8 / 2603.13651

Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

在社会科学与人文学科中对大型语言模型进行参考文献提取和解析的基准测试
Zhu, Yurui, Colavizza, Giovanni, Romanello, Matteo
Abstract
Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.
Chinese Translation
参考文献的提取和解析是引用索引、链接以及后续学术知识图谱构建的基础。然而,大多数现有的评估集中于干净的、英文的、文末的参考文献,因此在社会科学与人文学科(SSH)中表现不足,因为这些领域的引用通常是多语言的,嵌入在脚注中,缩写形式,并受到异质历史惯例的影响。我们提出了一个统一的基准,针对这些SSH现实条件,涵盖三个互补的数据集:CEX(涵盖多个学科的英文期刊文章)、EXCITE(德文/英文文档,包括文末、仅脚注和混合模式)以及LinkedBooks(具有强烈风格变化和多语言性的文献)。我们在一个受模式约束的设置下评估了三项逐渐增加难度的任务——参考文献提取、参考文献解析和端到端文档解析——以便直接比较强大的监督管道基线(GROBID)和当代大型语言模型(LLMs)(DeepSeek-V3.1、Mistral-Small-3.2-24B、Gemma-3-27B-it和Qwen3-VL(4B-32B变体))。在各个数据集中,提取在适度能力阈值以上基本饱和,而解析和端到端解析由于在嘈杂布局下的结构化输出脆弱性仍然是主要瓶颈。我们进一步展示了轻量级LoRA适应带来了持续的提升——尤其是在以SSH为主的基准上——并且分段/管道处理可以显著提高鲁棒性。最后,我们主张通过路由进行混合部署:利用GROBID处理结构良好的、符合分布的PDF文件,同时将多语言和脚注较多的文档提升至任务适应的LLMs。
cs.CL / 9 / 2603.13655

Privacy Preserving Topic-wise Sentiment Analysis of the Iran Israel USA Conflict Using Federated Transformer Models

基于联邦变换模型的伊朗-以色列-美国冲突主题情感分析的隐私保护
Islam, Md Saiful, Aurpa, Tanjim Taharat, Hasan, Sharad, Akter, Farzana
Abstract
The recent escalation of the Iran Israel USA conflict in 2026 has triggered widespread global discussions across social media platforms. As people increasingly use these platforms for expressing opinions, analyzing public sentiment from these discussions can provide valuable insights into global public perception. This study aims to analyze global public sentiment regarding the Iran Israel USA conflict by mining user-generated comments from YouTube news channels. The work contributes to public opinion analysis by introducing a privacy preserving framework that combines topic wise sentiment analysis with modern deep learning techniques and Federated Learning. To achieve this, approximately 19,000 YouTube comments were collected from major international news channels and preprocessed to remove noise and normalize text. Sentiment labels were initially generated using the VADER sentiment analyzer and later validated through manual inspection to improve reliability. Latent Dirichlet Allocation (LDA) was applied to identify key discussion topics related to the conflict. Several transformer-based models, including BERT, RoBERTa, XLNet, DistilBERT, ModernBERT, and ELECTRA, were fine tuned for sentiment classification. The best-performing model was further integrated into a federated learning environment to enable distributed training by preserving user data privacy. Additionally, Explainable Artificial Intelligence (XAI) techniques using SHAP were applied to interpret model predictions and identify influential words affecting sentiment classification. Experimental results demonstrate that transformer models perform effectively, and among them, ELECTRA achieved the best performance with 91.32% accuracy. The federated learning also maintained strong performance while preserving privacy, achieving 89.59% accuracy in a two client configuration.
Chinese Translation
2026年伊朗-以色列-美国冲突的升级引发了全球社交媒体平台的广泛讨论。随着人们越来越多地利用这些平台表达观点,从这些讨论中分析公众情感可以为全球公众认知提供宝贵的洞见。本研究旨在通过挖掘YouTube新闻频道中用户生成的评论,分析全球公众对伊朗-以色列-美国冲突的情感。此项工作通过引入一个结合主题情感分析与现代深度学习技术及联邦学习的隐私保护框架,为公众舆论分析作出了贡献。为此,我们收集了约19,000条来自主要国际新闻频道的YouTube评论,并进行了预处理,以去除噪声并规范化文本。使用VADER情感分析器最初生成情感标签,随后通过人工检查验证以提高可靠性。采用潜在狄利克雷分配(LDA)方法来识别与冲突相关的关键讨论主题。我们对多种基于变换器的模型(包括BERT、RoBERTa、XLNet、DistilBERT、ModernBERT和ELECTRA)进行了情感分类的微调。表现最佳的模型进一步集成到联邦学习环境中,以通过保护用户数据隐私实现分布式训练。此外,利用SHAP的可解释人工智能(XAI)技术对模型预测进行了解释,并识别影响情感分类的关键字。实验结果表明,变换器模型的表现有效,其中ELECTRA以91.32%的准确率达到了最佳性能。同时,联邦学习在保护隐私的同时也保持了良好的表现,在两个客户端配置中达到了89.59%的准确率。
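The two-client federated setup is, in essence, FedAvg: each client fine-tunes a private copy of the classifier on its own comments, and the server averages weights, so raw text never leaves a client. A toy sketch with a linear layer standing in for the fine-tuned transformer head (rounds, learning rate, and data are all illustrative):

import copy
import torch
import torch.nn.functional as F

def fedavg(global_model, client_loaders, rounds=3, lr=0.1):
    # Each round: clients fine-tune a private copy, server averages weights.
    # Only state_dicts are shared; comments stay on the client.
    for _ in range(rounds):
        states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(local(x), y).backward()
                opt.step()
            states.append(local.state_dict())
        avg = {k: torch.stack([s[k] for s in states]).mean(0)
               for k in states[0]}
        global_model.load_state_dict(avg)
    return global_model

# Two clients with disjoint toy "comment embedding" datasets.
model = torch.nn.Linear(16, 3)   # stand-in for the ELECTRA classifier head
client_data = [[(torch.randn(8, 16), torch.randint(0, 3, (8,)))
                for _ in range(4)] for _ in range(2)]
fedavg(model, client_data)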
cs.CL / 10 / 2603.13683

Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation

用于叙事生成中的分布外去偏见的预处理测试时适应
Shen, Hanwen, Ying, Ting, Lu, Jiajie, Wang, Shanshan
Abstract
Although debiased LLMs perform well on known bias patterns, they often fail to generalize to unfamiliar bias prompts, producing toxic outputs. We first validate that such high-bias prompts constitute a distribution shift via OOD detection, and show static models degrade under this shift. To adapt on-the-fly, we propose CAP-TTA, a test-time adaptation framework that performs context-aware LoRA updates only when the bias-risk trigger exceeds a threshold, using a precomputed diagonal preconditioner for fast and stable updates. Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting by significantly improving narrative fluency over the SOTA debiasing baseline while maintaining comparable debiasing effectiveness.
Chinese Translation
尽管去偏见的大型语言模型(LLMs)在已知偏见模式上表现良好,但它们往往无法对不熟悉的偏见提示进行泛化,导致产生有害输出。我们首先通过分布外(OOD)检测验证这些高偏见提示构成了“分布转移”,并表明静态模型在这种转移下性能下降。为了实现动态适应,我们提出了CAP-TTA,一个测试时适应框架,该框架仅在偏见风险“触发器”超过阈值时执行上下文感知的LoRA更新,使用预计算的对角“预处理器”以实现快速和稳定的更新。在有害提示设置和基准测试中,CAP-TTA减少了偏见(经人类评估确认),同时实现了比AdamW/SGD更低的更新延迟;它还通过显著提高叙事流畅性来缓解灾难性遗忘,超越了现有的去偏见基线,同时保持了可比的去偏见效果。
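A minimal sketch of the adaptation step, assuming it looks roughly like "skip unless the bias-risk score crosses a threshold, then take a diagonally preconditioned gradient step on the adapter parameters". The trigger model, threshold, and preconditioner construction are not specified in the abstract, so everything below, including the names, is illustrative:

import torch

def cap_tta_step(lora_params, loss_fn, precond, trigger_score,
                 threshold=0.5, lr=1e-3):
    """One trigger-gated, preconditioned test-time update.

    lora_params: trainable adapter tensors (base weights stay frozen).
    precond: per-parameter diagonal preconditioners, precomputed offline.
    trigger_score: scalar bias-risk estimate for the current prompt.
    """
    if trigger_score <= threshold:
        return False                    # low risk: skip adaptation entirely
    loss = loss_fn()
    grads = torch.autograd.grad(loss, lora_params)
    with torch.no_grad():
        for p, g, d in zip(lora_params, grads, precond):
            p -= lr * g / (d + 1e-8)    # diagonal preconditioning: cheap,
                                        # stable scaling without AdamW state
    return True

# Toy usage: adapt a single adapter tensor against a dummy loss.
A = torch.randn(4, 1, requires_grad=True)
pre = [torch.ones_like(A)]
updated = cap_tta_step([A], lambda: (A ** 2).sum(), pre, trigger_score=0.9)
print(updated, A.norm().item())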
cs.CL / 11 / 2603.13691

QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

QuarkMedBench:用于评估大型语言模型的真实场景驱动基准
Wu, Yao, Yin, Kangping, Dong, Liang, Ma, Zhenxin, Xu, Shuting, Wang, Xuehai, Jiang, Yuxuan, Yu, Tingting, Hong, Yunqing, Liu, Jiayi, Huang, Rianzhe, Zhao, Shuxin, Hu, Haiping, Shang, Wen, Xu, Jian, Jiang, Guanjun
Abstract
While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to capture the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a massive dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, effectively mitigating the high costs and subjectivity of human grading. Experimental results demonstrate that the generated rubrics achieve a 91.8% concordance rate with clinical expert blind audits, establishing highly dependable medical reliability. Crucially, baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. Ultimately, QuarkMedBench establishes a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, while its framework inherently supports dynamic knowledge updates to prevent benchmark obsolescence.
Chinese Translation
虽然大型语言模型(LLMs)在标准化医学考试中表现出色,但高分往往无法转化为对真实世界医学查询的高质量响应。目前的评估过于依赖选择题,未能捕捉到真实用户询问中固有的非结构化、模糊和长尾复杂性。为填补这一空白,我们推出了QuarkMedBench,这是一个针对真实世界医学LLM评估的生态有效基准。我们编制了一个庞大的数据集,涵盖临床护理、健康保健和专业咨询,共包含20,821个单轮查询和3,853个多轮会话。为了客观评估开放式答案,我们提出了一种自动评分框架,该框架结合了多模型共识和基于证据的检索,动态生成220,617个细化评分标准(每个查询约9.8个)。在评估过程中,分层加权和安全约束结构性量化医学准确性、关键点覆盖率和风险拦截,有效减轻了人工评分的高成本和主观性。实验结果表明,生成的评分标准与临床专家盲审的符合率达到91.8%,建立了高度可靠的医学可靠性。重要的是,在这一基准上的基线评估揭示了在应对真实世界临床细微差别时,最先进模型之间存在显著的性能差异,突显了传统考试基础指标的局限性。最终,QuarkMedBench为测量LLM在复杂健康问题上的表现建立了一个严格、可重复的标准,同时其框架内在地支持动态知识更新,以防止基准过时。
cs.CL / 12 / 2603.13696

Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models

非排他性的重复:儿童规模语言模型中指称机制的规模敏感性
Cacioli, Jon-Paul
Abstract
We present the first systematic evaluation of mutual exclusivity (ME) -- the bias to map novel words to novel referents -- in text-only language models trained on child-directed speech. We operationalise ME as referential suppression: when a familiar object is relabelled in a two-referent discourse context, ME predicts decreased probability of the labelled noun at a subsequent completion position. Three pilot findings motivate a pre-registered scale-sensitivity experiment: (1) a masked language model (BabyBERTa) is entirely insensitive to multi-sentence referential context; (2) autoregressive models show robust repetition priming -- the opposite of ME -- when familiar nouns are re-labelled; and (3) a novel context-dependence diagnostic reveals that apparent ME-like patterns with nonce tokens are fully explained by embedding similarity, not referential disambiguation. In the confirmatory experiment, we train 45 GPT-2-architecture models (2.9M, 8.9M, and 33.5M parameters; 5, 10, and 20 epochs on AO-CHILDES; 5 seeds each) and evaluate on a pre-registered ME battery. Anti-ME repetition priming is significant in all 9 cells (85-100% of items; all p < 2.4 x 10^-13). Priming attenuates with improved language modelling (Spearman rho = -0.533, p = 0.0002) but never crosses zero across a 3.8x perplexity range. The context-dependence diagnostic replicates in all 9 cells, and dose-response priming increases with repetitions in 8/9 cells (all trend p < 0.002). These findings indicate that distributional learning on child-directed speech produces repetition-based reference tracking rather than lexical exclusivity. We connect this to the grounded cognition literature and argue that referential grounding may be a necessary ingredient for ME -- an empirical claim about required input structure, not a nativist one.
Chinese Translation
我们首次系统评估了互斥性(Mutual Exclusivity, ME)——将新词映射到新指称的偏向——在基于儿童导向语料训练的文本语言模型中的表现。我们将互斥性操作化为指称抑制:当在双指称话语上下文中重新标记一个熟悉对象时,互斥性预测在后续完成位置上标记名词的概率降低。三个初步发现激励了一个预注册的规模敏感性实验:(1)一个掩蔽语言模型(BabyBERTa)对多句子指称上下文完全不敏感;(2)自回归模型在重新标记熟悉名词时表现出强烈的重复启动——这与互斥性相反;(3)一个新的上下文依赖诊断揭示,表面上类似互斥性的模式与新词令牌完全可以通过嵌入相似性解释,而不是指称消歧。在确认实验中,我们训练了45个GPT-2架构模型(参数为2.9M、8.9M和33.5M;在AO-CHILDES上训练5、10和20个周期;每个模型5个种子),并在预注册的互斥性测试中进行评估。反互斥性重复启动在所有9个单元中显著(85-100%的项目;所有p < 2.4 x 10^-13)。随着语言建模的改进,启动效应减弱(Spearman rho = -0.533, p = 0.0002),但在3.8倍困惑度范围内始终未跨越零。上下文依赖诊断在所有9个单元中复制,并且在8/9个单元中,剂量反应启动随着重复的增加而增强(所有趋势p < 0.002)。这些发现表明,基于儿童导向语料的分布式学习产生了基于重复的指称跟踪,而非词汇排他性。我们将此与扎根认知文献联系起来,并认为指称扎根可能是互斥性所需的一个必要成分——这是一个关于所需输入结构的实证主张,而非天赋论的主张。
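Referential suppression is measurable with a few lines of Hugging Face code: compare the probability of the familiar noun at the completion position under a relabeling context versus a control. A sketch using gpt2 as a stand-in for the child-scale models (the probe sentences are invented, not items from the pre-registered battery):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def noun_prob(context: str, noun: str) -> float:
    ids = tok(context, return_tensors="pt").input_ids
    noun_id = tok(" " + noun).input_ids[0]   # leading space: word-initial token
    with torch.no_grad():
        logits = model(ids).logits[0, -1]    # next-token distribution
    return torch.softmax(logits, dim=-1)[noun_id].item()

control = "There is a ball and a dax. Can you give me the"
relabel = "Look, this ball is a dax! Can you give me the"
# Mutual exclusivity predicts suppression: P(ball | relabel) < P(ball | control);
# the anti-ME repetition priming reported above predicts the opposite.
print(noun_prob(control, "ball"), noun_prob(relabel, "ball"))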
cs.CL / 13 / 2603.13725

Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality

我们能信任记忆电阻器上的大型语言模型吗?深入探讨非理想条件下的推理能力
Wu, Taiqiang, Cheng, Yuxin, Ding, Chenchen, Yang, Runming, Feng, Xincheng, Zhou, Wenyong, Liu, Zhengwu, Wong, Ngai
Abstract
Memristor-based analog compute-in-memory (CIM) architectures provide a promising substrate for the efficient deployment of Large Language Models (LLMs), owing to superior energy efficiency and computational density. However, these architectures suffer from precision issues caused by intrinsic non-idealities of memristors. In this paper, we first conduct a comprehensive investigation into the impact of such typical non-idealities on LLM reasoning. Empirical results indicate that reasoning capability decreases significantly but varies for distinct benchmarks. Subsequently, we systematically appraise three training-free strategies, including thinking mode, in-context learning, and module redundancy. We thus summarize valuable guidelines, i.e., shallow layer redundancy is particularly effective for improving robustness, thinking mode performs better under low noise levels but degrades at higher noise, and in-context learning reduces output length with a slight performance trade-off. Our findings offer new insights into LLM reasoning under non-ideality and practical strategies to improve robustness.
Chinese Translation
基于记忆电阻器的模拟内存计算(CIM)架构为大型语言模型(LLMs)的高效部署提供了有前景的基础,因其具有优越的能效和计算密度。然而,这些架构受到记忆电阻器内在非理想性的影响,导致精度问题。在本文中,我们首先对这些典型非理想性对LLM推理的影响进行了全面调查。实证结果表明,推理能力显著下降,但在不同基准测试中表现各异。随后,我们系统评估了三种无训练策略,包括思维模式、上下文学习和模块冗余。因此,我们总结出有价值的指导原则,即浅层冗余对于提高鲁棒性特别有效,思维模式在低噪声水平下表现更佳,但在高噪声下会下降,而上下文学习则在稍微牺牲性能的情况下减少输出长度。我们的研究结果为非理想条件下LLM的推理提供了新的见解,并提出了改善鲁棒性的实用策略。
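One common way to simulate such non-idealities in software is multiplicative Gaussian noise on the weights, w <- w * (1 + eps) with eps ~ N(0, sigma^2). A sketch of that perturbation (programming noise only; real device models add quantization, drift, and stuck-at faults, and the paper's noise model may differ):

import torch

def add_memristor_noise(model, rel_sigma=0.05):
    """Crude stand-in for device non-ideality: multiplicative Gaussian
    noise on every weight. Apply before evaluating the noisy model."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(1.0 + rel_sigma * torch.randn_like(p))
    return model

net = torch.nn.Linear(8, 8)
before = net.weight.clone()
add_memristor_noise(net, rel_sigma=0.02)
print((net.weight - before).abs().mean().item())  # average perturbation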
cs.CL / 14 / 2603.13765

Knowledge Distillation for Large Language Models

大型语言模型的知识蒸馏
La Torre, Alejandro Paredes, Flores, Barbara, Rodriguez, Diego
Abstract
We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
Chinese Translation
我们提出了一种资源高效的框架,通过知识蒸馏结合引导性思维链强化学习来压缩大型语言模型。以 Qwen 3B 作为教师模型,Qwen 0.5B 作为学生模型,我们在英语 Dolly-15k、西班牙语 Dolly-15k 以及代码 BugNet 和 PyTorrent 数据集上应用知识蒸馏,并在英语环境中调整超参数以优化学生模型的性能。在各项任务中,蒸馏后的学生模型保留了教师模型的相当一部分能力,同时体积显著减小:英语中为 70% 到 91%,西班牙语中最高可达 95%,在代码任务中 Rouge-L 指标最高可达 93.5%。对于编码任务,将思维链提示与基于组相对策略优化(Group Relative Policy Optimization)相结合,使用 CoT 注释的 Codeforces 数据,相较于单独的知识蒸馏,改善了推理的连贯性和解决方案的正确性。后训练的 4 位权重量化进一步减少了内存占用和推理延迟。这些结果表明,结合思维链引导的强化学习的知识蒸馏能够生成适合在资源受限环境中部署的紧凑高效模型。
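The distillation term here is presumably some variant of the textbook objective: cross-entropy on gold labels plus a temperature-softened KL against the teacher. A standard Hinton-style sketch (the paper's exact weighting and temperature are not stated, so T and alpha below are placeholders):

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Textbook distillation objective: CE on gold labels plus a
    temperature-softened KL term against the teacher's distribution."""
    ce = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.log_softmax(teacher_logits / T, dim=-1),
                    log_target=True, reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * soft

s = torch.randn(4, 10)   # student logits (e.g., Qwen 0.5B)
t = torch.randn(4, 10)   # teacher logits (e.g., Qwen 3B)
y = torch.randint(0, 10, (4,))
print(kd_loss(s, t, y))

The T * T factor keeps gradient magnitudes comparable across temperatures, which is why it appears in the standard recipe.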
cs.CL / 15 / 2603.13773

LiveWeb-IE: A Benchmark For Online Web Information Extraction

LiveWeb-IE:在线网页信息提取基准
Yang, Seungbin, Kim, Jihwan, Choi, Jaemin, Kim, Dongjin, Yang, Soyoung, Park, ChaeHun, Choo, Jaegul
Abstract
Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.
Chinese Translation
网页信息提取(WIE)是自动从网页中提取数据的任务,为各种应用提供了高效的实用性。WIE系统的评估传统上依赖于从单一时间点捕获的HTML快照构建的基准。然而,这种离线评估范式未能考虑网页的时间演变特性;因此,在这些静态基准上的表现往往无法推广到动态的现实场景。为了解决这一问题,我们引入了LiveWeb-IE,这是一个旨在直接针对实时网站评估WIE系统的新基准。基于经过信任和授权的网站,我们策划了需要提取各种数据类别(如文本、图像和超链接)的自然语言查询。我们进一步设计这些查询,以根据要提取的属性数量和基数表示四个复杂性级别,从而实现对WIE系统的细致评估。此外,我们提出了视觉定位抓取器(Visual Grounding Scraper, VGS),这是一种新颖的多阶段智能框架,通过视觉上缩小网页内容来提取所需信息,模拟人类的认知过程。针对多种基础模型的广泛实验展示了VGS的有效性和鲁棒性。我们相信,这项研究为开发实用且稳健的WIE系统奠定了基础。
cs.CL / 16 / 2603.13777

Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction

生成后修正:面向方面情感四元组预测的单次全局修正
He, Shidong, Wang, Haoyu, Luo, Wenjie
Abstract
Aspect-based sentiment analysis (ABSA) extracts aspect-level sentiment signals from user-generated text, supports product analytics, experience monitoring, and public-opinion tracking, and is central to fine-grained opinion mining. A key challenge in ABSA is aspect sentiment quad prediction (ASQP), which requires identifying four elements: the aspect term, the aspect category, the opinion term, and the sentiment polarity. However, existing studies usually linearize the unordered quad set into a fixed-order template and decode it left-to-right. With teacher forcing training, the resulting training-inference mismatch (exposure bias) lets early prefix errors propagate to later elements. The linearization order determines which elements appear earlier in the prefix, so this propagation becomes order-sensitive and is hard to repair in a single pass. To address this, we propose a method, Generate-then-Correct (G2C): a generator drafts quads and a corrector performs a single-shot, sequence-level global correction trained on LLM-synthesized drafts with common error patterns. On the Rest15 and Rest16 datasets, G2C outperforms strong baseline models.
Chinese Translation
基于方面的情感分析(ABSA)从用户生成的文本中提取方面级情感信号,支持产品分析、体验监测和舆情追踪,是细粒度意见挖掘的核心。ABSA中的一个关键挑战是方面情感四元组预测(ASQP),该任务需要识别四个元素:方面术语、方面类别、意见术语和情感极性。然而,现有研究通常将无序的四元组集线性化为固定顺序的模板,并从左到右进行解码。在教师强制训练下,产生的训练-推理不匹配(曝光偏差)使得早期前缀错误传播到后续元素。线性化顺序决定了哪些元素在前缀中出现得更早,因此这种传播变得对顺序敏感,并且在单次通过中难以修复。为了解决这个问题,我们提出了一种方法,生成后修正(Generate-then-Correct,G2C):生成器草拟四元组,修正器在基于大型语言模型(LLM)合成的草稿上进行单次序列级全局修正,针对常见错误模式进行训练。在Rest15和Rest16数据集上,G2C的表现优于强基线模型。
cs.CL / 17 / 2603.13786

Projection-Free Evolution Strategies for Continuous Prompt Search

无投影进化策略用于连续提示搜索
Cai, Yu, Huang, Canxi, He, Xiaoyu
Abstract
Continuous prompt search offers a computationally efficient alternative to conventional parameter tuning in natural language processing tasks. Nevertheless, its practical effectiveness can be significantly hindered by the black-box nature and the inherent high-dimensionality of the objective landscapes. Existing methods typically mitigate these challenges by restricting the search to a randomly projected low-dimensional subspace. However, the effectiveness and underlying motivation of the projection mechanism remain ambiguous. In this paper, we first empirically demonstrate that despite the prompt space possessing a low-dimensional structure, random projections fail to adequately capture this essential structure. Motivated by this finding, we propose a projection-free prompt search method based on evolutionary strategies. By directly optimizing in the full prompt space with an adaptation mechanism calibrated to the intrinsic dimension, our method achieves competitive search capabilities without additional computational overhead. Furthermore, to bridge the generalization gap in few-shot scenarios, we introduce a confidence-based regularization mechanism that systematically enhances the model's confidence in the target verbalizers. Experimental results on seven natural language understanding tasks from the GLUE benchmark demonstrate that our proposed approach significantly outperforms existing baselines.
Chinese Translation
连续提示搜索为自然语言处理任务中的传统参数调优提供了一种计算效率高的替代方案。然而,其实际效果可能受到黑箱特性和目标景观固有的高维性显著制约。现有方法通常通过将搜索限制在随机投影的低维子空间来缓解这些挑战。然而,投影机制的有效性及其背后的动机仍然模糊不清。在本文中,我们首先通过实证研究表明,尽管提示空间具有低维结构,随机投影未能充分捕捉这一基本结构。基于这一发现,我们提出了一种基于进化策略的无投影提示搜索方法。通过在完整的提示空间中直接优化,并采用与内在维度相适应的机制,我们的方法在不增加额外计算开销的情况下实现了竞争性的搜索能力。此外,为了弥补少样本场景中的泛化差距,我们引入了一种基于置信度的正则化机制,系统性地增强模型对目标语言化器的信心。在GLUE基准的七个自然语言理解任务上的实验结果表明,我们提出的方法显著优于现有基线。
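A projection-free search of this kind can be as simple as an evolution-strategies loop directly in the full prompt-embedding space, with antithetic samples giving a black-box gradient estimate. A numpy sketch (the paper additionally calibrates its adaptation to the intrinsic dimension, which this toy version omits; the quadratic objective stands in for a frozen LM's task score):

import numpy as np

def es_prompt_search(score, dim, iters=200, pop=8, sigma=0.1, lr=0.05, seed=0):
    """Plain evolution strategies in the full prompt space, no projection.
    Antithetic perturbation pairs give a cheap estimate of the gradient of
    the black-box task metric `score`, which we ascend."""
    rng = np.random.default_rng(seed)
    p = np.zeros(dim)                        # continuous prompt, full dim
    for _ in range(iters):
        eps = rng.normal(size=(pop, dim))
        rewards = np.array([score(p + sigma * e) - score(p - sigma * e)
                            for e in eps])   # antithetic pairs
        grad = (rewards[:, None] * eps).mean(0) / (2 * sigma)
        p += lr * grad
    return p

# Toy objective standing in for a frozen LM's accuracy on a task.
target = np.random.default_rng(1).normal(size=32)
best = es_prompt_search(lambda v: -np.sum((v - target) ** 2), dim=32)
print(np.linalg.norm(best - target))         # shrinks toward 0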
cs.CL / 18 / 2603.13791

DeceptGuard :A Constitutional Oversight Framework For Detecting Deception in LLM Agents

DeceptGuard:一种用于检测大型语言模型代理中欺骗行为的宪法监督框架
Mukhopadhyay, Snehasis
Abstract
Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held-out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution-verified labels. Across all evaluation settings, CoT-aware and activation-probe monitors substantially outperform their black-box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long-horizon deception that leaves minimal behavioral footprints. We empirically characterize a transparency-detectability trade-off: as agents learn to suppress overt behavioral signals, chain-of-thought becomes the primary detection surface but is itself increasingly unreliable due to post-training faithfulness degradation. We propose HYBRID-CONSTITUTIONAL ensembles as a robust defense-in-depth approach, achieving a pAUROC of 0.934 on the held-out test set, representing a substantial advance over the prior state of the art.
Chinese Translation
在高风险代理环境中,可靠地检测大型语言模型(LLM)代理的欺骗行为是安全部署的基本前提。以往关于欺骗检测的研究主要集中在仅观察外部可见工具调用和输出的黑箱监控上,忽视了潜在丰富的内部推理信号。我们提出了DECEPTGUARD,这是一个统一框架,系统地比较了三种监控机制:黑箱监控(仅监控行为和输出)、链式推理(CoT)感知监控(额外观察代理的推理链迹)和激活探测监控(额外读取来自冻结开放权重编码器的隐藏状态表示)。我们引入了DECEPTSYNTH,这是一个可扩展的合成管道,用于生成符合欺骗特征和不符合欺骗特征的代理轨迹,涵盖了一个新的12类分类法,涉及语言、行为和结构欺骗。我们的监控系统在4800个合成轨迹上进行了优化,并在来自DeceptArena的9200个保留样本上进行了评估,DeceptArena是一个具有执行验证标签的现实沙盒代理环境基准。在所有评估设置中,CoT感知监控和激活探测监控的表现显著优于黑箱监控(平均pAUROC提升0.097),在微妙的、长时间的欺骗行为上获得了最大的增益,这种欺骗行为留下的行为足迹最小。我们经验性地描述了一种透明度与可检测性之间的权衡:随着代理学习抑制明显的行为信号,推理链成为主要的检测表面,但由于训练后忠实度的下降,它本身也变得越来越不可靠。我们提出了HYBRID-CONSTITUTIONAL集成作为一种稳健的深度防御方法,在保留的测试集上实现了0.934的pAUROC,代表了相较于之前的最新技术的显著进步。
cs.CL / 19 / 2603.13793

GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages

GhanaNLP 平行语料库:低资源加纳语言的综合多语言资源
Gyamfi, Lawrence Adu, Azunre, Paul, Moore, Stephen Edward, Budu, Joel, Asare, Akwasi, Owusu, Mich-Seth, Asiamah, Jonathan Ofori
Abstract
Low resource languages present unique challenges for natural language processing due to the limited availability of digitized and well structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.
Chinese Translation
低资源语言在自然语言处理领域面临独特挑战,因为数字化和结构良好的语言数据的可用性有限。为了解决这一问题,GhanaNLP 计划开发并整理了 41,513 对平行句子,涵盖了在加纳广泛使用但在数字空间中仍然代表不足的 Twi、Fante、Ewe、Ga 和 Kusaal 语言。每个数据集由本地语言与英语之间精心对齐的句子对组成。数据由专业人员收集、翻译和注释,并通过标准结构元数据进行丰富,以确保一致性和可用性。这些语料库旨在支持研究、教育和商业应用,包括机器翻译、语音技术和语言保护。本文记录了数据集创建的方法论、结构、预期使用案例和评估,以及它们在实际应用中的部署,例如 Khaya AI 翻译引擎。总体而言,这项工作为更广泛的努力做出了贡献,旨在通过为非洲语言提供包容性和可获取的语言技术来实现人工智能的民主化。
cs.CL / 20 / 2603.13796

PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement

PMIScore:一种无监督的方法来量化对话参与度
Guo, Yongkang, Huang, Zhihuan, Kong, Yuqing
Abstract
High dialogue engagement is a crucial indicator of an effective conversation. A reliable measure of engagement could help benchmark large language models, enhance the effectiveness of human-computer interactions, or improve personal communication skills. However, quantifying engagement is challenging, since it is subjective and lacks a "gold standard". This paper proposes PMIScore, an efficient unsupervised approach to quantify dialogue engagement. It uses pointwise mutual information (PMI), which compares the probability of generating a response conditioned on the conversation history against its unconditional probability. Thus, PMIScore offers a clear interpretation of engagement. As directly computing PMI is intractable due to the complexity of dialogues, PMIScore learns it through a dual form of divergence. The algorithm includes generating positive and negative dialogue pairs, extracting embeddings with large language models (LLMs), and training a small neural network using a mutual information loss function. We validated PMIScore on both synthetic and real-world datasets. Our results demonstrate the effectiveness of PMIScore in PMI estimation and the reasonableness of the PMI metric itself.
Chinese Translation
高对话参与度是有效对话的重要指标。可靠的参与度测量可以帮助基准大型语言模型,增强人机交互的有效性,或改善个人沟通技能。然而,量化参与度具有挑战性,因为它是主观的并且缺乏“金标准”。本文提出了PMIScore,一种高效的无监督方法来量化对话参与度。它使用点互信息(PMI),即将在对话历史条件下生成响应的概率与其无条件概率进行比较。因此,PMIScore提供了对参与度的清晰解释。由于对话的复杂性,直接计算PMI并不可行,PMIScore通过散度的对偶形式对其进行学习。该算法包括生成正负对话对,通过大型语言模型(LLMs)提取嵌入,并使用互信息损失函数训练一个小型神经网络。我们在合成和真实世界数据集上验证了PMIScore。我们的结果证明了PMIScore在PMI估计中的有效性以及PMI度量本身的合理性。
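Binary NCE is one standard dual-form route to PMI: train a critic to separate real (history, response) pairs from shuffled ones, and with balanced classes its logit converges to log p(r|h) - log p(r). A synthetic-embedding sketch of that recipe (the paper's specific divergence, architecture, and LLM embedding pipeline are not given in the abstract, so this is a schematic stand-in):

import torch
import torch.nn as nn

class PMICritic(nn.Module):
    # Small critic f(history, response); after NCE training its logit
    # approximates log p(r|h) - log p(r), usable as an engagement score.
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, h, r):
        return self.net(torch.cat([h, r], dim=-1)).squeeze(-1)

d, n = 16, 256
h = torch.randn(n, d)
r_pos = h + 0.3 * torch.randn(n, d)   # real pairs: responses track histories
r_neg = torch.randn(n, d)             # shuffled pairs: no dependence
critic = PMICritic(d)
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
for _ in range(300):
    logits = torch.cat([critic(h, r_pos), critic(h, r_neg)])
    labels = torch.cat([torch.ones(n), torch.zeros(n)])
    opt.zero_grad()
    bce(logits, labels).backward()
    opt.step()
print(critic(h, r_pos).mean().item(), critic(h, r_neg).mean().item())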
cs.CL / 21 / 2603.13853

APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution

APEX-Searcher:通过代理规划和执行增强大语言模型的搜索能力
Chen, Kun, Kong, Qingchao, Zhao, Feifei, Mao, Wenji
Abstract
Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they are still faced with challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in the end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper, we propose APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: it first employs RL with decomposition-specific rewards to optimize strategic planning; built on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performances across multiple benchmarks.
Chinese Translation
基于大语言模型(LLMs)的检索增强生成(RAG)作为一种重要的方法,在各种领域应用中检索和利用外部知识。当面临复杂的多跳问题时,单轮检索往往不足以进行准确的推理和问题解决。为了增强复杂任务的搜索能力,大多数现有工作通过端到端训练将多轮迭代检索与推理过程相结合。尽管这些方法显著提高了问题解决的性能,但在任务推理和模型训练方面仍面临挑战,尤其是在端到端强化学习(RL)过程中出现的模糊检索执行路径和稀疏奖励,导致检索结果不准确和性能下降。为了解决这些问题,本文提出了APEX-Searcher,一种新颖的代理规划和执行框架,以增强LLM的搜索能力。具体而言,我们引入了一个两阶段的代理框架,将检索过程解耦为规划和执行:首先采用具有特定分解奖励的强化学习来优化战略规划;基于子任务分解,随后对高质量的多跳轨迹进行监督微调,以使模型具备强大的迭代子任务执行能力。大量实验表明,我们提出的框架在多个基准测试中在多跳RAG和任务规划性能上取得了显著提升。
cs.CL / 22 / 2603.13875

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

GradMem:通过测试时梯度下降学习将上下文写入内存
Kuratov, Yuri, Kairov, Matvey, Bulatov, Aydar, Rodkin, Ivan, Burtsev, Mikhail
Abstract
Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
Chinese Translation
许多大型语言模型应用需要基于长上下文进行条件处理。变换器通常通过存储大量每层的键值缓存(KV-cache)来支持这一点,但这会带来显著的内存开销。一种理想的替代方案是压缩内存(compressive memory):一次读取上下文,将其存储在紧凑状态中,并从该状态回答多个查询。我们在上下文移除的设置中研究这一点,在该设置中,模型必须在推理时生成答案而无法访问原始上下文。我们引入了GradMem,它通过每个样本的测试时优化将上下文写入内存。在给定上下文的情况下,GradMem在一小组前缀内存标记上执行几步梯度下降,同时保持模型权重不变。GradMem明确优化模型级自监督上下文重建损失,导致一种基于损失的写操作,并具有迭代错误修正,与仅向前的方法不同。在关联键值检索任务中,GradMem在相同内存大小下优于仅向前的内存写入器,额外的梯度步骤在扩展容量方面比重复的向前写入更有效。我们进一步表明,GradMem超越了合成基准:在预训练语言模型的支持下,它在包括bAbI和SQuAD变体在内的自然语言任务中取得了具有竞争力的结果,仅依赖于存储中编码的信息。
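The write phase is easy to prototype: freeze the LM, prepend a few trainable prefix embeddings, and run a handful of gradient steps on a context-reconstruction loss. A Hugging Face sketch with gpt2 as a stand-in backbone (the paper's model-level self-supervised loss, memory size, and step count may all differ from the plain LM loss used here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                    # weights stay frozen

ctx_ids = tok("The capital of Freedonia is Fredville.",
              return_tensors="pt").input_ids
ctx_emb = model.get_input_embeddings()(ctx_ids)
# 8 trainable memory slots; only these get gradients at write time.
memory = torch.nn.Parameter(0.02 * torch.randn(1, 8, ctx_emb.size(-1)))
opt = torch.optim.Adam([memory], lr=0.05)

for step in range(20):                         # a few write-time steps
    inputs = torch.cat([memory, ctx_emb], dim=1)
    out = model(inputs_embeds=inputs)
    # Predict each context token from memory plus the preceding context.
    logits = out.logits[:, memory.size(1) - 1 : -1]
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ctx_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())                             # reconstruction loss falls

At read time the context embeddings are dropped and only `memory` is prepended, which is what the context removal setting above evaluates.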
cs.CL / 23 / 2603.13891

Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation

大型语言模型在文本注释中再现种族刻板印象
Törnberg, Petter
Abstract
Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways that mirror racial stereotypes. In a names-based experiment spanning 39 annotation tasks, texts containing names associated with Black individuals are rated as more aggressive by 18 of 19 models and more gossipy by 18 of 19. Asian names produce a bamboo-ceiling profile: 17 of 19 models rate individuals as more intelligent, while 18 of 19 rate them as less confident and less sociable. Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined. In a matched dialect experiment, the same sentence is judged significantly less professional (all 19 models, mean gap -0.774), less indicative of an educated speaker (-0.688), more toxic (18/19), and more angry (19/19) when written in African American Vernacular English rather than Standard American English. A notable exception occurs for name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants. These findings suggest that using LLMs as automated annotators can embed socially patterned biases directly into the datasets and measurements that increasingly underpin research, governance, and decision-making.
Chinese Translation
大型语言模型(LLMs)在自动文本注释中的应用日益广泛,涵盖从学术研究到内容审核和招聘等任务。在19个LLM和两个实验中,总计超过400万条注释判断,我们展示了文本中嵌入的微妙身份线索系统性地偏向注释结果,反映出种族刻板印象。在一个基于姓名的实验中,涉及39个注释任务,包含与黑人个体相关的姓名的文本被19个模型中的18个评定为更具攻击性,18个模型则认为其更具八卦性。亚洲姓名呈现出“竹子天花板”特征:19个模型中的17个将这些个体评定为更聪明,而18个模型则认为他们自信心较低且不够社交。阿拉伯姓名则引发认知提升与人际贬值,并且所有四个少数群体在自律性上均被一致评定为较低。在一个匹配方言的实验中,同一句话在使用非洲裔美国人方言(African American Vernacular English)书写时,被评定为显著不够专业(所有19个模型,平均差距为-0.774)、不够受过教育的发言者的迹象(-0.688)、更具毒性(18/19),以及更愤怒(19/19)。在基于姓名的可雇佣性方面,出现了一个显著的例外,微调似乎过度校正,系统性地偏向少数族裔姓名的申请者。这些发现表明,使用LLMs作为自动注释工具可能会将社会模式化的偏见直接嵌入到日益支撑研究、治理和决策的数据集和测量中。
cs.CL / 24 / 2603.13933

OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset

OmniCompliance-100K:一个多领域、基于规则的现实世界安全合规数据集
Hu, Wenbin, Jing, Huihao, Shi, Haochen, Fan, Changxuan, Li, Haoran, Song, Yangqiu
Abstract
Ensuring the safety and compliance of large language models (LLMs) is of paramount importance. However, existing LLM safety datasets often rely on ad-hoc taxonomies for data generation and suffer from a significant shortage of rule-grounded, real-world cases that are essential for robustly protecting LLMs. In this work, we address this critical gap by constructing a comprehensive safety dataset from a compliance perspective. Using a powerful web-searching agent, we collect a rule-grounded, real-world case dataset OmniCompliance-100K, sourced from multi-domain authoritative references. The dataset spans 74 regulations and policies across a wide range of domains, including security and privacy regulations, content safety and user data privacy policies from leading AI companies and social media platforms, financial security requirements, medical device risk management standards, educational integrity guidelines, and protections of fundamental human rights. In total, our dataset contains 12,985 distinct rules and 106,009 associated real-world compliance cases. Our analysis confirms a strong alignment between the rules and their corresponding cases. We further conduct extensive benchmarking experiments to evaluate the safety and compliance capabilities of advanced LLMs across different model scales. Our experiments reveal several interesting findings that have great potential to offer valuable insights for future LLM safety research.
Chinese Translation
确保大型语言模型(LLMs)的安全性和合规性至关重要。然而,现有的LLM安全数据集往往依赖于临时分类法进行数据生成,并且在基于规则的现实案例方面存在显著短缺,而这些案例对于有效保护LLMs是必不可少的。在本研究中,我们通过从合规的角度构建一个全面的安全数据集来填补这一关键空白。我们使用强大的网络搜索代理,收集了一个基于规则的现实案例数据集OmniCompliance-100K,数据来源于多个领域的权威参考文献。该数据集涵盖了74项法规和政策,涉及广泛的领域,包括来自领先AI公司和社交媒体平台的安全与隐私法规、内容安全和用户数据隐私政策、金融安全要求、医疗设备风险管理标准、教育诚信指南以及基本人权的保护。总的来说,我们的数据集包含12,985条独特规则和106,009个相关的现实世界合规案例。我们的分析确认了规则与其对应案例之间的强一致性。我们进一步进行了广泛的基准实验,以评估不同模型规模下先进LLMs的安全性和合规能力。我们的实验揭示了一些有趣的发现,这些发现具有很大的潜力为未来的LLM安全研究提供有价值的见解。
cs.CL / 25 / 2603.13950

ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering

ToolFlood:超越选择——通过语义覆盖隐藏 LLM 代理的有效工具
Jawad, Hussein, Brunel, Nicolas J-B
Abstract
Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval-layer attack on tool-augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker-controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top-k results, and push all benign tools out of the agent's context. ToolFlood uses a two-phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine-distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: https://github.com/as1-prog/ToolFlood
Chinese Translation
大型语言模型(LLM)代理越来越多地使用外部工具来处理复杂任务,并依赖基于嵌入的检索来选择一个小的 top-k 子集进行推理。随着这些系统的扩展,检索阶段的稳健性尚未得到充分探讨,尽管之前的研究已考察了对工具选择的攻击。本文介绍了 ToolFlood,这是一种针对工具增强的 LLM 代理的检索层攻击。ToolFlood 通过注入少量攻击者控制的工具来压倒检索本身,而不是在检索后改变选择哪个工具,这些工具的元数据通过利用嵌入空间的几何特性被精心放置。这些工具在语义上覆盖了许多用户查询,主导了 top-k 结果,并将所有良性工具推离代理的上下文。ToolFlood 使用两阶段的对抗工具生成策略。首先,它对目标查询的子集进行采样,并使用 LLM 迭代生成多样化的工具名称和描述。然后,它运行一个迭代贪婪选择,选择在余下查询的嵌入空间中最大化覆盖的工具,设定余弦距离阈值,当所有查询被覆盖或达到预算时停止。我们提供了检索饱和的理论分析,并在标准基准上展示 ToolFlood 在低注入率(ToolBench 中为 1%)下实现了高达 95% 的攻击成功率。代码将在以下链接公开发布:https://github.com/as1-prog/ToolFlood
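
The second phase, iterative greedy selection under a cosine-distance threshold, is a classic max-coverage loop. A minimal sketch over random embeddings follows; `greedy_cover`, the threshold, and the budget are illustrative, and the LLM-driven generation of tool names and descriptions in phase one is omitted.

```python
import numpy as np

def greedy_cover(query_embs, tool_embs, cos_dist_thresh, budget=10):
    """Iterative greedy selection: repeatedly pick the tool that covers the
    most still-uncovered queries, where a tool covers a query if their cosine
    distance is below the threshold. Embeddings must be L2-normalized, so
    cosine distance is 1 - dot product. Returns selected tool indices."""
    covers = tool_embs @ query_embs.T >= (1.0 - cos_dist_thresh)
    uncovered, selected = set(range(len(query_embs))), []
    while uncovered and len(selected) < budget:
        rest = sorted(uncovered)
        gain, best = max((covers[t, rest].sum(), t)
                         for t in range(len(tool_embs)) if t not in selected)
        if gain == 0:
            break  # no remaining tool covers any new query
        selected.append(best)
        uncovered -= {q for q in uncovered if covers[best, q]}
    return selected

# Toy data: random unit embeddings, with a loose threshold so coverage occurs.
rng = np.random.default_rng(0)
q = rng.normal(size=(50, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
t = rng.normal(size=(200, 16)); t /= np.linalg.norm(t, axis=1, keepdims=True)
print("injected tool indices:", greedy_cover(q, t, cos_dist_thresh=0.9))
```
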
cs.CL / 26 / 2603.13962

sebis at ArchEHR-QA 2026: How Much Can You Do Locally? Evaluating Grounded EHR QA on a Single Notebook

sebis在ArchEHR-QA 2026:本地能做到多少?在单个笔记本上评估基于电子健康记录的问答系统
Yurt, Ibrahim Ebrar, Karl, Fabian, Choppa, Tejaswi, Matthes, Florian
Abstract
Clinical question answering over electronic health records (EHRs) can help clinicians and patients access relevant medical information more efficiently. However, many recent approaches rely on large cloud-based models, which are difficult to deploy in clinical environments due to privacy constraints and computational requirements. In this work, we investigate how far grounded EHR question answering can be pushed when restricted to a single notebook. We participate in all four subtasks of the ArchEHR-QA 2026 shared task and evaluate several approaches designed to run on commodity hardware. All experiments are conducted locally without external APIs or cloud infrastructure. Our results show that such systems can achieve competitive performance on the shared task leaderboards. In particular, our submissions perform above average in two subtasks, and we observe that smaller models can approach the performance of much larger systems when properly configured. These findings suggest that privacy-preserving EHR QA systems running fully locally are feasible with current models and commodity hardware. The source code is available at https://github.com/ibrahimey/ArchEHR-QA-2026.
Chinese Translation
基于电子健康记录(EHR)的临床问答可以帮助临床医生和患者更高效地获取相关的医学信息。然而,许多近期的方法依赖于大型云端模型,这在临床环境中由于隐私限制和计算需求而难以部署。在本研究中,我们探讨了在仅限于单个笔记本的情况下,基于EHR的问答系统能够推进到何种程度。我们参与了ArchEHR-QA 2026共享任务的所有四个子任务,并评估了几种旨在在普通硬件上运行的方法。所有实验均在本地进行,没有使用外部API或云基础设施。我们的结果表明,这类系统在共享任务的排行榜上能够达到具有竞争力的表现。特别是,我们的提交在两个子任务中表现优于平均水平,并且我们观察到,当适当配置时,较小的模型可以接近更大系统的性能。这些发现表明,完全本地运行的隐私保护EHR问答系统在当前模型和普通硬件上是可行的。源代码可在 https://github.com/ibrahimey/ArchEHR-QA-2026 获取。
cs.CL / 27 / 2603.13972

FLUX: Data Worth Training On

FLUX:值得训练的数据
Gowtham, Rupesh, Sai, Kumar, Sanjay, Saravanan, Chaithanya, Venkata
Abstract
Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). FLUX achieves the same aggregate score as a model trained on DCLM data using only 39B tokens, resulting in a 34.4% reduction in training compute. At the data level, FLUX extracts 50B usable tokens from a single dump (CC-MAIN-2025-51), compared to 40B from DCLM (+25% retention). FLUX-Base yields 192B tokens, exceeding FineWeb's 170B while still maintaining superior quality. Overall, FLUX establishes a new state of the art in web-scale data preprocessing by demonstrating that high retention, strong quality control, and computational efficiency can be achieved simultaneously, redefining the limits of scalable dataset construction for modern language models.
Chinese Translation
现代大型语言模型的训练不再受限于数据的可用性,而是受限于现有预处理管道无法同时实现大规模和高数据质量的能力。目前的方法被迫在两者之间做出牺牲:要么通过激进过滤来提高质量,但代价是严重的标记损失,要么保留大量数据,但引入了大量噪声。在本研究中,我们提出了FLUX,这是一种专门设计的预处理管道,旨在打破这一长期存在的权衡,通过最大化标记保留同时实施严格的质量控制。基于FLUX精心挑选的数据训练的模型始终优于之前的方法。一个在60B标记上使用FLUX训练的3B参数模型达到了32.14%的MMLU准确率,超过了之前的最先进管道DCLM(31.98%),并显著超越了FineWeb(29.88%)。FLUX在仅使用39B标记的情况下,达到了与基于DCLM数据训练的模型相同的总分,从而实现了34.4%的训练计算量减少。在数据层面,FLUX从单个数据转储(CC-MAIN-2025-51)中提取了50B可用标记,而DCLM提取了40B(+25%的保留率)。FLUX-Base产生了192B标记,超过了FineWeb的170B,同时仍保持优越的质量。总体而言,FLUX通过证明高保留率、强质量控制和计算效率可以同时实现,确立了网络规模数据预处理的新标准,重新定义了现代语言模型可扩展数据集构建的极限。
cs.CL / 28 / 2603.14006

Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs

超越显式边缘:在嘈杂和稀疏知识图谱上的稳健推理
Gao, Hang, Metaxas, Dimitris N.
Abstract
GraphRAG is increasingly adopted for converting unstructured corpora into graph structures to enable multi-hop reasoning. However, standard graph algorithms rely heavily on static connectivity and explicit edges, often failing in real-world scenarios where knowledge graphs (KGs) are noisy, sparse, or incomplete. To address this limitation, we introduce INSES (Intelligent Navigation and Similarity Enhanced Search), a dynamic framework designed to reason beyond explicit edges. INSES couples LLM-guided navigation, which prunes noise and steers exploration, with embedding-based similarity expansion to recover hidden links and bridge semantic gaps. Recognizing the computational cost of graph reasoning, we complement INSES with a lightweight router that delegates simple queries to Naïve RAG and escalates complex cases to INSES, balancing efficiency with reasoning depth. INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks. Notably, on the MINE benchmark, it demonstrates superior robustness across KGs constructed by varying methods (KGGEN, GraphRAG, OpenIE), improving accuracy by 5%, 10%, and 27%, respectively.
Chinese Translation
GraphRAG 正在越来越多地被采用,用于将非结构化语料库转换为图结构,以实现多跳推理。然而,标准图算法在很大程度上依赖于静态连接性和显式边缘,常常在知识图谱(KGs)嘈杂、稀疏或不完整的现实场景中失效。为了解决这一局限性,我们提出了 INSES(智能导航和相似性增强搜索),这是一个旨在超越显式边缘进行推理的动态框架。INSES 将负责剪除噪声、引导探索的 LLM 引导导航,与用于恢复隐藏链接、弥合语义差距的基于嵌入的相似性扩展相结合。考虑到图推理的计算成本,我们为 INSES 配备了一个轻量级路由器,将简单查询委派给 Naïve RAG,并将复杂案例升级到 INSES,从而在效率与推理深度之间取得平衡。INSES 在多个基准测试中始终优于 SOTA RAG 和 GraphRAG 基线。值得注意的是,在 MINE 基准上,它在通过不同方法构建的知识图谱(KGGEN、GraphRAG、OpenIE)中表现出卓越的稳健性,准确率分别提高了 5%、10% 和 27%。
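
The similarity-expansion step can be sketched directly: when a node's explicit edges run dry, propose non-adjacent nodes whose embeddings sit close in cosine space. A toy version follows; `expand_neighbors`, the node names, the threshold, and the top-k cutoff are illustrative, not the authors' implementation.

```python
import numpy as np

def expand_neighbors(node, explicit_edges, emb, sim_thresh=0.7, k=3):
    """Augment a node's explicit KG neighbors with implicit ones.

    Implicit neighbors are the top-k non-adjacent nodes whose embedding
    cosine similarity to `node` exceeds `sim_thresh`, a stand-in for the
    similarity-expansion idea described above (thresholds illustrative)."""
    names = list(emb)
    vecs = np.array([emb[n] for n in names], dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs[names.index(node)]
    candidates = [
        (s, n) for s, n in zip(sims, names)
        if n != node and n not in explicit_edges.get(node, set()) and s > sim_thresh
    ]
    implicit = [n for _, n in sorted(candidates, reverse=True)[:k]]
    return sorted(explicit_edges.get(node, set())) + implicit

# Toy KG: 'aspirin' has one explicit edge; embeddings hint at a hidden link.
edges = {"aspirin": {"pain"}}
emb = {"aspirin": [1.0, 0.1], "pain": [0.9, 0.3],
       "inflammation": [0.95, 0.15], "weather": [0.0, 1.0]}
print(expand_neighbors("aspirin", edges, emb))  # ['pain', 'inflammation']
```
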
cs.CL / 29 / 2603.14027

SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

SemEval-2026 任务 6:CLARITY -- 揭示政治问题的回避
Thomas, Konstantinos, Filandrianos, Giorgos, Lymperaiou, Maria, Zerva, Chrysoula, Stamou, Giorgos
Abstract
Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.
Chinese Translation
政治发言者常常在保持回应外表的同时,避免直接回答问题。尽管这种战略性回避对公共话语的重要性不言而喻,但在自然语言处理领域仍然未得到充分探索。我们介绍了 SemEval-2026 任务 6,CLARITY,这是一个关于政治问题回避的共享任务,包含两个子任务:(i)清晰度级别分类,分为清晰回复、模棱两可和清晰非回复;(ii)回避级别分类,分为九种细化的回避策略。该基准数据集由美国总统采访构建,并遵循专家基础的回复清晰度和回避的分类法。该任务吸引了 124 支注册团队,提交了 946 个有效的清晰度级别分类结果和 539 个回避级别分类结果。结果显示两个子任务之间存在显著的难度差距:最佳系统在清晰度分类上达到了 0.89 的宏观 F1 值,远超最强基线,而顶级回避级别系统则达到了 0.68 的宏观 F1 值,与最佳基线持平。总体而言,大型语言模型提示和对分类法的层次化利用被证明是最有效的策略,顶级系统在表现上始终优于那些将两个子任务独立处理的系统。CLARITY 将政治回应回避确立为计算话语分析的一个具有挑战性的基准,并突显了在政治语言中建模战略模糊性的难度。
cs.CL / 30 / 2603.14053

NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

NepTam:尼泊尔语-塔曼语平行语料库及基线机器翻译实验
Ghimire, Rupak Raj, Subedi, Bipesh, Prasain, Balaram, Poudyal, Prakash, Acharya, Praveen, Karki, Nischal, Tiwari, Rupak, Sharma, Rishikesh Kumar, Poudel, Jenny, Bal, Bal Krishna
Abstract
Modern translation systems rely heavily on high-quality, large parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most of the South Asian languages. Among them, Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in the NepTam20K dataset), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out using various multilingual pre-trained models: mBART, M2M-100, NLLB-200, and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores of 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
Chinese Translation
现代翻译系统在很大程度上依赖于高质量、大规模的平行数据集以实现最先进的性能。然而,对于大多数南亚语言而言,这类资源大多不可用。其中,尼泊尔语和塔曼语属于此类,塔曼语在该地区被认为是数字资源最少的语言之一。本研究通过开发NepTam20K(一个包含2万条金标准平行语料的语料库)和NepTam80K(一个包含8万条合成尼泊尔语-塔曼语平行语料的语料库),填补了这一空白,这两个语料库均为句子对齐,并旨在支持机器翻译。这些数据集的创建经过了一个流程,包括从尼泊尔新闻和在线来源抓取数据、预处理、语义过滤、时态和极性平衡(在NepTam20K数据集中)、由该语言的母语者进行的塔曼语专家翻译,以及由塔曼语言学专家进行的验证。该数据集涵盖五个领域:农业、健康、教育与技术、文化和一般交流。为了评估该数据集,使用多种多语言预训练模型(如mBART、M2M-100、NLLB-200和基础Transformer模型)进行了基线机器翻译实验。在NLLB-200上的微调达到了最高的sacreBLEU分数,分别为40.92(尼泊尔语-塔曼语)和45.26(塔曼语-尼泊尔语)。
cs.CL / 31 / 2603.14078

CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification

CMHL:用于情感一致性文本分类的对比多头学习
Elgabry, Menna, Hamdi, Ali, Shaban, Khaled
Abstract
Textual Emotion Classification (TEC) is one of the most difficult NLP tasks. State-of-the-art approaches rely on large language models (LLMs) and multi-model ensembles. In this study, we challenge the assumption that larger scale or more complex models are necessary for improved performance. To improve logical consistency, we introduce CMHL, a novel single-model architecture that explicitly models the logical structure of emotions through three key innovations: (1) multi-task learning that jointly predicts primary emotions, valence, and intensity, (2) psychologically grounded auxiliary supervision derived from Russell's circumplex model, and (3) a novel contrastive contradiction loss that enforces emotional consistency by penalizing mutually incompatible predictions (e.g., simultaneous high confidence in joy and anger). With just 125M parameters, our model outperforms 56x larger LLMs and sLM ensembles with a new state-of-the-art F1 score of 93.75% (vs. 86.13%-93.2%) on the dair-ai Emotion dataset. We further show cross-domain generalization on the Reddit Suicide Watch and Mental Health Collection dataset (SWMH), outperforming domain-specific models like MentalBERT and MentalRoBERTa with an F1 score of 72.50% (vs. 68.16%-72.16%) and a 73.30% recall (vs. 67.05%-70.89%), which translates to enhanced sensitivity for detecting mental health distress. Our work establishes that architectural intelligence (not parameter count) drives progress in TEC. By embedding psychological priors and explicit consistency constraints, a well-designed single model can outperform both massive LLMs and complex ensembles, offering an efficient, interpretable, and clinically relevant paradigm for affective computing.
Chinese Translation
文本情感分类(TEC)是最具挑战性的自然语言处理任务之一。当前最先进的方法依赖于大型语言模型(LLMs)和多模型集成。在本研究中,我们挑战了更大规模或更复杂模型是提高性能所必需的假设。为了提高逻辑一致性,我们引入了CMHL,这是一种新颖的单模型架构,通过三项关键创新明确建模情感的逻辑结构:(1)多任务学习,联合预测主要情感、效价和强度;(2)基于心理学的辅助监督,源自拉塞尔的圆周模型;(3)一种新颖的对比矛盾损失,通过惩罚相互不兼容的预测(例如,同时对快乐和愤怒的高置信度)来强制情感一致性。我们的模型仅有1.25亿个参数,却在dair-ai情感数据集上以93.75%的全新最先进F1分数(相较于86.13%-93.2%)超越了规模为其56倍的LLMs和sLM集成。我们进一步展示了在Reddit自杀观察和心理健康集合数据集(SWMH)上的跨领域泛化,超越了像MentalBERT和MentalRoBERTa这样的领域特定模型,F1分数为72.50%(相较于68.16%-72.16%),并且召回率为73.30%(相较于67.05%-70.89%),这转化为提高了对心理健康困扰的敏感性。我们的工作确立了架构智能(而非参数数量)推动TEC进展的观点。通过嵌入心理学先验和明确的一致性约束,一个设计良好的单模型可以超越庞大的LLMs和复杂的集成,为情感计算提供一种高效、可解释且临床相关的范式。
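
The contrastive contradiction loss invites a concrete reading: penalize a batch whenever two mutually incompatible emotion classes both receive high probability. A minimal PyTorch sketch under that reading follows; the pair list, the product-of-probabilities form, and the weighting are assumptions, not the paper's exact loss.

```python
import torch

# Illustrative incompatible-emotion pairs (indices into the label space);
# the paper's full pair set and loss weighting are not reproduced here.
INCOMPATIBLE = [(0, 1)]  # e.g., 0 = joy, 1 = anger

def contradiction_loss(logits: torch.Tensor) -> torch.Tensor:
    """Penalize simultaneous high confidence in incompatible emotions.

    For each incompatible pair (i, j), the penalty is the product of the
    two class probabilities, which is large only when both are high --
    one plausible form of a contrastive contradiction loss."""
    probs = torch.softmax(logits, dim=-1)
    loss = logits.new_zeros(())
    for i, j in INCOMPATIBLE:
        loss = loss + (probs[:, i] * probs[:, j]).mean()
    return loss

logits = torch.tensor([[2.0, 2.0, -1.0],   # confident in joy AND anger: penalized
                       [3.0, -2.0, 0.0]])  # confident in joy only: near-zero penalty
print(contradiction_loss(logits))  # add to the multi-task objective with a weight
```
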
cs.CL / 32 / 2603.14111

OasisSimp: An Open-source Asian-English Sentence Simplification Dataset

OasisSimp:一个开源的亚洲英语句子简化数据集
Liu, Hannah, Tian, Muxin, Ali, Iqra, Gao, Haonan, Wu, Qiaoyiwen, Yang, Blair, Thayasivam, Uthayasanker, Lee, En-Shiun Annie, Nakwijit, Pakawat, Ranathunga, Surangika, Shekhar, Ravi
Abstract
Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at https://OasisSimpDataset.github.io/.
Chinese Translation
句子简化旨在通过减少语言复杂性,同时保持原意,使复杂文本更易于接近。然而,由于高质量数据的稀缺,这一领域在中资源和低资源语言上的进展仍然有限。为了解决这一问题,我们引入了OasisSimp数据集,这是一个涵盖英语、僧伽罗语、泰米尔语、普什图语和泰语五种语言的多语种句子级简化数据集。在这些语言中,泰语、普什图语和泰米尔语均没有先前的句子简化数据集,而僧伽罗语的数据也十分有限。每种语言的简化数据集都是由经过培训的注释员根据详细指南创建的,他们在简化句子的同时保持意义、流畅性和语法正确性。我们在OasisSimp数据集上评估了八种开源多语种大型语言模型(LLMs),并观察到高资源语言和低资源语言之间存在明显的性能差异,这突显了多语种环境中的简化挑战。因此,OasisSimp数据集不仅提供了一个宝贵的多语种资源,还设置了一个具有挑战性的基准,揭示了当前基于LLM的简化方法的局限性,并为未来低资源句子简化研究铺平了道路。数据集可在 https://OasisSimpDataset.github.io/ 获取。
cs.CL / 33 / 2603.14130

The GELATO Dataset for Legislative NER

GELATO 数据集用于立法命名实体识别
Flynn, Matthew, Obiso, Timothy, Newman, Sam
Abstract
This paper introduces GELATO (Government, Executive, Legislative, and Treaty Ontology), a dataset of U.S. House and Senate bills from the 118th Congress annotated using a novel two-level named entity recognition ontology designed for U.S. legislative texts. We fine-tune transformer-based models (BERT, RoBERTa) of different architectures and sizes on this dataset for first-level prediction. We then use LLMs with optimized prompts to complete the second-level prediction. The strong performance of RoBERTa and relatively weak performance of BERT models, as well as the application of LLMs as second-level predictors, support future research in legislative NER or downstream tasks using these model combinations as extraction tools.
Chinese Translation
本文介绍了GELATO(政府、行政、立法和条约本体),这是一个包含第118届国会美国众议院和参议院法案的数据集,采用了一种针对美国立法文本设计的新型两级命名实体识别本体进行标注。我们在该数据集上对不同架构和规模的基于变换器的模型(BERT、RoBERTa)进行了微调,以进行第一层预测。随后,我们使用优化提示的LLMs(大语言模型)完成第二层预测。RoBERTa的强劲表现和BERT模型相对较弱的表现,以及将LLMs作为第二层预测器的应用,支持未来在立法命名实体识别或使用这些模型组合作为提取工具的下游任务中的研究。
cs.CL / 34 / 2603.14145

MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

MMOU:一个大规模多任务全方位理解与推理基准,针对长期且复杂的现实世界视频
Goel, Arushi, Ghosh, Sreyan, Agarwal, Vatsal, Anand, Nishit, Jayakumar, Kaousheik, Koroshinadze, Lasha, Xu, Yao, Lyons, Katie, Case, James, Sapra, Karan, Shih, Kevin J., Gururani, Siddharth, Shrivastava, Abhinav, Duraiswami, Ramani, Manocha, Dinesh, Tao, Andrew, Catanzaro, Bryan, Shoeybi, Mohammad, Ping, Wei
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉和音频理解方面的表现优异,但在对长期复杂视频中的全模态(视觉、音频和文本)信号进行联合推理的能力上仍然未得到充分探索。我们提出了MMOU,这是一个旨在在这些具有挑战性的现实世界条件下系统评估多模态理解和推理能力的新基准。MMOU包含15,000个经过精心策划的问题,配以9038个来自网络收集的不同长度的视频,涵盖多样的领域,并展现丰富且紧密结合的音视频内容。该基准涵盖13个基本技能类别,所有这些技能均要求整合跨模态和时间的证据。所有问题均由专业标注人员经过多轮手动标注,确保高质量和推理准确性。我们在MMOU上评估了20多个最先进的开源和专有多模态模型。结果揭示了显著的性能差距:最佳的闭源模型仅达到64.2%的准确率,而最强的开源模型仅为46.8%。我们的结果突出显示了长期全模态理解的挑战,揭示了当前模型在长视频中经常连基本技能都无法应用。通过详细分析,我们进一步识别了系统性失败模式,并提供了当前模型失效的具体原因及其位置的洞察。
cs.CL / 35 / 2603.14183

Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification

针对参数高效的临床文本分类的GPT架构选择性微调
Irany, Fariba Afrin, Akwafuo, Sampson
Abstract
The rapid expansion of electronic health record (EHR) systems has generated large volumes of unstructured clinical narratives that contain valuable information for disease identification, patient cohort discovery, and clinical decision support. Extracting structured knowledge from these free-text documents remains challenging because clinical language is highly specialized, labeled datasets are limited, and full fine-tuning of large pretrained language models can require substantial computational resources. Efficient adaptation strategies are therefore essential for practical clinical natural language processing applications. This study proposes a parameter-efficient selective fine-tuning framework for adapting GPT-2 to clinical text classification tasks. Instead of updating the entire pretrained model, the majority of network parameters are frozen, and only the final Transformer block, the final layer normalization module, and a lightweight classification head are updated during training. This design substantially reduces the number of trainable parameters while preserving the contextual representation capabilities learned during pretraining. The proposed approach is evaluated using radiology reports from the MIMIC-IV-Note dataset with automatically derived CheXpert-style labels. Experiments on 50,000 radiology reports demonstrate that selective fine-tuning achieves approximately 91% classification accuracy while updating fewer than 6% of the model parameters. Comparative experiments with head-only training and full-model fine-tuning show that the proposed method provides a favorable balance between predictive performance and computational efficiency. These results indicate that selective fine-tuning offers an efficient and scalable framework for clinical text classification.
Chinese Translation
电子健康记录(EHR)系统的快速扩展产生了大量包含有价值信息的非结构化临床叙述,这些信息对于疾病识别、患者群体发现和临床决策支持至关重要。从这些自由文本文档中提取结构化知识仍然具有挑战性,因为临床语言高度专业化,标注数据集有限,并且对大型预训练语言模型进行全面微调可能需要大量计算资源。因此,高效的适应策略对于实际的临床自然语言处理应用至关重要。本研究提出了一种参数高效的选择性微调框架,用于将GPT-2适应于临床文本分类任务。该方法并不更新整个预训练模型,而是将大部分网络参数冻结,仅在训练过程中更新最后的Transformer块、最后的层归一化模块和一个轻量级分类头。这一设计大大减少了可训练参数的数量,同时保留了在预训练过程中学习到的上下文表示能力。所提出的方法使用来自MIMIC-IV-Note数据集的放射学报告进行评估,并自动生成CheXpert风格的标签。在50,000份放射学报告上的实验表明,选择性微调在更新不到6%的模型参数的情况下实现了约91%的分类准确率。与仅训练头部和全模型微调的比较实验显示,所提出的方法在预测性能和计算效率之间提供了良好的平衡。这些结果表明,选择性微调为临床文本分类提供了一种高效且可扩展的框架。
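
The freezing pattern the abstract describes takes only a few lines with Hugging Face transformers. A sketch for GPT-2 follows: everything is frozen except the last transformer block, the final layer norm, and the classification head. The checkpoint, label count, and pad-token handling are illustrative; the printed trainable fraction lands near the sub-6% figure for GPT-2 small, though the paper's exact setup may differ.

```python
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = model.config.eos_token_id  # GPT-2 has no pad token

# Freeze everything, then unfreeze only the final transformer block,
# the final layer norm, and the classification head, per the abstract.
for p in model.parameters():
    p.requires_grad = False
for module in (model.transformer.h[-1], model.transformer.ln_f, model.score):
    for p in module.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / total:.1%} of {total:,} parameters")
```
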
cs.CL / 36 / 2603.14210

Vavanagi: a Community-run Platform for Documentation of the Hula Language in Papua New Guinea

Vavanagi:一个由社区运营的巴布亚新几内亚胡拉语文档平台
Olewale, Bri, Merx, Raphael, Vylomova, Ekaterina
Abstract
We present Vavanagi, a community-run platform for Hula (Vula'a), an Austronesian language of Papua New Guinea with approximately 10,000 speakers. Vavanagi supports crowdsourced English-Hula text translation and voice recording, with elder-led review and community-governed data infrastructure. To date, 77 translators and 4 reviewers have produced over 12k parallel sentence pairs covering 9k unique Hula words. We also propose a multi-level framework for measuring community involvement, from consultation to fully community-initiated and governed projects. We position Vavanagi at Level 5: initiative, design, implementation, and data governance all sit within the Hula community, making it, to our knowledge, the first community-led language technology initiative for a language of this size. Vavanagi shows how language technology can bridge village-based and urban members, connect generations, and support cultural heritage on the community's own terms.
Chinese Translation
我们介绍了Vavanagi,这是一个由社区运营的平台,旨在记录胡拉语(Vula'a),这是一种在巴布亚新几内亚使用的南岛语系语言,约有10,000名使用者。Vavanagi支持众包的英语-胡拉语文本翻译和语音录制,由长者主导审查,并由社区管理数据基础设施。迄今为止,77名翻译者和4名审查员已制作超过12,000对平行句子,涵盖9,000个独特的胡拉语单词。我们还提出了一个多层次框架,用于衡量社区参与程度,从咨询到完全由社区发起和管理的项目。我们将Vavanagi定位于第5级:倡议、设计、实施和数据治理均由胡拉社区内部进行,这使其成为我们所知的第一个针对如此规模语言的社区主导语言技术倡议。Vavanagi展示了语言技术如何在村庄与城市成员之间架起桥梁,连接不同代际,并在社区自身的条件下支持文化遗产。
cs.CL / 37 / 2603.14217

Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective

重新思考检索增强个性化对话中的评估:认知与语言学视角
Zhang, Tianyi, Traum, David
Abstract
In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.
Chinese Translation
在认知科学和语言理论中,对话并非被视为一连串独立的言语,而是作为一种由连贯性、一致性和共享理解支撑的共同活动。然而,许多开放领域和个性化对话系统使用表层相似性度量(例如,BLEU、ROUGE、F1)作为其主要报告指标之一,这些度量无法捕捉对话质量的更深层次方面。我们重新审视一个显著的检索增强个性化对话框架LAPDOG,作为评估方法的案例研究。通过人类和基于大语言模型(LLM)的评审,我们识别出当前评估实践的局限性,包括对话历史的损坏、检索故事与个性之间的矛盾,以及不连贯的回应生成。我们的结果显示,人类和LLM的判断高度一致,但与词汇相似性度量存在差异,强调了基于认知的评估方法的必要性。总体而言,这项工作为检索增强对话系统提供了更可靠的评估框架,能够更好地反映自然人类沟通的原则。
cs.CL / 38 / 2603.14239

QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

QiMeng-CodeV-SVA:通过基于 RTL 的双向数据合成训练专用 LLM 以生成硬件断言
Wu, Yutong, Cao, Chenrui, Jin, Pengwei, Huang, Di, Zhang, Rui, Zhang, Xishan, Du, Zidong, Guo, Qi, Hu, Xing
Abstract
SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
Chinese Translation
SystemVerilog 断言 (SVA) 对于硬件验证至关重要。近期研究利用通用 LLM 将自然语言属性转换为 SVA (NL2SVA),但由于数据有限,表现不佳。我们提出了一种数据合成框架,以应对两个挑战:高质量真实世界 SVA 语料库的稀缺以及缺乏可靠的方法来确定 NL-SVA 语义等价性。针对前者,我们使用大规模开源 RTL 来指导 LLM 生成真实世界的 SVA;针对后者,双向翻译作为数据选择方法。通过合成的数据,我们训练了 CodeV-SVA,一系列 SVA 生成模型。值得注意的是,CodeV-SVA-14B 在 NL2SVA-Human 上达到了 75.8%,在 NL2SVA-Machine 上达到了 84.0% 的 Func.@1,表现与先进的 LLM(如 GPT-5 和 DeepSeek-R1)相当或更优。
cs.CL / 39 / 2603.14251

Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

通过推理路径偏差监测缓解大型推理语言模型中的过度思考
Guan, Weixin, Li, Liang, Liu, Jiapeng, Li, Bing, Fu, Peng, Fang, Chengyang, Hao, Xiaoshuai, Ma, Can, Wang, Weiping
Abstract
Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies have been proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLM performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages a path deviation index over the frequency of high-entropy transition tokens as a dedicated monitoring metric to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.
Chinese Translation
大型推理语言模型(LRLMs)通过利用长链式思维推理在复杂任务上展现出令人印象深刻的能力。然而,它们容易出现过度思考,导致冗余的推理步骤,从而降低性能和效率。最近,提出了早期退出策略,通过动态和自适应地终止冗余推理来缓解过度思考。然而,目前的早期退出方法要么通过依赖代理模型引入额外的训练开销,要么由于推理和生成探测答案之间频繁的内容切换而限制推理吞吐量。此外,大多数早期退出方法由于过度截断而损害LRLMs的性能。我们的见解源于一个观察:过度思考常常导致LRLMs偏离正确的推理路径,这通常伴随着高熵过渡标记的出现。基于此,我们提出了一种与原生推理过程深度耦合的早期退出方法,该方法利用路径偏差指数作为专门的监测指标,以动态检测和终止过度思考轨迹中高熵过渡标记的频繁出现。我们在多个基准上使用不同类型和规模的LRLMs进行了实验,结果表明,我们的方法在与现有早期退出方法相比时,相较于普通链式思维(vanilla CoT)提供了最大的性能提升。
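
The monitoring idea sketches naturally as a rolling entropy check during decoding. Below, a stand-in for the path deviation index tracks the recent fraction of high-entropy sampling steps and signals termination once it stays high; the `DeviationMonitor` class, window size, and thresholds are hypothetical, not the paper's.

```python
import math
from collections import deque

class DeviationMonitor:
    """Track how often recent decoding steps were sampled at high entropy.

    A rolling fraction of high-entropy steps serves as a stand-in for the
    paper's path deviation index; window and thresholds are illustrative."""
    def __init__(self, window=64, entropy_thresh=3.0, deviation_thresh=0.4):
        self.flags = deque(maxlen=window)
        self.entropy_thresh = entropy_thresh
        self.deviation_thresh = deviation_thresh

    def should_exit(self, token_probs) -> bool:
        entropy = -sum(p * math.log(p) for p in token_probs if p > 0)
        self.flags.append(entropy > self.entropy_thresh)
        full = len(self.flags) == self.flags.maxlen
        return full and sum(self.flags) / len(self.flags) > self.deviation_thresh

# During decoding, feed each step's next-token distribution; terminate the
# reasoning trace (e.g., force an end-of-think token) once this fires.
monitor = DeviationMonitor()
uniform = [1 / 50] * 50  # entropy ~3.9 nats: a high-entropy transition step
for step in range(64):
    if monitor.should_exit(uniform):
        print(f"early exit at step {step}")
        break
```
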
cs.CL / 40 / 2603.14257

Automatic Inter-document Multi-hop Scientific QA Generation

自动化跨文档多跳科学问答生成
Lee, Seungmin, Kim, Dongha, Jeon, Yuni, Koh, Junyoung, Song, Min
Abstract
Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.
Chinese Translation
现有的自动化科学问题生成研究主要集中在单文档事实问答上,而忽视了对科学理解至关重要的跨文档推理。我们提出了AIM-SciQA,这是一个用于生成多文档、多跳科学问答数据集的自动化框架。AIM-SciQA利用大型语言模型(LLMs)和机器阅读理解提取单跳问答,并基于嵌入式语义对齐构建跨文档关系,同时选择性地利用引用信息。应用于8,211篇PubMed Central论文,它生成了411,409个单跳和13,672个多跳问答,形成了IM-SciQA数据集。人工和自动验证证实了其高事实一致性,实验结果表明IM-SciQA有效地区分了检索和问答阶段的推理能力,为增强检索的科学推理提供了一个现实且可解释的基准。我们进一步扩展了该框架,构建了CIM-SciQA,这是一个引用引导的变体,其性能与Oracle设置相当,强化了数据集的有效性和推广性。
cs.CL / 41 / 2603.14265

MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

MedPriv-Bench:评估医疗开放式问题回答中大型语言模型的隐私-效用权衡
Guan, Shaowei, Zhai, Yu, Kwok, Hin Chi, Du, Jiawei, Feng, Xinyu, Li, Jing, Qin, Harry, Hui, Vivian
Abstract
Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
Chinese Translation
近期检索增强生成(Retrieval-Augmented Generation, RAG)的进步使得大型语言模型(Large Language Models, LLMs)能够基于临床证据来生成输出。然而,将LLMs与外部数据库连接可能引发上下文泄露的风险:这是一种微妙的隐私威胁,其中独特的医疗细节组合可能在没有明确标识符的情况下实现患者重识别。当前医疗保健领域的基准测试主要集中于准确性,忽视了这种隐私问题,尽管存在诸如《健康保险可携带性与责任法案》(Health Insurance Portability and Accountability Act, HIPAA)和《一般数据保护条例》(General Data Protection Regulation, GDPR)等严格的法规。为填补这一空白,我们提出了MedPriv-Bench,这是第一个专门设计用来共同评估医疗开放式问题回答中的隐私保护和临床效用的基准。我们的框架利用一种多代理的人机协作流程,综合敏感的医疗背景和临床相关查询,从而施加现实的隐私压力。我们建立了一个标准化的评估协议,利用预训练的RoBERTa-自然语言推理(Natural Language Inference, NLI)模型作为自动化评判工具,以量化数据泄露的程度,与人类专家的对比达到了85.9%的平均一致性。通过对9个具有代表性的LLMs进行广泛评估,我们展示了普遍存在的隐私-效用权衡。我们的研究结果强调了在隐私敏感环境中验证医疗人工智能系统安全性和有效性所需的领域特定基准的重要性。
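
The judging protocol can be approximated with a public NLI checkpoint. The sketch below uses roberta-large-mnli (a real Hugging Face checkpoint, though not necessarily the one used by the benchmark) to flag an answer as leaking when it entails a sensitive contextual fact; the `leaks` helper, threshold, and example are illustrative.

```python
from transformers import pipeline

# roberta-large-mnli is a public NLI checkpoint; the benchmark's actual
# judge, prompts, and leakage criteria are more involved than this sketch.
nli = pipeline("text-classification", model="roberta-large-mnli")

def leaks(answer: str, sensitive_fact: str, thresh=0.9) -> bool:
    """Flag leakage when the answer entails a sensitive contextual fact."""
    out = nli({"text": answer, "text_pair": sensitive_fact}, top_k=None)
    scores = {d["label"]: d["score"] for d in out}
    return scores.get("ENTAILMENT", 0.0) > thresh

answer = "The 42-year-old patient from Ward B was started on insulin."
fact = "The patient is 42 years old."
print(leaks(answer, fact))  # True -> contextual leakage despite no identifier
```
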
cs.CL / 42 / 2603.14303

SemantiCache: Efficient KV Cache Compression via Semantic Chunking and Clustered Merging

SemantiCache:通过语义分块和聚类合并实现高效的KV缓存压缩
Wu, Shunlong, Lin, Hai, Chen, Shaoshen, Lu, Tingwei, Zeng, Yongqin, Zhan, Shaoxiong, Zheng, Hai-Tao, Kim, Hong-Gee
Abstract
Existing KV cache compression methods generally operate on discrete tokens or non-semantic chunks. However, such approaches often lead to semantic fragmentation, where linguistically coherent units are disrupted, causing irreversible information loss and degradation in model performance. To address this, we introduce SemantiCache, a novel compression framework that preserves semantic integrity by aligning the compression process with the semantic hierarchical nature of language. Specifically, we first partition the cache into semantically coherent chunks by delimiters, which are natural semantic boundaries. Within each chunk, we introduce a computationally efficient Greedy Seed-Based Clustering (GSC) algorithm to group tokens into semantic clusters. These clusters are further merged into semantic cores, enhanced by a Proportional Attention mechanism that rebalances the reduced attention contributions of the merged tokens. Extensive experiments across diverse benchmarks and models demonstrate that SemantiCache accelerates the decoding stage of inference by up to 2.61 times and substantially reduces memory footprint, while maintaining performance comparable to the original model.
Chinese Translation
现有的KV缓存压缩方法通常在离散的标记或非语义块上操作。然而,这种方法往往导致语义碎片化,破坏了语言上连贯的单位,造成不可逆的信息丢失和模型性能的下降。为了解决这个问题,我们提出了SemantiCache,一种新颖的压缩框架,通过将压缩过程与语言的语义层次特性对齐,从而保持语义完整性。具体而言,我们首先通过分隔符将缓存划分为语义上连贯的块,这些分隔符是自然的语义边界。在每个块内,我们引入了一种计算效率高的贪婪种子基础聚类(Greedy Seed-Based Clustering, GSC)算法,将标记分组为语义簇。这些簇进一步合并为语义核心,并通过比例注意机制(Proportional Attention)增强,重新平衡合并标记的注意力贡献。广泛的实验结果表明,SemantiCache在各种基准和模型上加速了推理的解码阶段,速度提升可达2.61倍,并显著减少了内存占用,同时保持与原始模型相当的性能。
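
The Greedy Seed-Based Clustering step admits a compact sketch: tokens are scanned once, each joining the first seed it is sufficiently similar to, otherwise founding a new cluster. A minimal version over toy embeddings follows; the similarity threshold is illustrative, and the subsequent core-merging and proportional attention rescaling are not reproduced here.

```python
import numpy as np

def greedy_seed_cluster(token_vecs, sim_thresh=0.8):
    """Single-pass clustering: each token joins the first seed it is
    similar enough to (cosine), otherwise it becomes a new seed.
    Returns clusters of token indices. Cost is O(n * n_clusters)."""
    vecs = token_vecs / np.linalg.norm(token_vecs, axis=1, keepdims=True)
    seeds, clusters = [], []
    for i, v in enumerate(vecs):
        for c, s in enumerate(seeds):
            if float(v @ s) >= sim_thresh:
                clusters[c].append(i)
                break
        else:
            seeds.append(v)
            clusters.append([i])
    return clusters

# Within one delimiter-bounded chunk, these clusters would then be merged
# into semantic cores whose attention weight is rescaled proportionally.
rng = np.random.default_rng(1)
base = rng.normal(size=(2, 32))
toks = np.vstack([base[0] + 0.05 * rng.normal(size=32) for _ in range(4)]
                 + [base[1] + 0.05 * rng.normal(size=32) for _ in range(3)])
print(greedy_seed_cluster(toks))  # two clusters: [0..3] and [4..6]
```
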
cs.CL / 43 / 2603.14313

Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models

关注变化:利用大型语言模型解码联邦公开市场委员会声明中的货币政策立场
Tang, Yixuan, Yang, Yi
Abstract
Federal Open Market Committee (FOMC) statements are a major source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A central task is therefore to measure the hawkish-dovish stance conveyed in these texts. Existing approaches typically treat stance detection as a standard classification problem, labeling each statement in isolation. However, the interpretation of monetary-policy communication is inherently relative: market reactions depend not only on the tone of a statement, but also on how that tone shifts across meetings. We introduce Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting shifts. Rather than relying on manual hawkish-dovish labels, DCS uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements. A delta-consistency objective encourages changes in absolute scores to align with the relative shifts. This allows DCS to recover a temporally coherent stance trajectory without manual labels. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish-dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.
Chinese Translation
联邦公开市场委员会(FOMC)声明是货币政策信息的重要来源,即使是措辞中的微小变化也能影响全球金融市场。因此,一个核心任务是衡量这些文本中传达的鹰派-鸽派立场。现有的方法通常将立场检测视为一个标准分类问题,孤立地标记每个声明。然而,货币政策沟通的解读本质上是相对的:市场反应不仅取决于声明的语气,还取决于这种语气在不同会议之间的变化。我们提出了Delta-Consistent Scoring(DCS),这是一个无注释的框架,通过联合建模绝对立场和相对会议间变化,将冻结的大型语言模型(LLM)表示映射到连续的立场分数。DCS不依赖于手动的鹰派-鸽派标签,而是利用连续会议作为自我监督的来源。它为每个声明学习一个绝对立场分数,以及连续声明之间的相对变化分数。一个delta一致性目标鼓励绝对分数的变化与相对变化保持一致。这使得DCS能够在没有手动标签的情况下恢复一个时间上连贯的立场轨迹。在四个LLM基础模型上,DCS始终优于监督探测器和LLM作为评判基线,在句子级别的鹰派-鸽派分类中达到71.1%的准确率。生成的会议级别分数在经济上也具有重要意义:它们与通胀指标强相关,并且与国债收益率的变动显著相关。总体而言,结果表明LLM表示编码了可以通过相对时间结构恢复的货币政策信号。
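
The delta-consistency objective admits a compact statement; the following is one plausible reading of the abstract, not the paper's verified formulation. Write $a(\cdot)$ for the absolute stance head and $r(\cdot,\cdot)$ for the relative shift head over consecutive statements $x_t$ and $x_{t+1}$ (both symbols introduced here for illustration):

$$\mathcal{L}_{\Delta} = \sum_{t} \Big[ \big(a(x_{t+1}) - a(x_t)\big) - r(x_t, x_{t+1}) \Big]^2$$

Minimizing $\mathcal{L}_{\Delta}$ alongside the two heads' own self-supervised objectives ties the trajectory of absolute scores to the predicted inter-meeting shifts, which is what yields a temporally coherent stance path without manual hawkish-dovish labels.
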
cs.CL / 44 / 2603.14347

Motivation in Large Language Models

大型语言模型中的动机
Nahum, Omer, Sklar, Asael, Goldstein, Ariel, Reichart, Roi
Abstract
Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We examine whether LLMs "report" varying levels of motivation, how these reports relate to their behavior, and whether external factors can influence them. Our experiments reveal consistent and structured patterns that echo human psychology: self-reported motivation aligns with different behavioral signatures, varies across task types, and can be modulated by external manipulations. These findings demonstrate that motivation is a coherent organizing construct for LLM behavior, systematically linking reports, choices, effort, and performance, and revealing motivational dynamics that resemble those documented in human psychology. This perspective deepens our understanding of model behavior and its connection to human-inspired concepts.
Chinese Translation
动机是人类行为的核心驱动力,塑造决策、目标和任务表现。随着大型语言模型(LLMs)与人类偏好的日益一致,我们探讨它们是否表现出类似于动机的特征。我们考察LLMs是否“报告”不同水平的动机,这些报告与它们的行为之间的关系,以及外部因素是否能够影响它们。我们的实验揭示了一致且有结构的模式,反映了人类心理学的特征:自我报告的动机与不同的行为特征相一致,在不同任务类型中有所变化,并且可以通过外部操控进行调节。这些发现表明,动机是LLM行为的一个连贯组织构造,系统地将报告、选择、努力和表现联系起来,并揭示出与人类心理学中记录的动机动态相似的特征。这一视角加深了我们对模型行为及其与人类启发概念之间联系的理解。
cs.CL / 45 / 2603.14355

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

通过高效多样化响应采样揭示大型语言模型中的长尾安全失败
Hajra, Suvadeep, Nandi, Palash, Chakraborty, Tanmoy
Abstract
Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
Chinese Translation
通过监督微调和人类反馈强化学习进行的安全调优显著提高了大型语言模型(LLMs)的鲁棒性。然而,这种方法往往抑制而非消除不安全行为,导致在输出分布的长尾中隐藏着稀有但关键的失败。虽然大多数红队工作强调对抗性提示搜索(输入空间优化),我们表明,安全失败也可以通过对固定安全关键提示进行多样化响应生成(输出空间探索)系统性地揭示,其中增加采样响应的数量和多样性可以使越狱成功率接近1。为了高效地发现这些失败,我们提出了渐进式多样化种群采样(Progressive Diverse Population Sampling, PDPS),该方法结合了随机令牌级采样与关注多样性的选择,以探索大量候选响应池并保留一个紧凑的、语义上多样的子集。在多个越狱基准测试和开源LLMs上,PDPS实现的攻击成功率可与大规模独立同分布(IID)采样相媲美,同时仅使用8%到29%的计算成本。在有限响应设置下,其成功率比IID采样和多样化束搜索提高了26%到40%。此外,PDPS生成的响应表现出更高数量和更大多样性的不安全输出,证明了其在揭示更广泛失败方面的有效性。
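
The diversity-aware selection stage lends itself to a compact sketch. Below is a greedy farthest-point heuristic over response embeddings, a minimal stand-in for PDPS's selection step, assuming precomputed embeddings; the `diverse_subset` helper is illustrative, and the progressive token-level sampling that feeds the pool is omitted.

```python
import numpy as np

def diverse_subset(emb, k):
    """Greedy farthest-point selection: retain a compact subset whose
    members are maximally spread out in embedding space (cosine distance).
    A stand-in for diversity-aware selection over a sampled response pool."""
    vecs = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    chosen = [0]                        # seed with an arbitrary response
    min_dist = 1.0 - vecs @ vecs[0]     # cosine distance to the chosen set
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # farthest remaining candidate
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - vecs @ vecs[nxt])
    return chosen

# Toy pool of sampled-response embeddings: keep 5 semantically diverse
# responses for evaluation instead of the full IID pool.
pool = np.random.default_rng(2).normal(size=(200, 64))
print(diverse_subset(pool, 5))
```
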
cs.CL / 46 / 2603.14400

Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

通过序数惊讶曲线和熵扩展最小对在应用领域中的应用
Katz, Andrew
Abstract
The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.
Chinese Translation
通过比较模型对相互对比的补全所赋概率的最小对范式,已被证明对评估语言模型中的语言知识很有用,但其应用主要限于语法现象的二元语法判断。此外,基于标准提示的评估需要昂贵的文本生成,可能引发事后合理化而非模型判断,并且丢弃了模型不确定性的信息。我们通过将基于惊讶的评估从二元语法对比扩展到多个领域的序数尺度分类和评分任务,解决了这两个局限性。我们不是要求模型生成答案,而是测量它们对评级尺度(例如1-5或1-9)上每个位置所赋予的信息论“惊讶”(负对数概率),从而生成完整的惊讶曲线,显示模型的优选反应和通过熵表现出的不确定性。我们在四个领域探索这一框架:社会生态技术系统分类、因果陈述识别(包括二元和尺度化)、比喻语言检测和演绎式定性编码。在这些领域,惊讶曲线产生可解释的分类信号,在预期的序数尺度位置附近出现明显的最小值,而补全上的熵往往能够区分真正模糊的条目与更容易的条目。
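
The surprisal-curve measurement is straightforward to reproduce in miniature. A sketch with Hugging Face transformers follows, scoring each rating position by the negative log probability of its token and deriving an entropy over the scale; the gpt2 checkpoint and the prompt are illustrative, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Surprisal curve over a 1-5 rating scale: score each candidate rating by
# its negative log-probability as the prompt's continuation.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "On a scale of 1 to 5, the causal strength of this statement is"
surprisals = []
for rating in ["1", "2", "3", "4", "5"]:
    ids = tok(prompt + " " + rating, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of the final token (the rating) given everything before it.
    logprobs = torch.log_softmax(logits[0, -2], dim=-1)
    surprisals.append(-logprobs[ids[0, -1]].item())

probs = torch.softmax(-torch.tensor(surprisals), dim=0)  # renormalize over scale
entropy = -(probs * probs.log()).sum()
print("surprisal curve:", [round(s, 2) for s in surprisals])
print(f"preferred rating: {probs.argmax().item() + 1}, entropy: {entropy:.2f} nats")
```
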
cs.CL / 47 / 2603.14410

BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation

BiT-MCTS:一种基于主题的双向蒙特卡洛树搜索方法用于中文小说生成
Li, Zhaoyi, Zhang, Xu, Wan, Xiaojun
Abstract
Generating long-form linear fiction from open-ended themes remains a major challenge for large language models, which frequently fail to guarantee global structure and narrative diversity when using premise-based or linear outlining approaches. We present BiT-MCTS, a theme-driven framework that operationalizes a "climax-first, bidirectional expansion" strategy motivated by Freytag's Pyramid. Given a theme, our method extracts a core dramatic conflict and generates an explicit climax, then employs a bidirectional Monte Carlo Tree Search (MCTS) to expand the plot backward (rising action, exposition) and forward (falling action, resolution) to produce a structured outline. A final generation stage realizes a complete narrative from the refined outline. We construct a Chinese theme corpus for evaluation and conduct extensive experiments across three contemporary LLM backbones. Results show that BiT-MCTS improves narrative coherence, plot structure, and thematic depth relative to strong baselines, while enabling substantially longer, more coherent stories according to automatic metrics and human judgments.
Chinese Translation
从开放主题生成长篇线性小说仍然是大型语言模型面临的主要挑战,这些模型在使用基于前提或线性大纲的方法时,常常无法保证整体结构和叙事多样性。我们提出了BiT-MCTS,这是一种以主题驱动的框架,实施了一种受弗雷塔格金字塔启发的“高潮优先、双向扩展”策略。给定一个主题,我们的方法提取核心戏剧冲突并生成明确的高潮,然后采用双向蒙特卡洛树搜索(MCTS)向后(上升动作、背景介绍)和向前(下降动作、解决)扩展情节,以生成结构化大纲。最终生成阶段根据精炼的大纲实现完整的叙事。我们构建了一个中文主题语料库用于评估,并在三个当代大型语言模型基础上进行了广泛实验。结果表明,BiT-MCTS在叙事连贯性、情节结构和主题深度方面相较于强基线有显著提升,同时根据自动评估指标和人工判断,能够生成更长且更连贯的故事。
cs.CL / 48 / 2603.14430

Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

创造性融合还是模仿?大型语言模型生成的中文文学中的类型特定同质性
Ma, Yuanchi, Shi, Kaize, He, Hui, Zhang, Zhihua, Lei, Zhongxiang, Qiu, Ziliang, Hu, Renfen, Liu, Jiamou
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp's narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.
Chinese Translation
大型语言模型(LLMs)在叙事生成方面展现了显著的能力。然而,它们往往生成结构同质化的故事,通常遵循重复的情节事件排列和组合,以及刻板的解决方案。本文提出了一种新颖的理论框架,通过结合普罗普叙事学和叙事功能进行分析。该框架用于分析LLMs生成的叙事文本的构成,以揭示其潜在的叙事逻辑。以中文网络文学为研究重点,我们扩展了普罗普的叙事理论,定义了适合现代网络叙事结构的34种叙事功能。我们进一步构建了一个人工标注的语料库,以支持对LLM生成文本中叙事结构的分析。实验表明,生成文本中单一叙事逻辑和严重同质化的主要原因在于当前的LLMs无法正确理解叙事功能的含义,而是坚持僵化的叙事生成范式。
cs.CL / 49 / 2603.14443

Echoes Across Centuries: Phonetic Signatures of Persian Poets

跨越世纪的回声:波斯诗人的音韵特征
Shahnazari, Kourosh, Ayyoubzadeh, Seyed Moein, Keshtparvar, Mohammadali
Abstract
This study examines phonetic texture in Persian poetry as a literary-historical phenomenon rather than a by-product of meter or a feature used only for classification. The analysis draws on a large corpus of 1,116,306 mesras from 31,988 poems written by 83 poets, restricted to five major classical meters to enable controlled comparison. Each line is converted into a grapheme-to-phoneme representation and analyzed using six phonetic metrics: hardness, sonority, sibilance, vowel ratio, phoneme entropy, and consonant-cluster ratio. Statistical models estimate poet-level differences while controlling for meter, poetic form, and line length. The results show that although meter and form explain a substantial portion of phonetic variation, they do not eliminate systematic differences between poets. Persian poetic sound therefore appears as conditioned variation within shared prosodic structures rather than as either purely individual style or simple metrical residue. A multidimensional stylistic map reveals several recurrent phonetic profiles, including high-sonority lyric styles, hardness-driven rhetorical or epic styles, sibilant mystical contours, and high-entropy complex textures. Historical analysis indicates that phonetic distributions shift across centuries, reflecting changes in genre prominence, literary institutions, and performance contexts rather than abrupt stylistic breaks. The study establishes a corpus-scale framework for phonetic analysis in Persian poetry and demonstrates how computational phonetics can contribute to literary-historical interpretation while remaining attentive to the formal structures that shape Persian verse.
Chinese Translation
本研究将波斯诗歌中的音韵质地视为一种文学历史现象,而非仅仅是韵律的副产品或仅用于分类的特征。分析基于来自83位诗人的31,988首诗中的1,116,306个半行诗句(mesra)的大型语料库,限制在五种主要的古典韵律,以便进行受控比较。每一行都被转换为图形到音素的表示,并使用六个音韵指标进行分析:硬度、响亮度、嘶嘶声、元音比、音素熵和辅音簇比。统计模型在控制韵律、诗歌形式和行长度的同时,估计诗人之间的差异。结果表明,尽管韵律和形式解释了音韵变异的相当一部分,但它们并未消除诗人之间的系统性差异。因此,波斯诗歌的声音表现为在共享韵律结构内的条件变异,而非纯粹的个体风格或简单的韵律残余。多维风格图揭示了几种反复出现的音韵特征,包括高响亮度的抒情风格、以硬度驱动的修辞或史诗风格、嘶嘶声的神秘轮廓和高熵的复杂质地。历史分析表明,音韵分布在几个世纪间发生变化,反映了体裁突出、文学机构和表演背景的变化,而非突发的风格断裂。本研究建立了波斯诗歌音韵分析的语料库规模框架,并展示了计算音韵学如何为文学历史解释做出贡献,同时关注塑造波斯诗歌的形式结构。
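
Some of the line-level metrics follow directly from their names. The sketch below computes three of them on a romanized phoneme string; the hardness, sonority, and sibilance scales require phoneme-class tables the abstract does not specify, so they are left out, and the `VOWELS` set and input string are illustrative.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumes a romanized grapheme-to-phoneme output

def phonetic_metrics(phonemes: str) -> dict:
    """Three of the paper's six line-level metrics on a phoneme string:
    vowel ratio, phoneme entropy, and consonant-cluster ratio."""
    counts = Counter(phonemes)
    n = sum(counts.values())
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    clusters = sum(
        1 for a, b in zip(phonemes, phonemes[1:])
        if a not in VOWELS and b not in VOWELS
    )
    return {
        "vowel_ratio": sum(counts[v] for v in VOWELS) / n,
        "phoneme_entropy": entropy,
        "cluster_ratio": clusters / max(1, n - 1),
    }

# One romanized mesra (half-line) as a toy input.
print(phonetic_metrics("beshnoazneychonhekayatmikonad"))
```
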
cs.CL / 50 / 2603.14456

PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

PARSA-Bench:全面的波斯语音频语言模型基准
Kalahroodi, Mohammad Javad Ranjbar, Amini, Mohammad, Bathayan, Parmis, Faili, Heshaam, Shakery, Azadeh
Abstract
Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
Chinese Translation
波斯语通过其古典诗歌、传统音乐和普遍的代码切换,提出了独特的音频理解挑战,而现有基准未能涵盖这些内容。我们介绍了PARSA-Bench(波斯语音推理与语音评估基准),这是第一个用于评估大型音频语言模型在波斯语言和文化上的基准,包含16个任务和超过8,000个样本,涵盖语音理解、旁语言分析和文化音频理解。十个任务为新引入,包括诗歌韵律和风格检测、传统波斯音乐理解以及代码切换检测。仅文本的基准在性能上始终优于音频对应的基准,这表明模型可能未能利用音频特有的信息,超出转录所提供的内容。文化基础的任务暴露出一种质的不同的失败模式:所有模型在韵律检测上的表现接近随机机会,无论规模如何,这表明韵律感知仍超出当前模型的能力。该数据集已在https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench上公开发布。
cs.CL / 51 / 2603.14458

Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

无知识推理的提炼:可靠大型语言模型的框架
Kietkajornrit, Auksarapak, Tarifi, Jad, Asgharbeygi, Nima
Abstract
Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves accuracy and reduces latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.
Chinese Translation
当答案依赖于最新或相互矛盾的信息时,使用大型语言模型(LLMs)进行事实寻求问答仍然不可靠。尽管增强检索和工具使用的LLMs减少了幻觉,但它们往往依赖于隐式规划,导致工具使用效率低下。我们提出了一种模块化框架,明确将规划与事实检索和答案合成分开。通过教师-学生框架训练的轻量级学生规划器生成由抽象推理步骤和可搜索的事实请求组成的结构化分解。监督信号仅包含规划痕迹和事实请求,而不提供事实答案或检索证据。在推理阶段,规划器生成计划,而经过提示工程的模块执行检索和响应合成。我们在SEAL-0上评估了所提出的框架,这是一个对搜索增强LLMs极具挑战性的基准。结果表明,与单体推理模型和基于提示的工具增强框架相比,监督规划提高了准确性并降低了延迟,证明了显式学习的规划结构对于可靠的事实寻求LLMs至关重要。
cs.CL / 52 / 2603.14463

An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

实现可验证领域掌握和幻觉控制的工业规模保险大语言模型,无需能力权衡
Zhu, Qian, Guo, Xinnan, Huo, Jingjing, Li, Jun, Liu, Pan, Yang, Wenyan, Xu, Wanqing, Lin, Xuan
Abstract
Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.
Chinese Translation
将大语言模型(LLMs)适应于保险等高风险垂直领域面临重大挑战:这些场景要求严格遵循复杂的法规和商业逻辑,且对幻觉零容忍。现有方法往往遭遇能力权衡——为领域专业知识牺牲一般智能——或严重依赖于检索增强生成(RAG),而缺乏内在推理。为弥补这一差距,我们提出了INS-S1,一个专门针对保险的LLM家族,通过一种新颖的端到端对齐范式进行训练。我们的方法具有两个方法论创新:(1)一个可验证的数据合成系统,构建用于精算推理和合规的层次数据集;(2)一个渐进式SFT-RL课程框架,将动态数据退火与经过验证的推理(RLVR)和人工智能反馈(RLAIF)的协同混合相结合。通过优化数据比例和奖励信号,该框架在防止灾难性遗忘的同时,强制执行领域约束。此外,我们发布了INSEva,这是迄今为止最全面的保险基准(超过39k个样本)。广泛的实验表明,INS-S1在领域任务上实现了最先进的性能,显著超越了DeepSeek-R1和Gemini-2.5-Pro。重要的是,它保持了顶级的一般能力,并实现了创纪录的0.6%幻觉率(HHEM)。我们的结果表明,严格的领域专业化可以在不妥协一般智能的情况下实现。
cs.CL / 53 / 2603.14473

AI Can Learn Scientific Taste

人工智能可以学习科学品味
Tong, Jingqi, Li, Mingzhe, Li, Hangcheng, Yang, Yongzhuo, Mou, Yurong, Ma, Weijie, Xi, Zhiheng, Chen, Hongji, Liu, Xiaoran, Cheng, Qinyuan, Zhang, Ming, Chen, Qiguang, Ge, Weifeng, Guo, Qipeng, Ying, Tianlei, Sun, Tianxiang, Zheng, Yining, Chen, Xinchi, Zhao, Jun, Ding, Ning, Huang, Xuanjing, Jiang, Yugang, Qiu, Xipeng
Abstract
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test sets, unseen fields, and peer-review preferences. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
Chinese Translation
伟大的科学家具有强大的判断力和前瞻性,这与我们所称的科学品味密切相关。在这里,我们使用这个术语来指代判断和提出具有高潜在影响力的研究想法的能力。然而,大多数相关研究集中在提高人工智能科学家的执行能力上,而增强人工智能的科学品味仍然未被充分探索。在本研究中,我们提出了基于社区反馈的强化学习(Reinforcement Learning from Community Feedback, RLCF),这是一种使用大规模社区信号作为监督的训练范式,并将科学品味学习表述为一个偏好建模和对齐问题。在偏好建模方面,我们在70万对高引用与低引用论文的领域和时间匹配对上训练了科学评判者(Scientific Judge)来判断想法。在偏好对齐方面,使用科学评判者作为奖励模型,我们训练了一个策略模型,科学思考者(Scientific Thinker),以提出具有高潜在影响力的研究想法。实验结果表明,科学评判者的表现优于当前最先进的大型语言模型(SOTA LLMs)(例如,GPT-5.2,Gemini 3 Pro),并且能够推广到未来年份的测试、未见领域和同行评审偏好。此外,科学思考者提出的研究想法的潜在影响力高于基线。我们的研究结果表明,人工智能可以学习科学品味,这标志着朝着实现人类水平的人工智能科学家迈出了关键一步。
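As a rough illustration of the preference-modeling step, here is a minimal Bradley-Terry style sketch over paired paper embeddings. The encoder, pooling, and loss form are assumptions; the abstract does not specify the Judge's architecture.

```python
# Pairwise preference modeling over matched high- vs. low-citation papers:
# a shared scorer should rank the high-citation idea above its matched pair.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseJudge(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # idea embedding -> scalar taste score

    def forward(self, emb_high: torch.Tensor, emb_low: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry objective: the high-citation member of each
        # field- and time-matched pair should receive the higher score.
        margin = self.scorer(emb_high) - self.scorer(emb_low)
        return -F.logsigmoid(margin).mean()

judge = PairwiseJudge()
loss = judge(torch.randn(32, 768), torch.randn(32, 768))  # a batch of matched pairs
loss.backward()
```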
cs.CL / 54 / 2603.14486

Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

无限问题生成器:通过自主工作流可验证地扩展物理推理数据
Sharan, Aditya, Hebbale, Sriram, Kumar, Dhruv
Abstract
Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ($R^2 \approx 0.95$) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.
Chinese Translation
训练大型语言模型以进行复杂推理受到可验证高质量数据稀缺的制约。在物理等领域,标准文本增强往往会引入幻觉,而静态基准缺乏微调所需的推理轨迹。我们提出了无限问题生成器(Infinite Problem Generator, IPG),这是一个自主框架,通过公式即代码(Formula-as-Code)范式合成具有保证可解性的物理问题。与概率文本生成不同,IPG 将解决方案构建为可执行的 Python 程序,强制执行严格的数学一致性。作为概念验证,我们发布了 ClassicalMechanicsV1,这是一个高保真度的经典力学问题语料库,包含从 165 个专家种子扩展而来的 1,335 个经典力学问题。该语料库展示了高结构多样性,涵盖了 102 个独特的物理公式,每个问题的平均复杂度为 3.05 个公式。此外,我们识别出一个复杂度蓝图,展示了公式数量与验证代码长度之间的强线性相关性($R^2 \approx 0.95$)。这一关系确立了代码复杂度作为问题难度的精确、无代理度量,从而实现可控的课程生成。我们发布了完整的 IPG 流程、ClassicalMechanicsV1 数据集以及我们的评估报告,以支持推理密集领域的可重复研究。
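A toy instance of the Formula-as-Code idea: the "solution" to a generated problem is an executable program whose consistency can be checked by a second, independent derivation. The projectile problem and tolerances below are illustrative, not drawn from ClassicalMechanicsV1.

```python
# Formula-as-Code, toy version: the solution is executable, so solvability
# and mathematical consistency can be verified by running it.
import math

def solve_projectile_range(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Range of a ground-launched projectile: R = v0^2 * sin(2*theta) / g."""
    return v0 ** 2 * math.sin(2 * math.radians(theta_deg)) / g

def verify(v0: float, theta_deg: float, g: float = 9.81) -> bool:
    # Independent check: recompute via time of flight and horizontal velocity.
    t_flight = 2 * v0 * math.sin(math.radians(theta_deg)) / g
    r_check = v0 * math.cos(math.radians(theta_deg)) * t_flight
    return math.isclose(solve_projectile_range(v0, theta_deg, g), r_check, rel_tol=1e-9)

assert verify(v0=20.0, theta_deg=35.0)  # the problem is solvable and self-consistent
```

Note how the verification code grows with the number of formulas involved, which is the intuition behind the paper's Complexity Blueprint.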
cs.CL / 55 / 2603.14525

MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

恶意意图数据集及增强虚假信息检测的语言模型免疫方法
Modzelewski, Arkadiusz, Sosnowski, Witold, Papadopulos, Eleni, Sartori, Elisa, Labruna, Tiziano, Martino, Giovanni Da San, Wierzbicki, Adam
Abstract
The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.
Chinese Translation
故意制造和传播虚假信息对公共话语构成了重大威胁。然而,现有的英语数据集和研究很少关注虚假信息背后的意图性。本研究提出了MALINT,这是第一个与专家事实核查员合作开发的人类标注英语语料库,旨在捕捉虚假信息及其恶意意图。我们利用这一新颖的语料库对12种语言模型进行基准测试,包括小型语言模型(SLMs)如BERT和大型语言模型(LLMs)如Llama 3.3,针对二元和多标签意图分类任务。此外,受到心理学和传播学中的免疫理论启发,我们研究了是否通过融入恶意意图的知识可以提高虚假信息检测的效果。为此,我们提出了基于意图的免疫方法,这是一种增强意图分析的推理方法,旨在减轻虚假信息的说服影响。在六个虚假信息数据集、五种LLMs和七种语言上的分析表明,增强意图的推理提高了零样本虚假信息检测的效果。为了支持意图感知的虚假信息检测研究,我们发布了MALINT数据集,并附上每个标注步骤的注释。
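Intent-based inoculation is, at its core, detection reasoning augmented with an explicit intent-analysis step; one plausible zero-shot prompt shape is sketched below. The wording is an illustrative assumption, not the paper's actual prompt.

```python
# A hypothetical intent-augmented prompt: the LLM analyzes malicious intent
# before issuing a disinformation verdict. The template text is an assumption.
INOCULATION_PROMPT = """You are a disinformation analyst.
Step 1 (intent analysis): Does the text show signs of malicious intent,
e.g. discrediting a group, sowing fear, or financial manipulation? Explain briefly.
Step 2 (verdict): Given your intent analysis, label the text as
DISINFORMATION or NOT_DISINFORMATION.

Text: {text}
"""

def build_prompt(text: str) -> str:
    return INOCULATION_PROMPT.format(text=text)
```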
cs.CL / 56 / 2603.14563

Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

多语言微型故事:用于训练小型语言模型的印度儿童故事合成组合语料库
Halder, Deepon, Mukherjee, Angira
Abstract
The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
Chinese Translation
低资源语言的强大语言模型的开发常常受到高质量、连贯且适合领域的训练语料稀缺的制约。本文介绍了多语言微型故事数据集,这是一个大规模合成生成的儿童故事集合,涵盖17种印度语言。该语料库专为小型语言模型(Small Language Models, SLMs)的训练和评估而设计,提供简单、以叙事为驱动的文本,严格本地化为母语书写形式。我们详细描述了我们的混合策划流程,该流程利用Sarvam-M语言模型和一种新颖的组合提示工程框架进行本地生成,并结合Google翻译API进行大规模跨语言扩展。通过严格的程序过滤,我们编制了132,942个故事和超过9390万个标记,作为多语言语言建模和印度语言领域迁移学习的基础资源。
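The combinatorial prompt engineering step can be pictured as a Cartesian product over independent story attributes, yielding a large, diverse prompt space from a few seed lists. The attribute lists and template below are illustrative stand-ins for the paper's expert seeds.

```python
# Combinatorial prompt construction: crossing small seed lists yields
# many distinct, controllable story prompts. Lists are illustrative.
from itertools import product

characters = ["a curious elephant", "a brave little girl", "a clever crow"]
settings = ["a mango orchard", "a riverside village", "a busy railway station"]
morals = ["honesty", "sharing", "perseverance"]

prompts = [
    f"Write a simple children's story in Hindi about {c} in {s}, teaching {m}."
    for c, s, m in product(characters, settings, morals)
]
print(len(prompts))  # 3 * 3 * 3 = 27 distinct prompts from 9 seed items
```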
cs.CL / 57 / 2603.14567

Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes

Top-b:自回归语言过程中的相对概率带的熵调节
Halder, Deepon, Dabre, Raj
Abstract
Probabilistic language generators are theoretically modeled as discrete stochastic processes, yet standard decoding strategies (Top-k, Top-p) impose static truncation rules that fail to accommodate the dynamic information density of natural language. This misalignment often forces a suboptimal trade-off: static bounds are either too restrictive for high-entropy creative generation or too permissive for low-entropy logical reasoning. In this work, we formalize the generation process as a trajectory through a relative probability manifold. We introduce Top-b (Adaptive Relative Band Sampling), a decoding strategy that regulates the candidate set via a dynamic bandwidth coefficient coupled strictly to the instantaneous Shannon entropy of the model's distribution. We provide a theoretical framework demonstrating that Top-b acts as a variance-minimizing operator on the tail distribution. Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating control system for autoregressive generation.
Chinese Translation
概率语言生成器在理论上被建模为离散随机过程,但标准解码策略(Top-k、Top-p)施加的静态截断规则未能适应自然语言的动态信息密度。这种不匹配常常迫使我们做出次优的权衡:静态界限要么对高熵的创造性生成过于严格,要么对低熵的逻辑推理过于宽松。在本研究中,我们将生成过程形式化为相对概率流形上的轨迹。我们引入了Top-b(自适应相对带采样),这是一种通过与模型分布的瞬时香农熵严格耦合的动态带宽系数来调节候选集的解码策略。我们提供了一个理论框架,证明Top-b作为尾部分布的方差最小化算子。对GPQA和GSM8K基准的实证验证表明,Top-b显著降低了生成熵和解码间方差,同时保持了竞争性的推理准确性,有效地近似了自回归生成的自我调节控制系统。
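Under one plausible reading of the abstract, Top-b keeps tokens whose probability lies within a relative band of the step's maximum, with the band width driven by the normalized Shannon entropy of the distribution. The coupling function below (restrictive at low entropy, permissive at high entropy) is an assumption rather than the paper's exact rule.

```python
# Entropy-coupled relative-band sampling, one possible reading of Top-b.
import math
import torch

def top_b_filter(logits: torch.Tensor, b_min: float = 0.05, b_max: float = 0.5) -> torch.Tensor:
    """Return a renormalized distribution restricted to the Top-b band (1-D logits)."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    h = float(entropy) / math.log(probs.numel())   # normalized entropy in [0, 1]
    b = b_max - (b_max - b_min) * h                # restrictive when entropy is low
    keep = probs >= b * probs.max()                # relative probability band
    filtered = torch.where(keep, probs, torch.zeros_like(probs))
    return filtered / filtered.sum()

next_token = torch.multinomial(top_b_filter(torch.randn(50_257)), num_samples=1)
```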
cs.CL / 58 / 2603.14593

Parameter-Efficient Quality Estimation via Frozen Recursive Models

通过冻结递归模型实现参数高效的质量评估
Abubacar, Umar, Bauer, Roman, Kanojia, Diptesh
Abstract
Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low-resource languages using a three-phase methodology. Experiments on $8$ language pairs on a low-resource QE dataset reveal three findings. First, TRM's recursive mechanisms do not transfer to QE: external iteration hurts performance, and internal recursion offers only narrow benefits. Second, representation quality dominates architectural choices. Third, frozen pretrained embeddings match fine-tuned performance while reducing trainable parameters by 37$\times$ (7M vs 262M). TRM-QE with frozen XLM-R embeddings achieves a Spearman's correlation of 0.370, matching fine-tuned variants (0.369) and outperforming an equivalent-depth standard transformer (0.336). On Hindi and Tamil, frozen TRM-QE outperforms MonoTransQuest (560M parameters) with 80$\times$ fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. We release the code publicly for further research. Code is available at https://github.com/surrey-nlp/TRMQE.
Chinese Translation
微型递归模型(Tiny Recursive Models,TRM)通过对共享网络的迭代优化在推理任务上取得了优异的效果。我们采用三阶段的方法研究这些递归机制是否可以转移到低资源语言的质量评估(Quality Estimation,QE)中。在一个低资源的QE数据集上,对8对语言的实验揭示了三个发现。首先,TRM的递归机制并不适用于QE。外部迭代会降低性能,而内部递归只提供有限的益处。其次,表示质量在架构选择中占主导地位。最后,冻结的预训练嵌入与微调后的性能相匹配,同时将可训练参数减少了37倍(7M 对比 262M)。使用冻结的 XLM-R 嵌入的 TRM-QE 达到了0.370的斯皮尔曼相关系数,匹配微调变体(0.369),并且超越了等深度的标准变换器(0.336)。在印地语和泰米尔语中,冻结的 TRM-QE 在可训练参数少80倍的情况下超越了 MonoTransQuest(560M参数),这表明权重共享结合冻结嵌入可以实现质量评估的参数效率。我们公开发布了代码以供进一步研究。代码可以在 https://github.com/surrey-nlp/TRMQE 获取。
cs.CL / 59 / 2603.14602

$PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought

$PA^3$: 基于链式思维的政策感知智能体对齐
Dipta, Shubhashis Roy, Bis, Daniel, Zhou, Kun, Wang, Lichao, Yao, Benjamin Z., Guo, Chenlei, Sarikaya, Ruhi
Abstract
Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
Chinese Translation
由大型语言模型(LLMs)驱动的对话助手在工具使用任务中表现出色,但在遵循复杂的业务特定规则方面却存在困难。尽管模型可以对上下文中提供的业务规则进行推理,但为每个查询引入所有政策会导致高延迟并浪费计算资源。此外,这些冗长的提示会导致长上下文,由于“针在干草堆中”问题,损害整体性能。为了解决这些挑战,我们提出了一种多阶段对齐方法,该方法教会模型在推理时的链式思维推理过程中回忆和应用相关的业务政策,而无需在上下文中包含完整的业务政策。此外,我们引入了一种基于Jaccard得分的新颖的PolicyRecall奖励以及用于GRPO训练的幻觉惩罚。总体而言,我们的最佳模型比基线提高了16个点,并且在使用40%更少的词汇的情况下,超越了相似模型规模的可比上下文基线3个点。
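The PolicyRecall reward has a simple closed form based on the Jaccard score; a sketch follows, where the extraction of recalled policy IDs from the chain of thought is assumed to happen upstream.

```python
# Jaccard-based PolicyRecall reward: compare the policy IDs the model
# recalls in its reasoning against the gold set for the query. The ID
# extraction step is an illustrative assumption.
def policy_recall_reward(recalled: set[str], gold: set[str]) -> float:
    """Jaccard similarity between recalled and gold policy sets."""
    if not recalled and not gold:
        return 1.0
    return len(recalled & gold) / len(recalled | gold)

reward = policy_recall_reward({"refund_policy", "id_check"}, {"refund_policy"})
print(reward)  # 0.5: one correct recall, one spurious one
```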
cs.CL / 60 / 2603.14672

Seamless Deception: Larger Language Models Are Better Knowledge Concealers

无缝欺骗:更大语言模型更擅长隐瞒知识
Ashok, Dhananjay, Armstrong, Ruth-Ann, May, Jonathan
Abstract
Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.
Chinese Translation
语言模型(LMs)可能会获取有害知识,但在审计时却假装对此类主题无知。受到最近发现的语言模型中与欺骗相关的行为模式的启发,我们旨在训练分类器,以检测语言模型何时主动隐瞒知识。对较小模型的初步研究结果表明,分类器能够比人类评估者更可靠地检测隐瞒行为,其中基于梯度的隐瞒比基于提示的方法更易于识别。然而,与之前的研究相反,我们发现分类器在未见过的模型架构和隐藏知识主题上并不能可靠地泛化。更令人担忧的是,随着模型规模的增加,与隐瞒相关的可识别痕迹变得越来越微弱,对于任何超过700亿参数的模型,分类器的表现甚至不如随机猜测。我们的结果揭示了仅依赖黑箱审计语言模型的一个关键局限性,并强调了开发强大方法以检测主动隐藏其所包含知识的模型的必要性。
cs.CL / 61 / 2603.14674

Computational Analysis of Semantic Connections Between Herman Melville Reading and Writing

赫尔曼·梅尔维尔阅读与写作之间语义联系的计算分析
Habib, Nudrat, Smith, Elisa Barney, Smith, Steven Olsen
Abstract
This study investigates the potential influence of Herman Melville's reading on his own writings through computational semantic similarity analysis. Using documented records of books known to have been owned or read by Melville, we compare selected passages from his works with texts from his library. The methodology involves segmenting texts at both the sentence level and the non-overlapping 5-gram level, followed by similarity computation using BERTScore. Rather than applying fixed thresholds to determine reuse, we interpret precision, recall, and F1 scores as indicators of possible semantic alignment that may suggest literary influence. Experimental results demonstrate that the approach successfully captures expert-identified instances of similarity and highlights additional passages warranting further qualitative examination. The findings suggest that semantic similarity methods provide a useful computational framework for supporting source and influence studies in literary scholarship.
Chinese Translation
本研究通过计算语义相似性分析,探讨赫尔曼·梅尔维尔的阅读对其自身写作的潜在影响。我们使用梅尔维尔已知拥有或阅读的书籍的文献记录,将其作品中的选段与其图书馆中的文本进行比较。该方法涉及在句子级别和不重叠的5-gram级别对文本进行分段,随后使用BERTScore进行相似性计算。我们并未应用固定阈值来确定重用,而是将精确度、召回率和F1分数解释为可能的语义对齐指标,这可能暗示文学影响。实验结果表明,该方法成功捕捉到专家识别的相似性实例,并突出了其他值得进一步定性研究的段落。研究结果表明,语义相似性方法为支持文学研究中的来源和影响研究提供了有用的计算框架。
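A minimal sketch of the segmentation-plus-BERTScore pipeline, assuming the public bert-score package. The passage, the candidate source sentence, and the choice to score non-overlapping 5-grams against a single sentence are illustrative.

```python
# Segment a Melville passage into non-overlapping 5-grams and read the
# BERTScore F1 as soft evidence of semantic alignment, not thresholded reuse.
from bert_score import score

def five_grams(text: str):
    words = text.split()
    return [" ".join(words[i:i + 5]) for i in range(0, len(words) - 4, 5)]

melville_segments = five_grams(
    "And the great shroud of the sea rolled on as it rolled five thousand years ago")
source_sentence = "The ocean rolled on, unchanged through five thousand years"

P, R, F1 = score(melville_segments,
                 [source_sentence] * len(melville_segments),
                 lang="en", verbose=False)
for seg, f in zip(melville_segments, F1.tolist()):
    print(f"{f:.3f}  {seg}")
```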
cs.CL / 62 / 2603.14712

Towards Next-Generation LLM Training: From the Data-Centric Perspective

面向下一代大语言模型训练:数据中心视角
Liang, Hao, Zhao, Zhengyang, Han, Zhaoyang, Qiang, Meiyi, Ma, Xiaochen, Zeng, Bohan, Cai, Qifeng, Li, Zhiyu, Tang, Linpeng, E, Weinan, Zhang, Wentao
Abstract
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of the massive datasets required for LLM training remain major bottlenecks. In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing data scientists from repetitive and error-prone engineering efforts. Moreover, once collected, datasets are often consumed largely in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout the training process, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.
Chinese Translation
大型语言模型(LLMs)在广泛的任务和领域中展现了卓越的性能,其中数据在推动这些进展中发挥了核心作用。尽管取得了这样的成功,准备和有效利用大规模数据集以进行LLM训练仍然是主要瓶颈。在当前的实践中,LLM训练数据通常是通过临时脚本构建的,且仍缺乏成熟的基于代理的数据准备系统,这些系统能够自动构建稳健且可重用的数据工作流程,从而使数据科学家免于重复且容易出错的工程工作。此外,一旦收集到数据集,通常在训练过程中会几乎全部使用,而缺乏系统的数据选择、混合优化或重加权机制。为了解决这些局限性,我们倡导两个互补的研究方向。首先,我们建议构建一个稳健的基于代理的自动数据准备系统,支持自动化工作流程构建和可扩展的数据管理。其次,我们主张建立一个统一的数据-模型交互训练系统,在该系统中,数据在整个训练过程中动态选择、混合和重加权,从而实现更高效、自适应和性能感知的数据利用。最后,我们讨论了剩余的挑战,并概述了未来研究和系统开发的有希望方向。
cs.CL / 63 / 2603.14723

Beyond Creed: A Non-Identity Safety Condition, a Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

超越信条:一种非身份安全条件——低数据 LoRA 微调中身份框架的强实证替代方案
Zhang, Xinran
Abstract
How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > baseline$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.
Chinese Translation
安全监督的书写方式可能比其包含的显性身份内容更为重要。我们研究了低数据 LoRA 安全微调,采用四种监督格式,这些格式均基于相同的核心安全规则:宪法规则(A)、信条式身份框架(B)、一种与信条匹配的条件,带有世界观/信仰身份维持尾部(C),以及一个匹配的非身份条件(D)。在三个经过指令调优的模型系列(Llama 3.1 8B、Qwen2.5 7B 和 Gemma 3 4B)中,我们使用结合了 Bedrock 托管的 DeepSeek v3.2 和 Sonnet 4.6 的调和双评估管道评估 HarmBench,并手动解决了分歧和边界案例。在包含 320 个行为的完整 HarmBench 集上,非身份条件 D 在所有三个模型系列中表现最强,Llama 达到 74.4% 的拒绝率,Gemma 达到 76.9%,Qwen 达到 74.1%。相比之下,信条式框架(B)在 Llama 和 Gemma 上优于普通宪法规则(A),但仍显著低于 D,整体描述顺序为 $D > B > C \geq A > baseline$。这对身份框架假说的强版本提供了一个有限的实证挑战:显性信条式身份语言并不是这里观察到的最强增益所必需的。在 MMLU 和 ARC-Challenge 上的能力评估显示,各条件之间没有显著的权衡。
cs.CL / 64 / 2603.14755

Learning Constituent Headedness

学习成分主导性
Qi, Zeyao, Chen, Yige, Lim, KyungTae, Pan, Haihua, Park, Jungyeul
Abstract
Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly and most processing pipelines recover it procedurally via percolation rules. We treat this notion of constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, inducing supervision by defining each constituent head as the dependency span head. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the induced binary training targets being largely equivalent across head choices, while increasing the fidelity of deterministic constituency-to-dependency conversion and transferring across resources and languages under simple label-mapping interfaces.
Chinese Translation
主导性在句法分析中被广泛用作组织工具,然而成分树库很少明确编码这一点,大多数处理流程通过渗透规则程序性地恢复它。我们将成分主导性的概念视为一个明确的表征层,并将其作为一个监督预测任务进行学习,基于对齐的成分和依赖注释,通过将每个成分头定义为依赖跨度头来引入监督。在对齐的英语和中文数据上,所得到的模型实现了接近极限的内在准确性,并显著超越了基于Collins风格规则的渗透。预测的头在基于头驱动的二元化下产生了可比的解析准确性,这与诱导的二元训练目标在头选择上大致等价,同时提高了确定性成分到依赖转换的准确性,并在简单标签映射接口下跨资源和语言进行转移。
cs.CL / 65 / 2603.14756

Towards Privacy-Preserving Machine Translation at the Inference Stage: A New Task and Benchmark

面向推理阶段的隐私保护机器翻译:一项新任务及基准
Shao, Wei, Liu, Lemao, Li, Yinqiao, Huang, Guoping, Shi, Shuming, Song, Linqi
Abstract
Current online translation services require sending user text to cloud servers, posing a risk of privacy leakage when the text contains sensitive information. This risk hinders the application of online translation services in privacy-sensitive scenarios. One way to mitigate this risk is to introduce privacy protection mechanisms targeting the inference stage of translation models. However, compared to subfields of NLP like text classification and summarization, the machine translation research community has explored privacy protection during the inference stage only to a limited extent. The field lacks a clearly defined privacy protection task for the inference stage, dedicated evaluation datasets and metrics, and reference benchmark methods, and this absence has seriously constrained researchers' in-depth exploration of this direction. To bridge this gap, this paper proposes a novel "Privacy-Preserving Machine Translation" (PPMT) task, aiming to protect the private information in text during the model inference stage. For this task, we constructed three benchmark test datasets, designed corresponding evaluation metrics, and proposed a series of benchmark methods as a starting point for this task. The definition of privacy is complex and diverse. Considering that named entities often contain a large amount of personal privacy and commercial secrets, we have focused our research on protecting only the named entities' privacy in the text. We expect this research work will provide a new perspective and a solid foundation for the privacy protection problem in machine translation.
Chinese Translation
当前的在线翻译服务需要将用户文本发送到云服务器,这在文本包含敏感信息时会带来隐私泄露的风险。这一风险阻碍了在线翻译服务在隐私敏感场景中的应用。缓解这一风险的一种方法是为翻译模型的推理阶段引入隐私保护机制。然而,与文本分类和摘要等自然语言处理(NLP)子领域相比,机器翻译研究社区在推理阶段的隐私保护方面的探索有限。目前尚未明确界定推理阶段的隐私保护任务,也缺乏专门的评估数据集和指标,以及参考基准方法。这些要素的缺失严重限制了研究人员对这一方向的深入探索。为了填补这一空白,本文提出了一项新颖的“隐私保护机器翻译”(Privacy-Preserving Machine Translation, PPMT)任务,旨在保护模型推理阶段文本中的私人信息。针对该任务,我们构建了三个基准测试数据集,设计了相应的评估指标,并提出了一系列基准方法作为该任务的起点。隐私的定义复杂且多样。考虑到命名实体通常包含大量个人隐私和商业秘密,我们的研究专注于仅保护文本中命名实体的隐私。我们期望这项研究工作能够为机器翻译中的隐私保护问题提供新的视角和坚实的基础。
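The abstract does not detail its benchmark methods, but the task naturally suggests a mask-translate-restore baseline: named entities are replaced with placeholders locally before the text is sent to the cloud, and restored afterwards. The sketch below uses spaCy's small English NER model; the placeholder scheme is an assumption.

```python
# A hypothetical mask-translate-restore baseline for the PPMT setting.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(text: str):
    doc = nlp(text)
    mapping, masked = {}, text
    for i, ent in enumerate(doc.ents):
        placeholder = f"<ENT{i}>"
        mapping[placeholder] = ent.text
        masked = masked.replace(ent.text, placeholder, 1)
    return masked, mapping

def restore_entities(translated: str, mapping: dict) -> str:
    for placeholder, original in mapping.items():
        translated = translated.replace(placeholder, original)
    return translated

masked, mapping = mask_entities("Alice Johnson met the Acme Corp board in Berlin.")
print(masked)  # e.g. "<ENT0> met the <ENT1> board in <ENT2>."
# ... send `masked` to the cloud translator, then call restore_entities() locally.
```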
cs.CL / 66 / 2603.14779

Vietnamese Automatic Speech Recognition: A Revisit

越南语自动语音识别:再探
Vu, Thi, Nguyen, Linh The, Nguyen, Dat Quoc
Abstract
Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at https://github.com/qualcomm-ai-research/PhoASR.
Chinese Translation
自动语音识别(ASR)的性能在很大程度上依赖于大规模、高质量数据集的可用性。对于低资源语言,现有的开源ASR数据集往往存在质量不足和标注不一致的问题,阻碍了稳健模型的发展。为了解决这些挑战,我们提出了一种新颖且具有普适性的数据聚合和预处理管道,旨在从多样化的、潜在含噪的开源来源构建高质量的ASR数据集。我们的管道包含严格的处理步骤,以确保数据的多样性、平衡性,并纳入诸如词级时间戳等关键特征。我们通过将该方法应用于越南语,展示了其有效性,最终构建了一个统一的高质量500小时数据集,为训练和评估最先进的越南语ASR系统提供了基础。我们的项目页面可访问 https://github.com/qualcomm-ai-research/PhoASR。
cs.CL / 67 / 2603.14782

Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

语言变体中的信息不对称:以粤语-普通话和巴伐利亚德语问答为案例研究
Pei, Renhao, Peng, Siyao, Blaschke, Verena, Litschko, Robert, Plank, Barbara
Abstract
Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on a local Wikipedia page, which is absent from their higher-resource counterparts-covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical, geographic annotations, and stratified evaluations reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.
Chinese Translation
大型语言模型(LLMs)正成为人们寻求知识的常见方式,但它们的覆盖范围和可靠性差异很大。尤其对于地方语言变体,存在较大的不对称性,例如,地方维基百科中的信息在标准变体中缺失。然而,目前对于LLMs在这种信息不对称下的表现知之甚少,尤其是在密切相关的语言之间。我们手动构建了一个新的挑战性问答(QA)数据集,捕捉了地方维基百科页面上传达的知识,而这些知识在其高资源对应版本中缺失——涵盖普通话与粤语以及德语与巴伐利亚德语。我们的实验表明,LLMs无法回答仅在地方维基百科版本中存在的信息问题。提供来自引言部分的上下文显著提高了性能,通过翻译还可能获得进一步的提升。我们的主题、地理注释和分层评估揭示了地方维基百科版本作为区域和全球信息来源的有效性。这些发现引发了关于LLMs的包容性和文化覆盖的重要问题。
cs.CL / 68 / 2603.14838

The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

意识形态话语在检索增强生成中的影响:以COVID-19治疗为案例研究
Salari, Elmira, Delfino, Maria Claudia Nunes, Amamou, Hazem, de Souza, José Victor, Kshirsagar, Shruti, Davoust, Alan, Avila, Anderson
Abstract
This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked to answer questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; and the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs' responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs' responses based on retrieved ideological texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs' outputs. Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
Chinese Translation
本论文研究了检索到的意识形态文本对大型语言模型(LLMs)输出的影响。尽管近年来对理解LLMs中的意识形态的兴趣有所增加,但在检索增强生成(RAG)的背景下对此问题关注较少。为填补这一空白,我们设计了一个基于关于COVID-19治疗的意识形态文本的外部知识源。我们的语料库基于1,117篇学术文章,代表了关于该疾病争议性和被认可治疗的讨论。我们提出了一种基于词汇多维分析(Lexical Multidimensional Analysis, LMDA)的语料库语言学框架,以识别语料库中的意识形态。LLMs被要求回答源自三个已识别意识形态维度的问题,并采用两种类型的上下文提示:第一种包括用户问题和意识形态文本;第二种包含问题、意识形态文本和LMDA描述。通过余弦相似度评估参考意识形态文本与LLMs响应之间的意识形态一致性,涉及词汇和语义表示。结果表明,基于检索到的意识形态文本的LLMs响应与外部知识中遇到的意识形态更为一致,增强的提示进一步影响了LLMs的输出。我们的研究结果强调了在RAG框架内识别意识形态话语的重要性,以减轻不仅是无意的意识形态偏见,还有恶意操控此类模型的风险。
cs.CL / 69 / 2603.14843

ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

ContiGuard:针对不断演变的规避扰动的持续毒性检测框架
Kang, Hankun, Miao, Xin, Chen, Jianhao, Wen, Jintao, Xu, Mayi, Zhang, Weiyu, Lu, Wenpeng, Qian, Tieyun
Abstract
Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noises distort semantics to degrade comprehension and also impair critical feature learning to render detection sensitive to perturbations. These amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update capability and maintain sustained resilience against evolving perturbations. Specifically, to boost the comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meaning and toxicity-related clues excavated by LLM into the perturbed text to improve the comprehension. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection...
Chinese Translation
毒性检测有助于减缓有害内容(例如,仇恨评论、帖子和在线社交活动中的消息)的传播,以维护健康的在线社交环境。然而,恶意用户持续开发规避扰动以伪装有毒内容并逃避检测器。传统的检测器或方法随着时间的推移是静态的,无法有效应对这些不断演变的规避策略。因此,持续学习成为一种合理的方法,以动态更新对演变扰动的检测能力。然而,扰动之间的差异阻碍了检测器在扰动文本上的持续学习。更重要的是,扰动引起的噪声扭曲了语义,降低了理解力,还影响了关键特征的学习,使得检测对扰动敏感。这加大了针对不断演变的扰动进行持续学习的挑战。在本研究中,我们提出了ContiGuard,这是第一个专门为检测器的持续学习而设计的框架,针对时间演变的扰动文本(称为持续毒性检测),使得检测器能够持续更新能力,保持对不断演变的扰动的持续抗性。具体而言,为了增强理解能力,我们提出了一种基于大型语言模型(LLM)的语义增强策略,通过动态地将由LLM挖掘的可能含义和与毒性相关的线索纳入扰动文本,以改善理解。为了减小非关键特征的影响并增强关键特征,我们提出了一种基于可分辨性的特征学习策略,通过加强区分性特征并抑制非区分性特征,以形成一个强健的分类边界来进行检测。
cs.CL / 70 / 2603.14864

Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

购物助手:一种用于现实世界电子商务任务的记忆增强大型语言模型代理
Yu, Zijian, Xiao, Kejun, Zhao, Huaipeng, Luo, Tao, Zeng, Xiaoyi
Abstract
In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
Chinese Translation
在电子商务中,大型语言模型(LLM)代理在推荐、预算和捆绑交易等购物任务中展现出潜力,其中准确捕捉用户在长期对话中的偏好至关重要。然而,有两个挑战阻碍了这一潜力的实现:(1)缺乏评估长期偏好感知购物任务的基准;(2)由于现有设计将偏好识别和购物辅助视为独立组件,导致缺乏端到端优化。本文介绍了一种新颖的基准,采用长期记忆设置,涵盖了两个购物任务,涉及120万个真实产品,并提出了购物助手(Shopping Companion),一个统一框架,联合处理记忆检索和购物辅助,同时支持用户干预。为了训练这些能力,我们开发了一种双重奖励强化学习策略,使用工具级奖励来处理多轮交互中固有的稀疏和不连续奖励。实验结果表明,即使是最先进的模型(如GPT-5)在我们的基准上成功率也低于70%,突显了该领域的重大挑战。值得注意的是,我们的轻量级LLM在购物助手的训练下,始终优于强基线,取得了更好的偏好捕捉和任务表现,验证了我们统一设计的有效性。
cs.CL / 71 / 2603.14873

Developing an English-Efik Corpus and Machine Translation System for Digitization Inclusion

开发英语-埃菲克语语料库及机器翻译系统以促进数字化包容性
Edet, Offiong Bassey, Awak, Mbuotidem Sunday, Oyo-Ita, Emmanuel, Nyong, Benjamin Okon, Bassey, Ita Etim
Abstract
Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English-Efik translation, leveraging a small-scale, community-curated parallel corpus of 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English-Efik and 31.21 for Efik-English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
Chinese Translation
低资源语言作为人类历史的宝贵载体,保存着文化和智力的多样性。尽管它们的重要性不容忽视,但在现代自然语言处理系统中仍然大多缺席。虽然在斯瓦希里语、约鲁巴语和阿姆哈拉语等广泛使用的非洲语言上取得了一定进展,但像埃菲克语这样的较小土著语言在机器翻译研究中仍然代表性不足。本研究评估了最先进的多语言神经机器翻译模型在英语-埃菲克语翻译中的有效性,利用一个由社区策划的小规模平行语料库,包含13,865对句子。我们对mT5多语言模型和NLLB-200模型进行了微调。NLLB-200的表现优于mT5,在英语-埃菲克语的BLEU分数为26.64,埃菲克语-英语的BLEU分数为31.21,对应的chrF分数为51.04和47.92,表明流畅性和语义忠实度有所提高。我们的研究结果表明,为低资源语言开发实用的机器翻译工具是可行的,并强调了包容性数据实践和文化基础评估在推动公平自然语言处理中的重要性。
cs.CL / 72 / 2603.14891

Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models

基于决策层次的有序建模用于大型语言模型的多模态作文评分
Zhang, Han, Su, Jiamin, Liu, Li
Abstract
Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
Chinese Translation
自动化作文评分(AES)为每篇作文预测多个基于评分标准的特征分数,其中每个特征遵循有序离散评分尺度。大多数基于大型语言模型(LLM)的AES方法将评分视为自回归标记生成,并通过解码和解析获得最终分数,使得决策变得隐性。这种表述在多模态AES中特别敏感,因为视觉输入在不同作文和特征中的有效性各不相同。为了解决这些局限性,我们提出了决策层次有序建模(DLOM),通过重用语言模型的头部提取预定义评分标记上的分数逻辑,使得评分成为一个显式的有序决策,从而实现直接的优化和分析。在多模态AES中,DLOM-GF引入了一个门控融合模块,能够自适应地结合文本和多模态的分数逻辑。对于仅文本的AES,DLOM-DA添加了一个距离感知的正则化项,以更好地反映有序距离。在多模态EssayJudge数据集上的实验表明,DLOM在各评分特征上优于基于生成的SFT基线,并且在模态相关性异质时,DLOM-GF进一步提升了性能。在仅文本的ASAP/ASAP++基准测试中,DLOM在没有视觉输入的情况下依然有效,而DLOM-DA进一步提高了性能,并超越了强有力的代表性基线。
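The core DLOM move, reusing the LM head but reading out only the score-token logits, can be sketched in a few lines. The token IDs and the 1-6 scale below are hypothetical, and DLOM-GF would gate two such score-logit vectors (textual and multimodal) before the softmax.

```python
# Explicit ordinal decision in score space: keep only the logits of the
# predefined score tokens and softmax over them.
import torch

def score_distribution(lm_head_logits: torch.Tensor,
                       score_token_ids: list[int]) -> torch.Tensor:
    """lm_head_logits: (vocab,) logits from the reused LM head at the scoring position."""
    score_logits = lm_head_logits[score_token_ids]  # one logit per ordinal score level
    return torch.softmax(score_logits, dim=-1)      # explicit decision in score space

vocab_logits = torch.randn(32_000)                  # stand-in for real LM-head output
score_ids = [16, 17, 18, 19, 20, 21]                # hypothetical IDs for tokens "1".."6"
probs = score_distribution(vocab_logits, score_ids)
expected_score = (probs * torch.arange(1, 7, dtype=probs.dtype)).sum()
print(float(expected_score))
```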
cs.CL / 73 / 2603.14893

LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

大型语言模型作为信号探测器:灵敏度、偏差与温度标准类比
Cacioli, Jon-Paul
Abstract
Large language models (LLMs) are evaluated for calibration using metrics such as Expected Calibration Error that conflate two distinct components: the model's ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework - unequal-variance model fitting, criterion estimation, z-ROC analysis - has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown with temperature simultaneously increasing sensitivity (AUC) and shifting criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.
Chinese Translation
大型语言模型(LLMs)的校准评估使用了如期望校准误差(Expected Calibration Error)等指标,这些指标混合了两个不同的组成部分:模型区分正确与错误答案的能力(灵敏度)和其倾向于自信或谨慎回答的特性(偏差)。信号检测理论(Signal Detection Theory, SDT)对这些组成部分进行了分解。尽管基于SDT的指标如曲线下面积(AUROC)越来越多地被使用,但完整的参数框架——不等方差模型拟合、标准估计、z-ROC分析——尚未应用于将LLMs视为信号探测器的研究。在这项预注册的研究中,我们将三种LLMs视为观察者,在168,000次试验中进行事实区分,并测试温度是否作为标准偏移,类似于人类心理物理学中的收益操控。关键是,这种类比可能会失效,因为温度不仅改变了生成的答案本身,还改变了对其的信心。我们的结果确认了这种失效,温度同时提高了灵敏度(AUC)并改变了标准。所有模型都表现出不等方差的证据分布(z-ROC斜率为0.52-0.84),其中指导模型显示出比基础模型(0.77-0.87)或人类识别记忆(约0.80)更极端的非对称性(0.52-0.63)。SDT分解揭示了在灵敏度-偏差空间中占据不同位置的模型无法仅通过校准指标区分,表明完整的参数框架提供了现有指标无法获得的诊断信息。
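For reference, the standard equal-variance SDT decomposition that separates sensitivity from bias is a two-liner; the paper fits the richer unequal-variance model, which this sketch does not.

```python
# Equal-variance SDT: sensitivity d' and criterion c from hit and
# false-alarm rates on "is this statement true?" decisions.
from scipy.stats import norm

def sdt_measures(hit_rate: float, fa_rate: float):
    z_h, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_h - z_fa              # sensitivity: discrimination ability
    criterion = -0.5 * (z_h + z_fa)   # bias: positive = conservative responding
    return d_prime, criterion

d, c = sdt_measures(hit_rate=0.85, fa_rate=0.20)
print(f"d' = {d:.2f}, c = {c:.2f}")   # d' ~ 1.88, c ~ -0.10
```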
cs.CL / 74 / 2603.14903

ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

ExPosST:基于适应性掩蔽的显式定位用于大语言模型的同步机器翻译
Shang, Yuzhe, Gao, Pengzhi, Yang, Yazheng, Ma, Jiayao, Liu, Wei, Luan, Jian, Su, Jingsong
Abstract
Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
Chinese Translation
大型语言模型(LLMs)最近在同步机器翻译(SimulMT)中展示了良好的性能。然而,将仅解码器的LLMs应用于SimulMT会引入位置不匹配,从而导致解码效率与位置一致性之间的困境。现有方法通常依赖于特定的位置编码或精心设计的提示方案,因此未能同时实现推理效率、位置一致性和广泛的模型兼容性。在本研究中,我们提出了ExPosST,一个通过显式位置分配来解决这一困境的通用框架。ExPosST为传入的源标记保留固定的位置槽,从而在不同的位置编码方法中实现高效解码,并利用KV缓存。为了进一步弥合微调与推理之间的差距,我们引入了一种策略一致的微调策略,使训练与推理时的解码行为保持一致。跨多个语言对的实验表明,ExPosST有效支持在多样化策略下的同步翻译。
cs.CL / 75 / 2603.14987

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

超越基准岛屿:面向代理人工智能的代表性可信度评估
Qi, Jinhu, Li, Yifan, Zhao, Minghao, Zhang, Wentao, Zhang, Zijian, Li, Yaoman, King, Irwin
Abstract
As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent's trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity -- particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at https://github.com/TonyQJH/haaf-pilot.
Chinese Translation
随着代理人工智能系统从静态问答向开放式、工具增强和多步骤的现实工作流程发展,其增强的权威性带来了更大的系统误用和操作失败风险。然而,目前的评估实践仍然是碎片化的,仅在狭窄定义的环境中测量孤立的能力,如编码、幻觉、越狱抵抗或工具使用。我们认为,主要的限制不仅仅是评估维度覆盖不足,而是缺乏一个原则性的代表性概念:代理的可信度应在一个代表性的社会技术场景分布中进行评估,而不是在一系列不相关的基准实例中。为此,我们提出了全息代理评估框架(Holographic Agent Assessment Framework,HAAF),这是一个系统的评估范式,表征代理在跨越任务类型、工具接口、交互动态、社会背景和风险水平的场景流形上的可信度。该框架整合了四个互补组件:(i)静态认知和政策分析,(ii)交互沙盒模拟,(iii)社会伦理对齐评估,以及(iv)一个关注分布的代表性采样引擎,该引擎共同优化覆盖率和风险敏感性——特别是对于传统基准系统性忽视的稀有但高后果的尾部风险。这些组件通过一个迭代的可信优化工厂相连接。通过红队探测和蓝队加固的循环,这一范式逐步缩小脆弱性,以满足部署标准,将代理评估从基准岛屿转向代表性的现实世界可信度。示例实例的代码和数据可在 https://github.com/TonyQJH/haaf-pilot 获取。
cs.CL / 76 / 2603.14997

OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora

OrgForge:一个可验证的合成企业语料库的多智能体模拟框架
Flynt, Jeffrey
Abstract
Evaluating retrieval-augmented generation (RAG) pipelines requires corpora with knowable, temporally structured ground truth and cross-artifact properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across documents. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
Chinese Translation
评估检索增强生成(RAG)管道需要具备可知的真实情况、时间结构化以及跨文档属性的语料库,而这些特性在现实世界数据集中很少能够干净地提供。现有资源如安然(Enron)语料库存在法律模糊性、人口统计偏差,并且没有结构化的真实情况。纯粹由大语言模型(LLM)生成的合成数据解决了法律问题,但引入了一个更微妙的问题:生成模型无法防止在文档之间产生自相矛盾的事实。我们提出了OrgForge,一个开源的多智能体模拟框架,强制执行严格的物理-认知边界:一个确定性的Python引擎维护一个SimEvent真实情况总线;大型语言模型仅生成表面文本,受经过验证的提案约束。一个行动者本地时钟在所有文档类型中强制因果时间戳的正确性,消除了在每个文档独立采样时间戳时产生的时间线不一致性。我们形式化了三个图动态子系统(基于介数中心性的压力传播、时间边权衰减和Dijkstra升级路由),这些子系统独立于任何LLM来治理组织行为。运行一个可配置的N天模拟,OrgForge生成交错的Slack线程、JIRA票据、Confluence页面、Git拉取请求和电子邮件,所有这些都可以追溯到一个共享的、不可变的事件日志。我们还描述了一个因果链追踪子系统,该子系统为每个事件积累跨文档证据图,一个混合的互惠排名融合重复检测器用于识别重复的故障类别,以及一个进出邮件引擎,通过有条件的因果链路路由供应商警报、客户投诉和人力资源通信,并进行概率性丢失模拟。OrgForge在MIT许可证下提供。
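A sketch of the three graph-dynamic subsystems named in the abstract, using networkx. The strength/cost encoding of edges, the decay rate, and the toy organization are illustrative assumptions.

```python
# Three LLM-independent graph subsystems: stress via betweenness centrality,
# temporal edge-weight decay, and Dijkstra escalation routing.
import networkx as nx

G = nx.Graph()
# strength = how active a working relationship is; cost = 1 / strength.
edges = [("alice", "bob", 1.0), ("bob", "carol", 0.5),
         ("carol", "dave", 1.0), ("alice", "dave", 0.25)]
for u, v, s in edges:
    G.add_edge(u, v, strength=s, cost=1.0 / s)

# (1) Stress propagation: actors on many shortest paths absorb more stress.
stress = nx.betweenness_centrality(G, weight="cost")

# (2) Temporal edge-weight decay: relationship strengths fade each simulated day.
DECAY = 0.95
for _, _, data in G.edges(data=True):
    data["strength"] *= DECAY
    data["cost"] = 1.0 / data["strength"]

# (3) Dijkstra escalation routing: incidents escalate along the cheapest path.
route = nx.dijkstra_path(G, "alice", "carol", weight="cost")
print(stress, route)
```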
cs.CL / 77 / 2603.15005

Pretraining and Benchmarking Modern Encoders for Latvian

拉脱维亚语现代编码器的预训练与基准测试
Znotins, Arturs
Abstract
Encoder-only transformers remain essential for practical NLP tasks. While recent advances in multilingual models have improved cross-lingual capabilities, low-resource languages such as Latvian remain underrepresented in pretraining corpora, and few monolingual Latvian encoders currently exist. We address this gap by pretraining a suite of Latvian-specific encoders based on RoBERTa, DeBERTaV3, and ModernBERT architectures, including long-context variants, and evaluating them across a diverse set of Latvian diagnostic and linguistic benchmarks. Our models are competitive with existing monolingual and multilingual encoders while benefiting from recent architectural and efficiency advances. Our best model, lv-deberta-base (111M parameters), achieves the strongest overall performance, outperforming larger multilingual baselines and prior Latvian-specific encoders. We release all pretrained models and evaluation resources to support further research and practical applications in Latvian NLP.
Chinese Translation
仅编码器的变换器在实际自然语言处理任务中仍然至关重要。尽管多语言模型的最新进展提高了跨语言能力,但像拉脱维亚语这样的低资源语言在预训练语料库中仍然代表性不足,目前存在的单语拉脱维亚编码器也很少。我们通过基于RoBERTa、DeBERTaV3和ModernBERT架构的拉脱维亚特定编码器的预训练来填补这一空白,包括长上下文变体,并在一系列多样化的拉脱维亚诊断和语言基准上对其进行评估。我们的模型在与现有的单语和多语编码器的比较中表现出竞争力,同时受益于最新的架构和效率进展。我们的最佳模型lv-deberta-base(111M参数)在整体性能上表现最强,超越了更大规模的多语言基线和之前的拉脱维亚特定编码器。我们发布所有预训练模型和评估资源,以支持拉脱维亚自然语言处理领域的进一步研究和实际应用。
cs.CL / 78 / 2603.15031

Attention Residuals

注意力残差
Kimi Team, Chen, Guangyu, Zhang, Yu, Su, Jianlin, Xu, Weixin, Pan, Siyuan, Wang, Yaoyu, Wang, Yucheng, Chen, Guanduo, Yin, Bohong, Chen, Yutian, Yan, Junjie, Wei, Ming, Zhang, Y., Meng, Fanqing, Hong, Chao, Xie, Xiaotong, Liu, Shaowei, Lu, Enzhe, Tai, Yunpeng, Chen, Yanru, Men, Xin, Guo, Haiqing, Charles, Y., Lu, Haoyu, Sui, Lin, Zhu, Jinguo, Zhou, Zaida, He, Weiran, Huang, Weixiao, Xu, Xinran, Wang, Yuzhi, Lai, Guokun, Du, Yulun, Wu, Yuxin, Yang, Zhilin, Zhou, Xinyu
Abstract
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
Chinese Translation
预规范化的残差连接在现代大规模语言模型(LLMs)中是标准配置,然而它们以固定的单位权重累积所有层的输出。这种均匀聚合导致随着深度的增加,隐藏状态的无控制增长,逐渐稀释了每一层的贡献。我们提出了注意力残差(Attention Residuals,AttnRes),它用对前面各层输出的 Softmax 注意力替代了固定的累积方式,使得每一层能够选择性地以学习的、依赖输入的权重聚合早期表示。为了解决在大规模模型训练中对所有前面各层输出进行注意的内存和通信开销,我们引入了块注意力残差(Block AttnRes),它将层分割成块并在块级表示上进行注意,减少了内存占用,同时保留了完整 AttnRes 大部分的收益。结合基于缓存的管道通信和两阶段计算策略,Block AttnRes 成为标准残差连接的实用替代方案,且开销极小。缩放律实验确认这种改进在各种模型规模中保持一致,消融实验验证了基于内容的深度选择的好处。我们进一步将 AttnRes 集成到 Kimi 线性架构中(总计 48B / 激活参数 3B),并在 1.4T 令牌上进行预训练,其中 AttnRes 缓解了预规范化稀释,导致跨层更均匀的输出幅度和梯度分布,并提高了在所有评估任务上的下游性能。
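A simplified, single-position sketch of the core AttnRes operation: replacing the unit-weight residual sum over preceding layer outputs with learned, input-dependent softmax attention. The query/key projections and conditioning on the newest layer output are assumptions; Block AttnRes would attend over block-level summaries instead.

```python
# Selective aggregation over preceding layer outputs, one token position.
import torch
import torch.nn as nn

class AttnResAggregate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_prev_layers, dim) for one token position.
        query = self.q(layer_outputs[-1])                           # condition on newest output
        keys = self.k(layer_outputs)                                # (L, dim)
        weights = torch.softmax(keys @ query * self.scale, dim=0)   # (L,) learned weights
        return weights @ layer_outputs                              # selective aggregation

agg = AttnResAggregate(dim=64)
hidden = agg(torch.randn(12, 64))   # replaces the unit-weight residual sum
print(hidden.shape)                 # torch.Size([64])
```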
cs.CL / 79 / 2603.15034

Interpretable Predictability-Based AI Text Detection: A Replication Study

可解释的基于可预测性的人工智能文本检测:一项复制研究
Skurla, Adam, Macko, Dominik, Simko, Jakub
Abstract
This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model's decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves the results that are comparable to or better than language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.
Chinese Translation
本文复制并扩展了在AuTexTification 2023共享任务中用于机器生成文本作者归属的系统。首先,我们尝试重现原始结果。由于数据划分、模型可用性和实现细节的差异,完全复制是不可能的。接下来,我们测试了更新的多语言语言模型,并添加了26个文档级风格特征。我们还应用了SHAP分析,以检查哪些特征影响模型的决策。我们用更新的生成模型如Qwen和mGPT替换了原始的GPT-2模型,以计算概率特征。对于上下文表示,我们使用了mDeBERTa-v3-base,并对英语和西班牙语应用了相同的配置。这使我们能够为子任务1和子任务2使用一个共享配置。我们的实验表明,额外的风格特征在两个任务和两种语言中均提高了性能。多语言配置的结果与语言特定模型相当或更好。研究还表明,清晰的文档对于可靠的复制和系统的公平比较至关重要。
cs.CL / 80 / 2603.15051

Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

潜在思维:大语言模型中隐式推理的自适应锚点精炼
Sheshanarayana, Disha, Pal, Rajat Subhra, Sinha, Manjira, Dasgupta, Tirthankar
Abstract
Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.
Chinese Translation
基于标记的思维链(Chain-of-Thought, CoT)提示已成为在大型语言模型(Large Language Models, LLMs)中引导多步骤推理的标准方法,尤其是在数学文字问题上。然而,生成较长的中间推理过程会增加输出长度和推理成本,并且在模型能够在没有大量表述的情况下得出正确答案时,这种方法可能效率低下。这促使了潜在空间推理方法的出现,这些方法将计算转移到隐藏表示中,仅输出最终答案。然而,许多潜在推理方法在推理时依赖于固定数量的潜在精炼步骤,这增加了一个超参数,必须在不同模型和数据集之间进行调优,以平衡准确性和效率。我们提出了AdaAnchor,这是一种潜在推理框架,通过精炼附加到输入的潜在锚向量集来进行无声的迭代计算。AdaAnchor进一步结合了一种自适应停止机制,该机制监控锚点在迭代过程中的稳定性,并在锚点动态收敛后终止精炼,为较简单的实例分配较少的步骤,同时为较困难的实例保留额外的精炼步骤,且在共享的最大步骤预算下进行。我们在三个数学文字问题基准上的实证评估表明,带有自适应停止的AdaAnchor在固定步骤潜在精炼的基础上,准确性提高了多达5%,同时在相同的最大步骤预算下平均潜在精炼步骤减少了48-60%。与标准推理基线相比,AdaAnchor通过将计算转移到无声的潜在精炼中,实现了生成标记数量的大幅减少(92-93%),提供了一种不同的准确性与效率的权衡,显著降低了输出标记的使用。
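The adaptive halting mechanism can be sketched as a convergence-checked refinement loop under a shared step budget. Here `refine` stands in for one silent latent refinement pass, and the relative-change criterion and tolerance are assumptions.

```python
# Refine latent anchors until their iteration-to-iteration change falls
# below a stability threshold, within a shared maximum-step budget.
import torch

def refine_until_stable(anchors: torch.Tensor, refine, max_steps: int = 16,
                        tol: float = 1e-3) -> tuple[torch.Tensor, int]:
    for step in range(1, max_steps + 1):
        new_anchors = refine(anchors)
        delta = (new_anchors - anchors).norm() / anchors.norm().clamp_min(1e-8)
        anchors = new_anchors
        if delta < tol:                # anchor dynamics have converged
            break
    return anchors, step               # easier inputs halt in fewer steps

# Toy refinement: a contraction toward a fixed point converges quickly.
target = torch.randn(4, 128)
anchors, used = refine_until_stable(torch.zeros(4, 128),
                                    lambda a: 0.5 * a + 0.5 * target)
print(used)  # well under the 16-step budget for this easy instance
```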
cs.CL / 81 / 2603.15061

Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Writer-R1:通过记忆增强重放策略优化提升大语言模型的生成写作能力
Zhao, Jihao, Zu, Shuaishuai, Ji, Zhiyuan, Zhou, Chunlai, Qin, Biao
Abstract
As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.
Chinese Translation
作为一种典型的开放式生成任务,创意写作缺乏可验证的参考答案,这长期限制了奖励建模和自动评估,因为人类注释成本高、评估偏差以及反馈信号粗糙。为了解决这些挑战,本文首先基于扎根理论设计了一种多智能体协作工作流程,对问题进行维度分解和层次归纳,以动态生成可解释和可重用的细粒度标准。此外,我们提出了记忆增强重放策略优化(Memory-augmented Replay Policy Optimization, MRPO)算法:一方面,MRPO在不进行额外训练的情况下,引导模型基于动态标准进行自我反思,实现受控的迭代改进;另一方面,我们采用结合监督微调与强化学习的训练范式,将评估标准转化为奖励信号,实现端到端优化。实验结果表明,自动构建的标准在性能上与人类注释相当。采用该方法训练的Writer-R1-4B模型在多个创意写作任务中表现优于基线,并超越了一些参数超过100B的开源模型。
cs.CL / 82 / 2603.15094

Bridging National and International Legal Data: Two Projects Based on the Japanese Legal Standard XML Schema for Comparative Law Studies

连接国家与国际法律数据:基于日本法律标准XML模式的两个比较法研究项目
Nakamura, Makoto
Abstract
This paper presents an integrated framework for computational comparative law by connecting two consecutive research projects based on the Japanese Legal Standard (JLS) XML schema. The first project establishes structural interoperability by developing a conversion pipeline from JLS to the Akoma Ntoso (AKN) standard, enabling Japanese statutes to be integrated into international LegalDocML-based legislative databases. Building on this foundation, the second project applies multilingual embedding models and semantic textual similarity techniques to identify corresponding provisions across national legal systems. A prototype system combining multilingual embeddings, FAISS retrieval, and Cross-Encoder reranking generates candidate correspondences and visualizes them as cross-jurisdictional networks for exploratory comparative analysis.
Chinese Translation
本文提出了一个用于计算比较法的综合框架,通过连接两个基于日本法律标准(Japanese Legal Standard, JLS)XML模式的连续研究项目。第一个项目通过开发从JLS到Akoma Ntoso(AKN)标准的转换管道,建立了结构互操作性,使得日本法律法规能够整合到基于LegalDocML的国际立法数据库中。在此基础上,第二个项目应用多语言嵌入模型和语义文本相似性技术,以识别各国法律体系中对应的条款。一个结合多语言嵌入、FAISS检索和Cross-Encoder重排序的原型系统生成候选对应关系,并将其可视化为跨法域网络,以便进行探索性比较分析。
cs.CL / 83 / 2603.15117

MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge

MMKU-Bench:多模态更新基准用于多样化视觉知识
Fu, Baochen, Du, Yuntao, Chang, Cheng, Jin, Baihao, Deng, Wenzhi, Xu, Muhao, Yan, Hongmei, Song, Weiye, Wan, Yi
Abstract
As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to keep consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserves general capabilities but exhibits clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.
Chinese Translation
随着现实世界知识的不断演变,多模态模型在预训练过程中获得的参数知识越来越难以与现实世界知识保持一致。现有的多模态知识更新研究仅关注于学习先前未知的知识,而忽视了更新模型已经掌握但后来发生变化的知识的必要性;此外,评估仅限于同一模态,缺乏对跨模态一致性的系统分析。为了解决这些问题,本文提出了MMKU-Bench,这是一个全面的多模态知识更新评估基准,包含超过25,000个知识实例和49,000多张图像,涵盖了更新知识和未知知识两种场景,从而能够对不同知识类型的学习进行比较分析。在该基准上,我们评估了多种具有代表性的方法,包括监督微调(SFT)、基于人类反馈的强化学习(RLHF)和知识编辑(KE)。实验结果表明,SFT和RLHF容易出现灾难性遗忘,而KE更好地保留了通用能力,但在持续更新方面表现出明显的局限性。总体而言,MMKU-Bench为多模态知识更新提供了一个可靠且全面的评估基准,推动了该领域的进展。
cs.CL / 84 / 2603.15130

Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

英语、德语和巴伐利亚语中的间接问答:对高资源和低资源语言的挑战性任务
Winkler, Miriam, Blaschke, Verena, Plank, Barbara
Abstract
Indirectness is a common feature of daily communication, yet it is underexplored in NLP research for low-resource and high-resource languages alike. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset that contains artificial data generated by GPT-4o-mini. Based on several experimental variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa), we find that IQA is a pragmatically hard task that comes with various challenges. We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that IQA performance is poor in high- (English, German) and low-resource languages (Bavarian) and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.
Chinese Translation
间接性是日常交流中的一个普遍特征,但在低资源和高资源语言的自然语言处理(NLP)研究中仍未得到充分探索。间接问答(Indirect Question Answering, IQA)旨在对间接回答的极性进行分类。本文提出了两个质量不同的多语言IQA语料库,均涵盖英语、标准德语和巴伐利亚语(一种没有标准正字法的德语方言):InQA+,一个带有人工标注标签的小型高质量评估数据集,以及GenIQA,一个更大规模、包含由GPT-4o-mini生成的人工数据的训练数据集。基于与多语言Transformer模型(mBERT、XLM-R和mDeBERTa)进行的多种实验变体,我们发现IQA是一项在语用上具有挑战性的任务,面临各种挑战。我们提出并采用了应对这些挑战的建议。我们的结果显示,即使对于英语,性能也很低,并且存在严重的过拟合现象。我们分析了影响这些结果的各种因素,包括标签模糊性、标签集和数据集大小。我们发现IQA在高资源(英语、德语)和低资源语言(巴伐利亚语)中的表现都很差,并且拥有大量训练数据是有益的。此外,GPT-4o-mini在我们测试的任何语言中都不具备足够的语用理解能力来生成高质量的IQA数据。
cs.CL / 85 / 2603.15164

HindSight: Evaluating Research Idea Generation via Future Impact

HindSight: 通过未来影响评估研究创意生成
Jiang, Bo
Abstract
Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p = 0.584$), while HindSight shows the retrieval-augmented system produces 2.5× higher-scoring ideas ($p < 0.001$). Moreover, HindSight scores are negatively correlated with LLM-judged novelty ($\rho = -0.29$, $p < 0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
Chinese Translation
评估AI生成的研究创意通常依赖于大型语言模型(LLM)评审或人类小组评审——这两者都是主观的,且与实际研究影响脱节。我们介绍了HindSight,一个时间分割的评估框架,通过将生成的创意与未来真实出版物进行匹配,并根据引用影响和发表场所的接受情况为其评分,从而衡量创意质量。通过设置时间截止点$T$,我们限制创意生成系统只使用$T$之前的文献,然后将其输出与随后30个月内发表的论文进行对照评估。在10个AI/机器学习研究主题上的实验揭示了一个明显的脱节:LLM作为评审时,在检索增强与普通创意生成之间没有发现显著差异($p = 0.584$),而HindSight显示检索增强系统生成的创意评分高出2.5倍($p < 0.001$)。此外,HindSight的评分与LLM评审的新颖性呈负相关($\rho = -0.29$,$p < 0.01$),这表明LLM系统性地高估了那些在实际研究中从未实现的听起来新颖的创意。
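As a rough illustration of the time-split protocol, the sketch below matches an idea against post-cutoff papers and scores it by impact. The Jaccard matching, impact weighting, and threshold are invented for illustration and are not the paper's exact scoring function.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    keywords: set[str]
    citations: int
    top_venue: bool            # accepted at a selective venue?

def match_score(idea_keywords: set[str], paper: Paper) -> float:
    """Jaccard overlap as a hypothetical stand-in for semantic matching
    between a generated idea and a future publication."""
    union = idea_keywords | paper.keywords
    return len(idea_keywords & paper.keywords) / len(union) if union else 0.0

def hindsight_score(idea_keywords, future_papers, min_match=0.3):
    """Score an idea by the impact of its best-matching paper published
    after the temporal cutoff T (sketch, not the paper's exact formula)."""
    best = max(future_papers, key=lambda p: match_score(idea_keywords, p))
    m = match_score(idea_keywords, best)
    if m < min_match:
        return 0.0                       # idea never materialized post-T
    impact = best.citations + (50 if best.top_venue else 0)
    return impact * m

post_T = [
    Paper("retrieval-augmented agents", {"retrieval", "agents", "llm"}, 120, True),
    Paper("diffusion for tables", {"diffusion", "tables"}, 8, False),
]
print(hindsight_score({"retrieval", "llm", "planning"}, post_T))
```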
cs.CL / 86 / 2603.15187

The Hrunting of AI: Where and How to Improve English Dialectal Fairness

人工智能的Hrunting:如何以及在哪里改善英语方言的公平性
Li, Wei, de Wynter, Adrian
Abstract
It is known that large language models (LLMs) underperform in English dialects, and that improving them is difficult due to data scarcity. In this work we investigate how quality and availability impact the feasibility of improving LLMs in this context. For this, we evaluate three rarely-studied English dialects (Yorkshire, Geordie, and Cornish), plus African-American Vernacular English, and West Frisian as control. We find that human-human agreement when determining LLM generation quality directly impacts LLM-as-a-judge performance. That is, LLM-human agreement mimics the human-human agreement pattern, and so do metrics such as accuracy. This is an issue because LLM-human agreement measures an LLM's alignment with the human consensus, and hence raises questions about the feasibility of improving LLM performance in locales where low populations induce low agreement. We also note that fine-tuning does not eradicate, and might amplify, this pattern in English dialects. But we also find encouraging signals, such as some LLMs' ability to generate high-quality data, thus enabling scalability. We argue that data must be carefully evaluated to ensure fair and inclusive LLM improvement; and, in the presence of scarcity, new tools are needed to handle the pattern found.
Chinese Translation
已知大型语言模型(LLMs)在英语方言中的表现不佳,而由于数据稀缺,改善它们的难度较大。在本研究中,我们探讨了质量和可用性如何影响在此背景下改善LLMs的可行性。为此,我们评估了三种鲜有研究的英语方言(约克郡方言、乔迪方言和康沃尔方言),以及非洲裔美国人白话英语,并以西弗里西亚语作为对照。我们发现,在判定LLM生成质量时,人与人之间的标注一致性直接影响LLM作为评判者的表现。也就是说,LLM与人类的一致性模式模仿了人与人之间的一致性模式,准确率等指标也是如此。这是一个问题,因为LLM与人类的一致性衡量的是LLM与人类共识的对齐程度;因此,这引发了在低人口导致低一致性的地区改善LLM性能的可行性问题。我们还注意到,微调并未消除这种模式,反而可能在英语方言中加剧这一现象。但我们也发现了一些令人鼓舞的信号,例如某些LLMs能够生成高质量的数据,从而实现可扩展性。我们认为,必须仔细评估数据,以确保LLM的公平和包容性改善;在数据稀缺的情况下,需要新的工具来处理所发现的模式。
cs.CL / 87 / 2603.15206

Efficient Document Parsing via Parallel Token Prediction

通过并行标记预测实现高效文档解析
Li, Lei, Zhao, Ze, Li, Meng, Lun, Zhongwang, Yuan, Yi, Lu, Xingjing, Wei, Zheng, Bian, Jiang, Li, Zang
Abstract
Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
Chinese Translation
文档解析作为一项基础而重要的视觉任务,正受到视觉-语言模型(VLMs)的变革。然而,VLMs固有的自回归(AR)解码造成了显著的瓶颈,严重限制了解析速度。本文提出了一种并行标记预测(PTP)的方法,这是一种可插拔、与模型无关且简单有效的方法,使VLMs能够并行生成多个未来标记,从而提高样本效率。具体而言,我们在输入序列中插入一些可学习的标记,并设计相应的训练目标,以赋予模型文档解析的并行解码能力。此外,为了支持有效的训练,我们开发了一个全面的数据生成管道,能够高效地生成大规模、高质量的文档解析训练数据供VLMs使用。在OmniDocBench和olmOCR-bench上的大量实验表明,我们的方法不仅显著提高了解码速度(1.6倍至2.2倍),还减少了模型的幻觉现象,并展现出强大的泛化能力。
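A toy PyTorch sketch of the core idea: k learnable placeholder embeddings are appended to the input so one forward pass yields k future-token predictions. Module names, shapes, and the two-layer backbone are illustrative stand-ins, not the paper's actual PTP architecture.

```python
import torch
import torch.nn as nn

class ParallelTokenHead(nn.Module):
    """Minimal sketch of parallel multi-token prediction via
    learnable tokens appended to the input sequence."""
    def __init__(self, d_model: int, vocab: int, k: int = 4):
        super().__init__()
        self.placeholders = nn.Parameter(torch.randn(k, d_model) * 0.02)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); append the k learnable tokens.
        ph = self.placeholders.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.backbone(torch.cat([x, ph], dim=1))
        # Logits at the placeholder positions = k parallel predictions.
        return self.lm_head(h[:, -self.k:, :])

model = ParallelTokenHead(d_model=64, vocab=1000, k=4)
logits = model(torch.randn(2, 10, 64))
print(logits.shape)   # torch.Size([2, 4, 1000])
```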
cs.CL / 88 / 2603.15227

Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation

双向中英文被动句数据集用于机器翻译
Ma, Xinyue, Pastells, Pol, Farrús, Mireia, Taulé, Mariona
Abstract
Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. Regarding English-Chinese language pairs, passive sentences are constructed and distributed differently due to language variation, and thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text than by the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. Datasets and annotation script will be shared upon request.
Chinese Translation
机器翻译(MT)评估已超越传统指标,转向更具体的语言现象。对于英汉语言对,被动句由于语言差异而构造和分布不同,因此在机器翻译中需要特别关注。本文提出了一个双向多领域的被动句数据集,该数据集从五个中英文平行语料库中提取,并根据人工翻译自动标注结构标签,同时提供了一个经过人工验证注释的测试集。该数据集包含73,965对平行句子(2,358,731个英文单词,3,498,229个中文字符)。我们使用该数据集对两种最先进的开源机器翻译系统进行评估,并使用测试集对四种商业模型进行评估。结果表明,与人类不同,模型更受源文本语态的影响,而不是源语言的一般语态使用,因此在双向翻译中倾向于保持被动语态。然而,模型对中文被动句的低频率和主要负面语境表现出一定的了解,导致在英译中时与人类翻译者的语态一致性高于中译英时。商业神经机器翻译模型在指标评估中得分较高,但大型语言模型(LLMs)在使用多样化替代翻译方面表现出更好的能力。数据集和注释脚本将在请求时共享。
cs.CL / 89 / 2603.15245

Practicing with Language Models Cultivates Human Empathic Communication

与语言模型的练习培养人类的共情沟通
Kumar, Aakriti, Poungpeth, Nalin, Yang, Diyi, Lambert, Bruce, Groh, Matthew
Abstract
Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants' communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect that people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.
Chinese Translation
共情是人类连接的核心,但人们常常难以有效表达它。在盲评中,大型语言模型(LLMs)生成的回应常常被评判为比人类撰写的回应更具共情性。然而,当回应被归因于人工智能时,接受者感到的被倾听和被验证的程度低于当类似回应归因于人类时。为了探讨并解决这一共情沟通技能的差距,我们构建了“倾听者”(Lend an Ear),一个实验性对话平台,参与者被要求向扮演个人和职场困扰的LLM提供共情支持。从968名参与者与他们的LLM对话伙伴之间的2,904个文本对话中收集的33,938条消息中,我们得出了自然对话中习惯性共情表达的数据驱动分类法。基于一个预注册的随机实验,我们提供了证据,表明一个简短的LLM辅导干预,提供关于如何有效沟通共情的个性化反馈,显著提升了参与者的沟通模式与规范共情沟通模式的一致性,相较于对照组和接受了基于视频但非个性化反馈的组。此外,我们发现了一个沉默的共情效应,即人们感受到共情但系统性地未能表达出来。尽管如此,参与者可靠地识别出符合规范共情沟通标准的回应更能表达共情。总之,这些结果推动了对共情如何被表达和重视的科学理解,并展示了一种可扩展的基于人工智能的干预方法,用于支撑和培养共情。
cs.CL / 90 / 2603.15270

From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding

从文档到跨度:基于代码的学习用于基于LLM的ICD编码
Zhang, Xu, Ma, Wenxin, Wu, Chenxu, Wang, Rongsheng, Zhang, Kun, Zhou, S. Kevin
Abstract
ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs' ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
Chinese Translation
ICD编码是医疗保健中一项关键但具有挑战性的任务。最近,基于LLM(大语言模型)的方法在ICD编码中表现出比判别方法更强的泛化能力。然而,针对ICD编码对LLM进行微调面临三个主要挑战。首先,现有的公共ICD编码数据集对ICD代码空间的覆盖有限,限制了模型对未见代码的泛化能力。其次,简单的微调降低了LLM的可解释性,因为很少有公共数据集包含分配代码的明确支持证据。第三,ICD编码通常涉及长篇临床文档,使得对LLM进行微调在计算上代价高昂。为了解决这些问题,我们提出了基于代码的学习(Code-Centric Learning),这是一个将监督从完整的临床文档转移到可扩展的短证据跨度的训练框架。该框架的关键思想是,跨度级学习提高了LLM进行文档级ICD编码的能力。我们提出的框架包括混合训练策略和基于代码的数据扩展,显著降低了训练成本,提高了对未见ICD代码的准确性,并保持了可解释性。在相同的LLM基础上,我们的方法显著优于强基线。值得注意的是,我们的方法使小规模LLM能够达到与更大专有模型相当的性能,证明了其在完全自动化ICD编码中的有效性和潜力。
cs.CL / 91 / 2603.15295

Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

跨语言动词交替的数据集:BLM 模板和数据增强策略
Samo, Giuseppe, Merlo, Paola
Abstract
Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of Blackbird Language Matrices (BLM) problems. The BLM task -- an RPM/ARC-like task devised specifically for language -- is a controlled linguistic puzzle where models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically-informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew that demonstrate the diagnostic usefulness of the datasets.
Chinese Translation
大型语言模型(LLMs)在各种基于句子的语言现象中表现出显著的性能,但它们捕捉跨句子范式模式(如动词交替)的能力仍然未得到充分探索。在本研究中,我们呈现了为四种语言精心策划的基于范式的数据集,旨在探讨动词交替的系统性跨句子知识(包括英语、德语和意大利语中的状态变化和省略宾语结构,以及希伯来语的 binyanim)。这些数据集包含数千个黑鸟语言矩阵(Blackbird Language Matrices,BLMs)问题。BLM 任务是一种专为语言设计的受控语言难题,类似于 RPM/ARC 任务,模型必须选择符合句法和语义规则的句子以完成模式。我们引入了三种复杂度不同的模板,并在合成和自然数据上应用了语言学启发的数据增强策略。我们提供了英语、意大利语、德语和希伯来语的简单基线性能结果,展示了这些数据集的诊断实用性。
cs.CL / 92 / 2603.15309

CCTU: A Benchmark for Tool Use under Complex Constraints

CCTU:复杂约束下工具使用的基准测试
Ye, Junjie, Zhang, Guoqiang, Fu, Wenjie, Gui, Tao, Zhang, Qi, Huang, Xuanjing
Abstract
Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and enforces compliance during multi-turn interactions between models and their environments. We evaluate nine state-of-the-art LLMs in both thinking and non-thinking modes. Results indicate that when strict adherence to all constraints is required, no model achieves a task completion rate above 20%. Further analysis reveals that models violate constraints in over 50% of cases, particularly in the resource and response dimensions. Moreover, LLMs demonstrate limited capacity for self-refinement even after receiving detailed feedback on constraint violations, highlighting a critical bottleneck in the development of robust tool-use agents. To facilitate future research, we release the data and code.
Chinese Translation
在明确的约束条件下通过工具使用解决问题是大语言模型(LLMs)面临的一种高度挑战却又不可避免的场景,这需要诸如函数调用、指令跟随和自我优化等能力。然而,由于缺乏专门的评估,进展受到阻碍。为了解决这个问题,我们提出了CCTU,这是一个用于评估LLM在复杂约束下工具使用的基准。CCTU基于涵盖四个维度(即资源、行为、工具集和响应)的12类约束分类法。该基准包括200个精心策划且具有挑战性的测试案例,涉及多样的工具使用场景,每个案例平均包含七种约束类型,平均提示长度超过4700个标记。为了确保可靠的评估,我们开发了一个可执行的约束验证模块,该模块在模型与其环境之间的多轮交互中进行逐步验证并强制遵守约束。我们在思维与非思维模式下评估了九个先进的LLM。结果表明,当严格遵守所有约束时,没有任何模型的任务完成率超过20%。进一步分析显示,模型在超过50%的案例中违反约束,特别是在资源和响应维度。此外,LLMs在接收到有关约束违反的详细反馈后仍表现出有限的自我优化能力,这突显了开发强大工具使用代理的关键瓶颈。为了促进未来的研究,我们发布了数据和代码。
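The step-level constraint checking the abstract mentions might look roughly like the following sketch. The three constraint fields are invented examples drawn from the four dimensions (resource, behavior, toolset, response), not CCTU's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Constraints:
    """Illustrative constraint categories (resource/toolset/response)."""
    allowed_tools: set[str]        # toolset dimension
    max_calls: int                 # resource dimension
    max_answer_chars: int          # response dimension

@dataclass
class Validator:
    spec: Constraints
    calls_made: int = 0
    violations: list[str] = field(default_factory=list)

    def check_call(self, tool: str) -> bool:
        """Step-level validation of a single tool call."""
        self.calls_made += 1
        if tool not in self.spec.allowed_tools:
            self.violations.append(f"toolset: {tool} not allowed")
        if self.calls_made > self.spec.max_calls:
            self.violations.append("resource: call budget exceeded")
        return not self.violations

    def check_answer(self, answer: str) -> bool:
        if len(answer) > self.spec.max_answer_chars:
            self.violations.append("response: answer too long")
        return not self.violations

v = Validator(Constraints({"search", "calculator"}, max_calls=2, max_answer_chars=80))
v.check_call("search")
v.check_call("browser")        # violates the toolset constraint
print(v.violations)
```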
cs.CL / 93 / 2603.15317

PYTHEN: A Flexible Framework for Legal Reasoning in Python

PYTHEN:一个基于Python的灵活法律推理框架
Nguyen, Ha-Thanh, Satoh, Ken
Abstract
This paper introduces PYTHEN, a novel Python-based framework for defeasible legal reasoning. PYTHEN is designed to model the inherently defeasible nature of legal argumentation, providing a flexible and intuitive syntax for representing legal rules, conditions, and exceptions. Inspired by PROLEG (PROlog-based LEGal reasoning support system) and guided by the philosophy of The Zen of Python, PYTHEN leverages Python's built-in any() and all() functions to offer enhanced flexibility by natively supporting both conjunctive (ALL) and disjunctive (ANY) conditions within a single rule, as well as a more expressive exception-handling mechanism. This paper details the architecture of PYTHEN, provides a comparative analysis with PROLEG, and discusses its potential applications in autoformalization and the development of next-generation legal AI systems. By bridging the gap between symbolic reasoning and the accessibility of Python, PYTHEN aims to democratize formal legal reasoning for young researchers, legal tech developers, and professionals without extensive logic programming expertise. We position PYTHEN as a practical bridge between the powerful symbolic reasoning capabilities of logic programming and the rich, ubiquitous ecosystem of Python, making formal legal reasoning accessible to a broader range of developers and legal professionals.
Chinese Translation
本文介绍了PYTHEN,一个基于Python的新型可推翻法律推理框架。PYTHEN旨在模拟法律论证固有的可推翻特性,提供灵活且直观的语法来表示法律规则、条件和例外。受PROLEG(基于PROlog的法律推理支持系统)的启发,并遵循《Python之禅》的理念,PYTHEN利用Python内置的any()和all()函数,通过在单个规则中原生支持合取(ALL)和析取(ANY)条件,提供增强的灵活性,以及更具表现力的例外处理机制。本文详细描述了PYTHEN的架构,提供了与PROLEG的比较分析,并讨论了其在自动形式化和下一代法律人工智能系统开发中的潜在应用。通过弥合符号推理与Python易用性之间的差距,PYTHEN旨在为年轻研究人员、法律科技开发者以及不具备深厚逻辑编程专业知识的专业人士普及形式化法律推理。我们将PYTHEN定位为逻辑编程强大的符号推理能力与Python丰富、普遍的生态系统之间的实用桥梁,使形式化法律推理为更广泛的开发者和法律专业人士所用。
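The any()/all() pattern is easy to picture with a hypothetical mini-rule in plain Python; the fact encoding and predicate names below are invented for illustration and are not PYTHEN's actual API.

```python
# A PROLEG-style defeasible rule sketched with Python built-ins.
facts = {"contract_signed", "payment_made"}

def holds(p: str) -> bool:
    return p in facts

def obligation_to_deliver() -> bool:
    conditions = all([holds("contract_signed"),      # conjunctive (ALL)
                      any([holds("payment_made"),    # disjunctive (ANY)
                           holds("credit_approved")])])
    exceptions = any([holds("contract_void"),        # defeasible part:
                      holds("force_majeure")])       # exceptions defeat the rule
    return conditions and not exceptions

print(obligation_to_deliver())   # True
facts.add("force_majeure")
print(obligation_to_deliver())   # False: the exception defeats the rule
```

Mixing ALL and ANY inside one rule is exactly the flexibility the abstract attributes to the framework, and the separate exception clause captures the defeasible reading of legal rules.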
cs.CL / 94 / 2603.15326

Tagarela - A Portuguese speech dataset from podcasts

Tagarela - 来自播客的葡萄牙语语音数据集
de Oliveira, Frederico Santos, Gris, Lucas Rafael Stefanel, Ferreira, Alef Iury Siqueira, da Rosa, Augusto Seben, Filho, Alexandre Costa Ferro, Casanova, Edresson, Shulby, Christopher Dane, Sousa, Rafael Teixeira, Silva, Diogo Fernandes Costa, Soares, Anderson da Silva, Filho, Arlindo Rodrigues Galvão
Abstract
Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10k hours), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at https://freds0.github.io/TAGARELA/, to foster the development of robust speech technologies.
Chinese Translation
尽管语音处理领域取得了显著进展,由于缺乏公开的、大规模的、高质量的数据集,葡萄牙语仍然处于资源匮乏的状态。为了解决这一问题,我们提出了一个新数据集,名为 TAGARELA,由超过 8,972 小时的播客音频组成,专门用于训练自动语音识别(ASR)和文本到语音(TTS)模型。值得注意的是,其规模可与英语的 GigaSpeech(约1万小时)相媲美,使得建立最先进的葡萄牙语模型成为可能。为了确保数据质量,该语料库经过音频预处理流程,并随后采用混合策略进行转录:我们使用了在由专有 API 生成的高保真转录上进行过训练的 ASR 模型,从而确保了较高的初始准确性。最后,为了验证这一新资源的有效性,我们展示了仅在我们数据集上训练的 ASR 和 TTS 模型,并评估了它们的性能,展示了其在推动葡萄牙语更加稳健和自然的语音技术发展方面的潜力。该数据集公开发布,地址为 https://freds0.github.io/TAGARELA/,旨在促进稳健语音技术的发展。
cs.CL / 95 / 2603.15340

DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

DOS:面向依赖的采样器用于掩蔽扩散语言模型
Zhou, Xueyu, Hu, Yangrong, Huang, Jian
Abstract
Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.
Chinese Translation
掩蔽扩散语言模型(MDLMs)最近作为语言建模的新范式出现,提供了灵活的生成动态并实现高效的并行解码。然而,现有的预训练MDLM解码策略主要依赖于标记级的不确定性标准,而在很大程度上忽视了序列级信息和标记间的依赖关系。为了解决这一局限性,我们提出了面向依赖的采样器(DOS),这是一种无需训练的解码策略,利用标记间的依赖关系来指导生成过程中的标记更新。具体而言,DOS利用变换器块中的注意力矩阵来近似标记间的依赖关系,在更新被掩蔽位置时强调来自未掩蔽标记的信息。实证结果表明,DOS在代码生成和数学推理任务上始终表现出色。此外,DOS可以与现有的并行采样方法无缝集成,从而提高生成效率而不牺牲生成质量。
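A rough NumPy sketch of the intuition: approximate each masked position's dependency on unmasked tokens by its attention mass on them, and update the best-supported masked positions first. The exact criterion in DOS may differ; the attention matrix here is random purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len = 8
masked = np.array([False, True, False, True, True, False, True, False])

# Hypothetical attention matrix from a transformer block
# (rows = queries, cols = keys), row-stochastic after softmax.
logits = rng.normal(size=(seq_len, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Dependency proxy: each masked position's attention mass on
# the currently *unmasked* tokens.
support = attn[:, ~masked].sum(axis=1)
support[~masked] = -np.inf          # only rank masked positions
order = np.argsort(-support)        # strongest support first
to_update = [int(i) for i in order if masked[i]][:2]
print("update these masked positions next:", to_update)
```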
cs.CL / 96 / 2603.15389

When Does Sparsity Mitigate the Curse of Depth in LLMs

稀疏性何时缓解大模型的深度诅咒
Muhtar, Dilxat, Song, Xinyuan, Pokutta, Sebastian, Zimmer, Max, Pelleriti, Nico, Hofmann, Thomas, Liu, Shiwei
Abstract
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
Chinese Translation
最近的研究表明,大语言模型(LLMs)中存在深度诅咒现象,即后续层对学习和表示的贡献低于早期层。这种低效利用与预层归一化中的方差累积增长有关,这可能使深层块趋向近似恒等的行为。本文证明,稀疏性不仅带来效率提升,还充当方差传播的调节器,从而改善深层的利用效率。我们的研究涵盖了两种稀疏性来源:(i)隐式稀疏性,源自训练和数据条件,包括由权重衰减引起的权重稀疏性和由长上下文输入引起的注意力稀疏性;(ii)显式稀疏性,由架构设计强制引入,包括分组查询注意力中的键/值共享稀疏性和混合专家中的专家激活稀疏性。我们的论点得到了受控深度缩放实验和针对性层效能干预的充分支持。在不同设置中,我们观察到一个一致的关系:稀疏性通过降低输出方差和促进功能差异化来改善层的利用率。最终,我们将这些发现提炼为一条训练深度有效的大语言模型的实用经验法则,在下游任务上实现了显著的4.6%的准确率提升。我们的结果揭示了自然源自标准设计选择的稀疏性,是大模型有效深度缩放的关键但此前被忽视的机制。代码可在 https://github.com/pUmpKin-Co/SparsityAndCoD 获取。
cs.CL / 97 / 2603.15402

A Closer Look into LLMs for Table Understanding

深入探讨大型语言模型在表格理解中的应用
Wang, Jia, Qin, Chuanyu, Zheng, Mingyu, Si, Qingyi, Li, Peize, Lin, Zheng
Abstract
Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions including the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern -- early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
Chinese Translation
尽管大型语言模型(LLMs)在表格理解方面取得了成功,但其内部机制仍然不清晰。本文对16个LLMs进行了实证研究,涵盖了通用LLMs、专业表格LLMs和专家混合模型(Mixture-of-Experts, MoE),旨在探讨LLMs如何理解表格数据并执行下游任务。我们的分析集中在四个维度,包括注意力动态、有效层深度、专家激活以及输入设计的影响。主要发现包括:(1)LLMs遵循三阶段的注意力模式——早期层广泛扫描表格,中间层定位相关单元,后期层放大其贡献;(2)表格任务需要比数学推理更深的层次才能达到稳定的预测;(3)MoE模型在中间层激活特定于表格的专家,而早期和后期层共享通用专家;(4)链式思维提示增加了对表格的注意力,进一步通过表格调优得以增强。我们希望这些发现和见解能够促进表格相关任务的可解释性和未来研究。
cs.CL / 98 / 2603.15405

Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models

Fusian:用于大语言模型中细粒度连续MBTI人格控制的多LoRA融合
Chen, Zehao, Pan, Rong
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., "Extroverted" vs. "Introverted"), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model's output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
Chinese Translation
大型语言模型(LLMs)在模拟多样化的人类行为和个性方面展现了令人印象深刻的能力。然而,现有的人格控制方法,包括提示工程和标准监督微调(SFT),通常将人格特征视为离散类别(例如,“外向”与“内向”),缺乏在连续光谱上精确控制特征强度的能力。本文介绍了Fusian,一种用于LLMs中细粒度、连续人格控制的新框架。Fusian分为两个阶段:第一阶段是轨迹收集(Trajectory Collection),我们通过保存一系列LoRA适配器捕捉SFT过程中人格采纳的动态演变,有效地映射特征的连续流形;第二阶段是基于强化学习的动态融合(RL-based Dynamic Fusion),我们使用强化学习训练一个策略网络,以动态计算这些冻结适配器的混合权重。通过从由策略网络参数化的狄利克雷分布中采样,Fusian融合多个适配器,使模型的输出与特定数值目标强度对齐。在Qwen3-14B模型上的实验表明,Fusian在人格控制方面实现了高精度,显著优于基线方法,能够更好地与用户指定的特征强度对齐。
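The Dirichlet-based fusion step can be sketched in a few lines. The adapters below are plain weight deltas for a single layer and the policy is a hand-written placeholder, purely to illustrate the sampling-and-mixing mechanics rather than Fusian's trained components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three frozen "adapters", here just weight deltas for one layer;
# in Fusian these are LoRA checkpoints saved along the SFT trajectory.
adapters = [rng.normal(scale=0.01, size=(16, 16)) for _ in range(3)]
base_weight = rng.normal(size=(16, 16))

def policy_concentration(target_intensity: float) -> np.ndarray:
    """Hypothetical stand-in for the trained policy network: map a
    target trait intensity in [0, 1] to Dirichlet concentrations."""
    return np.array([1.0 + 5 * (1 - target_intensity),
                     2.0,
                     1.0 + 5 * target_intensity])

def fuse(target_intensity: float) -> np.ndarray:
    alpha = policy_concentration(target_intensity)
    mix = rng.dirichlet(alpha)               # sample mixing weights
    delta = sum(w * a for w, a in zip(mix, adapters))
    return base_weight + delta               # fused layer weight

print(fuse(0.2).shape, fuse(0.9).shape)
```

Low target intensities concentrate probability mass on early-trajectory adapters and high intensities on late ones, which is the mechanism by which a continuous intensity target steers the fused model.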
cs.CL / 99 / 2603.15409

SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

SEA-Vision:东南亚综合文档与场景文本理解的多语言基准
Yue, Pengfei, Zhao, Xingran, Chen, Juntao, Hou, Peng, Longchao, Wang, Lin, Jianghang, Zhang, Shengchuan, Zeng, Anxiang, Cao, Liujuan
Abstract
Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
Chinese Translation
多语言文档与场景文本理解在搜索、金融和公共服务等应用中扮演着重要角色。然而,现有的大多数基准集中于高资源语言,未能在现实的多语言环境中评估模型。在东南亚,语言的多样性、复杂的书写系统以及高度多样化的文档类型使这一挑战更加严峻。我们推出了SEA-Vision,一个联合评估文档解析和文本中心视觉问答(TEC-VQA)的基准,涵盖11种东南亚语言。SEA-Vision包含来自九种代表性文档类型的15,234个文档解析页面,并附有分层的页面、区块和行级标签。它还提供了7,496个TEC-VQA问答对,探讨文本识别、数值计算、比较分析、逻辑推理和空间理解。为了使这种多语言、多任务的标注变得可行,我们设计了一个文档解析和TEC-VQA的混合流程。该流程结合了自动过滤和评分、MLLM辅助标注以及轻量级母语者验证,大大减少了人工标注的工作量,同时保持高质量。我们评估了几种领先的多模态模型,并观察到在低资源东南亚语言上的性能显著下降,突显了多语言文档与场景文本理解中仍存在的重大差距。我们相信SEA-Vision将推动全球在文档与场景文本理解方面的进展。
cs.CL / 100 / 2603.15421

CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

CLAG:通过代理驱动聚类实现小型语言模型代理的自适应记忆组织
Roh, Taeyun, Jang, Wonjune, Jung, Junha, Kang, Jaewoo
Abstract
Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
Chinese Translation
大型语言模型代理在支持知识重用和复杂推理任务时,严重依赖外部记忆。然而,大多数记忆系统将经验存储在单一的全球检索池中,这可能逐渐稀释或损坏存储的知识。这个问题在小型语言模型(SLMs)中尤为明显,因为它们对无关上下文高度敏感。我们提出了CLAG,一个基于聚类的代理记忆框架,其中SLM代理通过聚类主动组织记忆。CLAG采用SLM驱动的路由器,将传入的记忆分配到语义一致的聚类中,并自主生成特定于聚类的概况,包括主题摘要和描述标签,以将每个聚类建立为一个自包含的功能单元。通过在这些结构化邻域内执行局部演化,CLAG有效减少了跨主题干扰,并增强了内部记忆密度。在检索过程中,该框架利用两阶段过程,首先通过聚类概况过滤相关聚类,从而排除干扰项并减少搜索空间。在多个问答数据集上进行的实验表明,CLAG在答案质量和鲁棒性方面始终优于之前的代理记忆系统,同时保持轻量和高效。
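The two-stage retrieval can be pictured with a toy store. The cluster tags and memories below are invented, and real CLAG uses SLM-generated profiles and semantic retrieval rather than the keyword overlap used here.

```python
# Toy memory store: cluster id -> (profile tags, memories).
clusters = {
    "cooking": ({"recipe", "kitchen", "food"},
                ["the user is vegetarian", "prefers spicy food"]),
    "travel":  ({"flight", "hotel", "trip"},
                ["the user collects airline miles"]),
}

def overlap(query_terms: set[str], tags: set[str]) -> int:
    return len(query_terms & tags)

def retrieve(query_terms: set[str], top_clusters: int = 1) -> list[str]:
    """Two-stage retrieval: (1) filter clusters by profile match,
    (2) search only inside the surviving clusters."""
    ranked = sorted(clusters.items(),
                    key=lambda kv: overlap(query_terms, kv[1][0]),
                    reverse=True)[:top_clusters]
    hits = []
    for _, (tags, memories) in ranked:
        hits.extend(m for m in memories
                    if overlap(query_terms, set(m.split())) > 0)
    return hits

print(retrieve({"food", "recipe", "spicy"}))
```

Because stage one discards whole clusters whose profiles do not match, distractor memories from unrelated topics never enter the search space, which is the interference reduction the abstract describes.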
cs.CL / 101 / 2603.15423

Invisible failures in human-AI interactions

人机交互中的隐性失败
Potts, Christopher, Sudhof, Moritz
Abstract
AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes
Chinese Translation
AI系统的失败往往是无声的,发生频率远高于显性失败。在对WildChat数据集中人机交互进行的大规模定量分析中,我们发现78%的AI失败是隐性的:虽然出现了问题,但用户并没有明确表示存在问题。这些隐性失败聚集成八种原型,帮助我们描述AI系统在何处以及如何未能满足用户需求。此外,这些原型显示出系统性的共现模式,指示出更高层次的失败类型。为了探讨这些原型在AI系统变得更强大时是否仍然相关,我们还评估了失败是主要由交互因素还是能力驱动,发现91%的失败涉及交互动态,并且我们估计即使在更强大的模型下,94%的此类失败仍将持续存在。最后,我们展示了这些原型如何帮助我们识别不同使用领域中系统性和可变的AI局限性。总体而言,我们认为我们的隐性失败分类法可以成为产品开发者、科学家和政策制定者在可靠失败监测中的关键组成部分。我们的代码和数据可在 https://github.com/bigspinai/bigspin-invisible-failure-archetypes 获取。
cs.CL / 102 / 2603.15513

ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models

ViX-Ray:一个用于视觉语言模型的越南胸部X光数据集
Nguyen, Duy Vu Minh, Truong, Chinh Thanh, Tran, Phuc Hoang, Le, Hung Tuan, Dat, Nguyen Van-Thanh, Pham, Trung Hieu, Van Nguyen, Kiet
Abstract
Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.
Chinese Translation
越南医学研究已成为一个日益重要的领域,特别是在旨在减少临床诊断中时间和资源负担的智能技术兴起的背景下。近年来,视觉语言模型(VLMs)的进展,例如Gemini和GPT-4V,激发了将人工智能应用于医疗保健的日益兴趣。然而,现有的大多数VLMs缺乏对越南医学数据的接触,限制了它们为越南患者生成准确且符合上下文的诊断输出的能力。为了解决这一挑战,我们推出了ViX-Ray,一个新颖的数据集,包含5400张越南胸部X光图像,并附有来自越南一家主要医院的专家撰写的发现和印象。我们分析了数据集中的语言模式,包括提及的身体部位和诊断的频率,以识别越南放射学报告的领域特定语言特征。此外,我们在ViX-Ray上微调了五个最先进的开源VLM,并将其性能与领先的专有模型GPT-4V和Gemini进行了比较。我们的结果表明,尽管几个模型生成的输出与临床真实情况部分一致,但它们在印象生成方面往往存在低精度和过度幻觉的问题。这些发现不仅展示了我们数据集的复杂性和挑战性,还确立了ViX-Ray作为评估和推动越南临床领域视觉语言模型发展的有价值基准。
cs.CL / 103 / 2603.15518

Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models

超越协方差陷阱:解锁大语言模型在同一主题知识编辑中的泛化能力
Liu, Xiyu, Si, Qingyi, Liu, Zhengxiao, Yang, Chenxu, Gu, Naibin, Lin, Zheng
Abstract
While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model's geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same-subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction-following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.
Chinese Translation
虽然定位-然后编辑的知识编辑方法能够有效更新大型语言模型(LLMs)中编码的知识,但在实际的同一主题知识编辑场景中出现了一种关键的泛化失败模式:尽管模型能够成功回忆起原始编辑形式的知识,但在遵循用户指令时,却无法回忆起已更新的知识。本文识别出这种泛化崩溃的几何根源在于一个根本性的冲突,即由提示变化引发的内在激活漂移超过了模型在编辑后对泛化的几何容忍度。我们将这种不稳定性归因于双重病理:(1) 带有正交梯度的联合优化使解坍缩为稳定范围狭窄的尖锐极小值,(2) 标准协方差约束悖论性地充当了一个协方差陷阱,放大了输入扰动。为了解决这个问题,我们引入了RoSE(稳健的同一主题编辑),该方法采用各向同性几何对齐来最小化表征偏差,并使用层次知识集成来平滑优化景观。广泛的实验表明,RoSE显著提高了模型的指令遵循能力,为大型语言模型代理的稳健交互参数记忆奠定了基础。
cs.CL / 104 / 2603.15523

SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

SlovKE:用于斯洛伐克关键短语提取的大规模数据集与LLM评估
Števaňák, David, Šuppa, Marek
Abstract
Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match $F1@6$, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact-to-partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ($\kappa = 0.61$) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.
Chinese Translation
对于形态丰富、资源稀缺的语言,关键短语提取仍然是一个研究不足的领域,主要原因在于合适评估数据集的稀缺性。为了填补斯洛伐克语的这一空白,我们构建了一个包含227,432个科学摘要及作者分配的关键短语的数据集,这些数据来自斯洛伐克中央论文注册处,经过抓取和系统清理,数据量较之前最大的斯洛伐克语资源增加了25倍,接近KP20K等成熟英语基准的规模。利用该数据集,我们基准测试了三种无监督基线(YAKE、TextRank、使用SlovakBERT嵌入的KeyBERT),并评估了KeyLLM,一种采用GPT-3.5-turbo、基于LLM(大语言模型)的提取方法。无监督基线的精确匹配$F1@6$最高仅为11.6%,与部分匹配(最高51.5%)存在较大差距,反映了将屈折的表面形式与作者分配的关键短语相匹配的困难。KeyLLM缩小了这一精确与部分匹配之间的差距,生成的关键短语更接近作者分配的规范形式;同时,对100个文档的人工评估($\kappa = 0.61$)证实,KeyLLM捕捉到了自动精确匹配所低估的相关概念。我们的分析表明,形态不匹配是统计方法的主要失败模式,这一发现对其他屈折语言同样适用。该数据集(https://huggingface.co/datasets/NaiveNeuron/SlovKE)和评估代码(https://github.com/NaiveNeuron/SlovKE)均公开可用。
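The exact-versus-partial gap is easy to reproduce with a toy scorer. The token-overlap notion of "partial match" below is a simplification of the paper's matching, and the Slovak examples are invented; note how the first prediction misses even under partial matching because its inflected tokens differ, the morphological-mismatch failure mode the abstract highlights.

```python
def f1_at_k(predicted: list[str], gold: set[str], k: int = 6,
            partial: bool = False) -> float:
    """Exact vs. partial keyphrase F1@k (sketch). Partial matching
    counts a prediction as correct if it shares a surface token with
    any gold keyphrase."""
    preds = predicted[:k]
    if partial:
        def hit(p): return any(set(p.split()) & set(g.split()) for g in gold)
    else:
        def hit(p): return p in gold
    tp = sum(hit(p) for p in preds)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"strojové učenie", "extrakcia kľúčových fráz"}
preds = ["strojového učenia",   # inflected form: no token matches
         "extrakcia fráz",      # partial hit, not an exact one
         "neural networks"]
print(f1_at_k(preds, gold), f1_at_k(preds, gold, partial=True))  # 0.0 0.4
```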
cs.CL / 105 / 2603.15547

Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

大型语言模型能否模拟错误的学生推理?关于干扰项生成的案例研究
Zengaffinen, Yanick, Opedal, Andreas, Rooein, Donya, Srivatsa, Kv Aditya, Sonkar, Shashank, Sachan, Mrinmaya
Abstract
Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.
Chinese Translation
模拟合理的学生误解对于教育中的人工智能至关重要。在本研究中,我们考察了大型语言模型(LLMs)在生成多项选择题干扰项时如何推理误解,这一任务要求通过协调解决方案知识、模拟学生误解和评估合理性来建模不正确但合理的答案。我们引入了一种分类法来分析最先进的LLMs所使用的策略,考察它们的推理过程,并将其与学习科学中的既定最佳实践进行比较。我们的结构化分析揭示了它们的过程与最佳实践之间的惊人一致性:模型通常首先正确解决问题,然后阐明并模拟多种潜在的误解,最后选择一组干扰项。对失败模式的分析表明,错误主要源于恢复正确解决方案和在响应候选者中选择时的失败,而不是模拟错误或结构化过程。与这些结果一致,我们发现,在提示中提供正确解决方案可以将与人类撰写的干扰项的对齐度提高8%,突显了在生成合理的错误学生推理时锚定于正确解决方案的重要作用。总体而言,我们的分析为LLMs模拟错误学生推理和生成高质量干扰项的能力提供了一个结构化且可解释的视角。
cs.CL / 106 / 2603.15611

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Code-A1:通过强化学习对代码 LLM 和测试 LLM 的对抗性共同演化
Wang, Aozhe, Yan, Yuchen, Zhou, Nan, Lu, Zhengxi, Lu, Weiming, Xiao, Jun, Zhuang, Yueting, Shen, Yongliang
Abstract
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
Chinese Translation
代码生成的强化学习依赖于来自单元测试通过率的可验证奖励。然而,高质量的测试套件稀缺,现有数据集的覆盖范围有限,静态奖励无法随着模型的改进而调整。最近的自我对弈方法将代码和测试生成统一在一个模型中,但面临固有的困境:白盒访问导致自我串通,模型生成简单的测试以获取轻松的奖励,而黑盒限制则产生通用测试,无法发现特定实现的错误。我们提出了 Code-A1,一种对抗性共同演化框架,联合优化具有对立目标的代码 LLM 和测试 LLM。代码 LLM 因通过更多测试而获得奖励,而测试 LLM 则因揭示更多缺陷而获得奖励。这种架构分离消除了自我串通的风险,并安全地实现了白盒测试生成,测试 LLM 可以检查候选代码以制定针对性的对抗性测试。我们进一步引入了 Mistake Book 机制用于经验重放,以及一种复合奖励机制,在测试有效性与对抗难度之间进行平衡。在 Qwen2.5-Coder 模型上的实验表明,Code-A1 实现的代码生成性能与基于人工标注测试训练的模型相匹配或超过,同时显著提高了测试生成能力。
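The opposing objectives can be captured in a few lines. Treating the rewards as zero-sum over valid tests, and validating tests against a reference solution, are assumptions for illustration rather than the paper's exact reward design.

```python
def run(candidate, test) -> bool:
    """Execute one (function, test) pair; True = test passes."""
    try:
        return bool(test(candidate))
    except Exception:
        return False

def opposing_rewards(candidate, reference, tests):
    """Sketch of the adversarial objective: the Code LLM is paid for
    passing tests, the Test LLM for *valid* tests (ones a reference
    solution passes) that the candidate fails."""
    valid = [t for t in tests if run(reference, t)]
    passed = sum(run(candidate, t) for t in valid)
    code_reward = passed / len(valid) if valid else 0.0
    test_reward = 1.0 - code_reward           # zero-sum over valid tests
    return code_reward, test_reward

reference = lambda x: abs(x)
buggy = lambda x: x                           # wrong for negative inputs
tests = [lambda f: f(3) == 3, lambda f: f(-4) == 4]
print(opposing_rewards(buggy, reference, tests))   # (0.5, 0.5)
```

Filtering on the reference solution is what removes the self-collusion incentive: a trivial or broken test earns the Test LLM nothing, so its only path to reward is a valid test that exposes a real defect.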
cs.CL / 107 / 2603.15615

Mechanistic Origin of Moral Indifference in Language Models

语言模型中道德冷漠的机制起源
Li, Lingyu, Teng, Yan, Wang, Yingchun
Abstract
Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.
Chinese Translation
现有的大型语言模型(LLMs)的行为对齐技术往往忽视了表面合规性与内部未对齐表征之间的差异,使得LLMs易受长尾风险的影响。更重要的是,我们认为LLMs由于将不同的道德概念压缩为统一的概率分布,固有地处于一种道德冷漠状态。我们通过利用基于原型理论和社会化学101数据集构建的251k道德向量,验证并修正了LLMs潜在表征中的这种冷漠。首先,我们对23个模型的分析表明,当前的LLMs未能表示对立道德类别之间的区别以及这些类别内的细粒度典型性梯度;值得注意的是,无论是模型规模、架构,还是显式对齐都未能改变这种冷漠。然后,我们在Qwen3-8B上应用稀疏自编码器,隔离单语义道德特征,并有针对性地重建其拓扑关系,以与真实的道德向量对齐。这种表征对齐自然改善了道德推理和细粒度,达到了在独立对抗性Flames基准测试中75%的成对胜率。最后,我们从经验主义哲学的角度阐述了当前干预方法的补救性质,认为内生对齐的人工智能可能需要从事后修正转变为主动培养。
cs.CL / 108 / 2603.15619

Mixture-of-Depths Attention

深度混合注意力
Zhu, Lianghui, Fang, Yuxin, Liao, Bencheng, Wang, Shijie, Cheng, Tianheng, Huang, Zilong, Chen, Chen, Wei, Lai, Zeng, Yutao, Wang, Ya, Lin, Yi, Li, Yu, Wang, Xinggang
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
Chinese Translation
深度扩展是大型语言模型(LLMs)的关键驱动因素。然而,随着LLMs的加深,它们常常遭遇信号衰减的问题:在浅层形成的信息特征在重复的残差更新中逐渐被稀释,使得在深层中更难以恢复。我们提出了深度混合注意力(Mixture-of-Depths Attention, MoDA),一种机制,允许每个注意力头关注当前层的序列键值对(KV pairs)以及来自前面层的深度KV对。我们进一步描述了一种硬件高效的MoDA算法,该算法解决了非连续内存访问模式,在序列长度为64K时实现了97.3%的FlashAttention-2效率。在1.5B参数模型上的实验表明,MoDA始终优于强基线。值得注意的是,它在10个验证基准上将平均困惑度提高了0.2,并在10个下游任务上将平均性能提高了2.11%,计算开销仅为3.7%的FLOPs。我们还发现,将MoDA与后归一化结合使用的性能优于与前归一化结合使用的性能。这些结果表明,MoDA是深度扩展的一个有前景的原语。代码已发布在 https://github.com/hustvl/MoDA 。
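A shape-level PyTorch sketch of the mechanism: queries attend jointly over the current layer's sequence K/V and K/V cached from preceding layers. Keys double as values and causal masking is omitted for brevity; the paper's actual formulation, per-head routing, and hardware-efficient kernel are more involved.

```python
import torch

def moda_attention(q, kv_current, kv_depth):
    """Attend over current-layer sequence KV plus depth KV from
    earlier layers (illustrative shapes, single head).
    q:          (batch, seq, d)
    kv_current: (batch, seq, d)        current-layer keys=values (toy)
    kv_depth:   (batch, depth, seq, d) per-layer cached keys=values
    """
    b, L, s, d = kv_depth.shape
    # Flatten the depth axis so earlier-layer states become extra keys.
    extra = kv_depth.permute(0, 2, 1, 3).reshape(b, s * L, d)
    keys = torch.cat([kv_current, extra], dim=1)      # (b, s + s*L, d)
    attn = torch.softmax(q @ keys.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ keys                                 # values = keys (toy)

q = torch.randn(2, 10, 32)
kv_c = torch.randn(2, 10, 32)
kv_d = torch.randn(2, 3, 10, 32)    # KV cached from 3 preceding layers
print(moda_attention(q, kv_c, kv_d).shape)   # torch.Size([2, 10, 32])
```

Giving deep layers direct access to shallow-layer states is what counteracts the dilution of early features that the abstract identifies as signal degradation.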